Hi Bela,

I think I found why you weren't seeing the warnings: the bench-log4j.xml in GitHub master is configured to log only to the log file (benchmark.log). If you add an <appender-ref ref="CONSOLE"/> you'll see the warnings on the console as well.
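For illustration, a minimal sketch of that change in log4j 1.2 XML syntax; the appender names and the pattern layout below are assumptions, not the actual contents of bench-log4j.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
    <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">

       <!-- existing appender writing to benchmark.log; the name "FILE" is assumed -->
       <appender name="FILE" class="org.apache.log4j.FileAppender">
          <param name="File" value="benchmark.log"/>
          <layout class="org.apache.log4j.PatternLayout">
             <param name="ConversionPattern" value="%d{HH:mm:ss,SSS} %-5p [%c{1}] (%t) %m%n"/>
          </layout>
       </appender>

       <!-- console appender so the warnings also show up on stdout -->
       <appender name="CONSOLE" class="org.apache.log4j.ConsoleAppender">
          <param name="Target" value="System.out"/>
          <layout class="org.apache.log4j.PatternLayout">
             <param name="ConversionPattern" value="%d{HH:mm:ss,SSS} %-5p [%c{1}] (%t) %m%n"/>
          </layout>
       </appender>

       <root>
          <priority value="INFO"/>
          <appender-ref ref="FILE"/>
          <!-- the extra reference that makes the warnings visible on the console -->
          <appender-ref ref="CONSOLE"/>
       </root>

    </log4j:configuration>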
I am now able to reproduce it pretty reliably, even from IntelliJ. I created a run
configuration for the Transactional class, set the module to infinispan-5.1-SNAPSHOT,
and added these JVM arguments:

-Dbench.loops=3000 -Dbench.writerThreads=33 -Dbench.readerThreads=10 -Dbench.dist=true -Dbench.transactional=true -Dbench.numkeys=500 -Dbench.payloadsize=10240 -Dbench.nodes=12 -Djava.net.preferIPv4Stack -Dlog4j.configuration=bench-log4j.xml

In Sanne's thread dump it was a ClusteredGetCommand that was waiting for a physical
address, but in my test runs it's usually a JOIN that's failing. And it seems to happen
because we got a discovery response from 3 nodes (that is our num_initial_members),
but none of them was the coordinator:

2012-02-02 08:04:41,672 TRACE [org.jgroups.protocols.PING] (main) discovery took 8 ms: responses: 3 total (2 servers (0 coord), 1 clients)
2012-02-02 08:04:41,672 TRACE [org.jgroups.protocols.pbcast.GMS] (main) localhost-47135: initial_mbrs are localhost-36539 localhost-30921
2012-02-02 08:04:41,672 DEBUG [org.jgroups.protocols.pbcast.GMS] (main) election results: {localhost-32436=2}
2012-02-02 08:04:41,672 DEBUG [org.jgroups.protocols.pbcast.GMS] (main) sending JOIN(localhost-47135) to localhost-32436
2012-02-02 08:04:41,673 TRACE [org.jgroups.protocols.UNICAST2] (main) localhost-47135: created connection to localhost-32436 (conn_id=0)
2012-02-02 08:04:41,673 TRACE [org.jgroups.protocols.UNICAST2] (main) localhost-47135 --> DATA(localhost-32436: #1, conn_id=0, first)
2012-02-02 08:04:41,673 TRACE [org.jgroups.protocols.UDP] (main) sending msg to localhost-32436, src=localhost-47135, headers are GMS: GmsHeader[JOIN_REQ]: mbr=localhost-47135, UNICAST2: DATA, seqno=1, first, UDP: [channel_name=ISPN]
2012-02-02 08:04:41,673 WARN [org.jgroups.protocols.UDP] (main) localhost-47135: no physical address for localhost-32436, dropping message

Cheers
Dan
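As a concrete illustration of the discovery tuning suggested further down in the quoted
thread (a higher num_initial_members and a lower timeout), this is roughly how the PING
element could look in bench-jgroups.xml; the attribute names exist on PING in JGroups
3.0.x, but the values are only an example, not a tested recommendation:

    <!-- Wait for up to one discovery response per benchmark node (12 here)
         instead of the default 3, but give up after at most 1 second. -->
    <PING timeout="1000" num_initial_members="12"/>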
On Wed, Feb 1, 2012 at 7:12 PM, Sanne Grinovero <sa...@infinispan.org> wrote:
> On 1 February 2012 16:40, Bela Ban <b...@redhat.com> wrote:
>> Your benchmark is giving me the creeps !
>
> Manik was the original author, I've only been adapting it slightly to
> identify performance issues. I wouldn't have used Maven either, but
> it's serving me well, especially since it turns out I have to
> frequently change the JGroups version ;-)
>
>> First, which version of JGroups / Infinispan does this use ? Second, is
>> there a way to start this in an IDE rather than through maven ? Third, I
>> don't think bench-jgroups.xml is picked up at all ! How do I make a change
>> to bench-jgroups.xml and have Transactional *use* it ?
>
> 1) The intention is to compare Infinispan versions, and see how we
> make progress, so the JGroups version is whatever each Infinispan
> version was released with, or in the case of a SNAPSHOT the version
> it's currently using.
>
> So, assuming you have a fresh snapshot of Infinispan 5.1.0-SNAPSHOT,
> it's JGroups 3.0.4.Final.
>
> 2) As I pointed out in the previous mail, I've reconfigured it for you
> to not pick up the bench-jgroups.xml but go with Infinispan's default
> UDP configuration.
> You have to comment out line 15 of bench.sh to use the bench-jgroups.xml,
> but then I can't reproduce the issue anymore.
>
>> This maven crap hides so much, I never know what's going on under the covers
>> !@$@ Do I have to do mvn install or mvn package when I make a change to
>> bench-jgroups.xml ?
>
> Aren't you using bench.sh, as I pointed out in my previous mail?
> It does mvn install before running it.
> But it's also configured to NOT use bench-jgroups.xml, but rather
> jgroups-udp.xml.
>
>> I don't think I'll ever switch to this f*cking piece of shit !
>
> I don't think Maven is to be blamed today! What's wrong?
>
> Anyway, one of the nice things of this little benchmark is exactly that
> it's a single class with a main method, so you can just import it into
> your IDE and run it. Any IDE will pick up the correct dependencies, thanks
> to Maven. Just note that if you do, it will use the default test properties
> hardcoded in the test class org.infinispan.benchmark.Transactional:
> please set the same environment variables as bench.sh does, unless you
> want to run a different configuration than mine.
>
> Just ping me for any doubt.
>
> Cheers,
> Sanne
>
>> Please ping me as soon as you've calmed down... :-)
>> Cheers,
>>
>> On 2/1/12 4:52 PM, Sanne Grinovero wrote:
>>>
>>> On 1 February 2012 15:18, Bela Ban <b...@redhat.com> wrote:
>>>>
>>>> On 2/1/12 10:25 AM, Dan Berindei wrote:
>>>>
>>>>>> That's not the way it works; at startup of F, it sends its IP address
>>>>>> with the discovery request. Everybody returns its IP address with the
>>>>>> discovery response, so even though we have F only talking to A (the
>>>>>> coordinator) initially, F will also know the IP addresses of A, B, C, D
>>>>>> and E.
>>>>>
>>>>> Ok, I stand corrected... since we start all the nodes on the same
>>>>> thread, each of them should reply to the discovery request of the next
>>>>> nodes.
>>>>
>>>> Hmm, can you reproduce this every time ? If so, can you send me the
>>>> program so I can run it here ?
>>>>
>>>>> However, num_initial_members was set to 3 (the Infinispan default).
>>>>> Could that make PING not wait for all the responses? If it's like
>>>>> that, then I suggest we set a (much) higher num_initial_members and a
>>>>> lower timeout in the default configuration.
>>>>
>>>> Yes, the discovery could return quickly, but the responses would still
>>>> be processed even if they were received later, so I don't think that's
>>>> the issue.
>>>>
>>>> The initial discovery should discover *all* IP addresses; later
>>>> triggering a discovery because an IP address wasn't found should always
>>>> be the exceptional case !
>>>>
>>>> If you start members in turn, then they should easily form a cluster and
>>>> not even merge. Here's what can happen on a merge:
>>>> - The view is A|1={A,B}, both A and B have IP addresses for A and B
>>>> - The view splits into A|2={A} and B|2={B}
>>>> - A now marks B's IP address as removable and B marks A's IP address as
>>>>   removable
>>>> - If the cache grows to over 500 entries
>>>>   (TP.logical_addr_cache_max_size) or TP.logical_addr_cache_expiration
>>>>   milliseconds elapse (whichever comes first), the entries marked as
>>>>   removable are removed
>>>> - If, *before that*, the merge view A|3={A,B} is installed, A unmarks B
>>>>   and B unmarks A, so the entries won't get removed
>>>>
>>>> So a hypothesis of how those IP addresses got removed could be that the
>>>> cluster had a couple of merges that didn't heal for 2 minutes (?). Hard
>>>> to believe though...
>>>>
>>>> We have to get to the bottom of this, so it would be great if you had a
>>>> program that reproduces this that I could run myself. The main question
>>>> is why the IP address for the target is gone and/or why the IP address
>>>> wasn't received in the first place.
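To put numbers on the transport cache described in the quoted text above, this is
roughly how those two attributes appear on the transport element of a JGroups 3.0.x
stack. Other UDP attributes are omitted for brevity; 500 is the default mentioned
above, while the 2-minute expiration is only an assumed example value:

    <!-- TP.logical_addr_cache: UUID -> physical address mappings. Entries
         marked 'removable' are purged once the cache exceeds the max size
         or the expiration (ms) elapses, whichever comes first. -->
    <UDP logical_addr_cache_max_size="500"
         logical_addr_cache_expiration="120000"/>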
>>>> In any case, replacing MERGE2 with MERGE3 might help a bit, as MERGE3
>>>> [1] periodically broadcasts each member's logical address, logical name
>>>> and IP address: "An INFO message carries the logical name and physical
>>>> address of a member. Compared to MERGE2, this allows us to immediately
>>>> send messages to newly merged members, and not have to solicit this
>>>> information first." (copied from the documentation)
>>>>
>>>>>>> Note that everything is blocked at this point: we
>>>>>>> won't send another message in the entire cluster until we get the
>>>>>>> physical address.
>>>>
>>>> Understood. Let me see if I can block sending of the message for a max
>>>> time (say 2 seconds) until I get the IP address. Not very nice, and I
>>>> prefer a different approach (plus we need to see why this happens in the
>>>> first place anyway)...
>>>>
>>>>>> As I said, this is an exceptional case, probably caused by Sanne
>>>>>> starting 12 channels inside the same JVM, at the same time, therefore
>>>>>> causing a traffic spike, which results in dropped discovery requests or
>>>>>> responses.
>>>>>
>>>>> Bela, we create the caches on a single thread, so we never have more
>>>>> than one node joining at the same time.
>>>>> At most we could have some extra activity if one node can't join the
>>>>> existing cluster and starts a separate partition, but hardly enough to
>>>>> cause congestion.
>>>>
>>>> Hmm, does indeed not sound like an issue...
>>>>
>>>>>> After that, when F wants to talk to C, it asks the cluster for C's IP
>>>>>> address, and that should be a few ms at most.
>>>>>
>>>>> Ok, so when F wanted to send the ClusteredGetCommand request to C,
>>>>> PING got the physical address right away. But the ClusteredGetCommand
>>>>> had to wait for STABLE to kick in and for C to ask for retransmission
>>>>> (because we didn't send any other messages).
>>>>
>>>> Yep. Before I implement some blocking until we have the IP address, or a
>>>> timeout elapses, I'd like to try to get to the bottom of this problem
>>>> first !
>>>>
>>>>> Maybe *we* should use RSVP for our ClusteredGetCommands, since those
>>>>> can never block... Actually, we don't want to retransmit the request
>>>>> if we already got a response from another node, so it would be best if
>>>>> we could ask for retransmission of a particular request explicitly ;-)
>>>>
>>>> I'd rather implement the blocking approach above ! :-)
>>>>
>>>>> I wonder if we could also decrease desired_avg_gossip and
>>>>> stability_delay in STABLE. After all, an extra STABLE round can't slow
>>>>> us down when we're not doing anything, and when we are busy we're going
>>>>> to hit the max_bytes limit much sooner than the desired_avg_gossip time
>>>>> limit anyway.
>>>>
>>>> I don't think this is a good idea, as it will generate more traffic. The
>>>> stable task is not skipped when we have a lot of traffic, so this will
>>>> compound the issue.
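For reference, a sketch of how the two stack-level ideas discussed in the quoted
exchange above could look in bench-jgroups.xml: swapping MERGE2 for MERGE3, and the
STABLE timing Dan proposes (and Bela argues against). The attribute names are real
JGroups protocol properties, but the values here are illustrative assumptions only:

    <!-- MERGE3 instead of MERGE2: INFO messages periodically carry each
         member's logical name and physical address, so newly merged members
         can be addressed immediately. -->
    <MERGE3 min_interval="10000" max_interval="30000"/>

    <!-- STABLE tuning as discussed: lower desired_avg_gossip / stability_delay
         would trigger stability (and hence retransmission) rounds sooner, at
         the cost of more background traffic. -->
    <pbcast.STABLE stability_delay="500"
                   desired_avg_gossip="20000"
                   max_bytes="400000"/>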
>>>>>>> I'm also not sure what to make of these lines:
>>>>>>>
>>>>>>>>>> [org.jgroups.protocols.UDP] sanne-55119: no physical address for
>>>>>>>>>> sanne-53650, dropping message
>>>>>>>>>> [org.jgroups.protocols.pbcast.GMS] JOIN(sanne-55119) sent to
>>>>>>>>>> sanne-53650 timed out (after 3000 ms), retrying
>>>>>>>
>>>>>>> It appears that sanne-55119 knows the logical name of sanne-53650, and
>>>>>>> the fact that it's coordinator, but not its physical address.
>>>>>>> Shouldn't all of this information have arrived at the same time?
>>>>>>
>>>>>> Hmm, correct. However, the logical names are kept in (a static)
>>>>>> UUID.cache and the IP addresses in TP.logical_addr_cache.
>>>>>
>>>>> Ah, so if we have 12 nodes in the same VM they automatically know each
>>>>> other's logical name - they don't need PING at all!
>>>>
>>>> Yes. Note that logical names are not the problem; even if we evict some
>>>> logical name from the cache (and we do this only for removed members),
>>>> JGroups will still work as it only needs UUIDs and IP addresses.
>>>>
>>>>> Does the logical cache get cleared on channel stop? I think that would
>>>>> explain another weird thing I was seeing in the test suite logs,
>>>>> sometimes everyone in a cluster would suddenly forget everyone else's
>>>>> logical name and start logging UUIDs.
>>>>
>>>> On a view change, we remove all entries which are *not* in the new view.
>>>> However, 'removing' is again simply marking those members as
>>>> 'removable', and only if the cache grows beyond 500
>>>> (-Djgroups.uuid_cache.max_entries=500) entries will all entries older
>>>> than 5 seconds (-Djgroups.uuid_cache.max_age=5000) be removed. (There is
>>>> no separate reaper task running for this).
>>>>
>>>> So, yes, this can happen, but on the next discovery round, we'll have
>>>> the correct values. Again, as I said, UUID.cache is not as important as
>>>> TP.logical_addr_cache.
>>>>
>>>>> This is running the Transactional benchmark, so it would be simpler if
>>>>> we enabled PING trace in the configuration and disabled it before the
>>>>> actual benchmark starts. I'm going to try it myself :)
>>>>
>>>> How do you run 12 instances ? Did you change something in the config ?
>>>> I'd be interested in trying the *exact* same config you're running, to
>>>> see what's going on !
>>>
>>> Pushed it for you, committing the exact configuration changes as well:
>>>
>>> git clone git://github.com/Sanne/InfinispanStartupBenchmark.git
>>> cd InfinispanStartupBenchmark
>>> git co my
>>> sh bench.sh
>>>
>>> If you look into bench.sh (as it is at
>>> https://github.com/Sanne/InfinispanStartupBenchmark/blob/my/bench.sh )
>>> the lines 9, 15, 28 should be the most interesting.
>>>
>>> You need to run with Infinispan's default configuration to reproduce the
>>> issue, but I wouldn't mind you commenting on
>>> https://github.com/Sanne/InfinispanStartupBenchmark/blob/my/benchmark/src/main/resources/bench-jgroups.xml
>>> as well. That's what I'm now using as default for further performance
>>> tests.
>>>
>>> Cheers,
>>> Sanne
>>>
>>>> [1] http://www.jgroups.org/manual-3.x/html/protlist.html#MERGE3
>>>>
>>>> --
>>>> Bela Ban
>>>> Lead JGroups (http://www.jgroups.org)
>>>> JBoss / Red Hat
>>
>> --
>> Bela Ban
>> Lead JGroups (http://www.jgroups.org)
>> JBoss / Red Hat

_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev