Re: [infinispan-dev] DIST.retrieveFromRemoteSource
On Sat, Feb 4, 2012 at 5:23 PM, Bela Ban b...@redhat.com wrote:

No, Socket.send() is not a hotspot in the Java code. On the networking level, with UDP datagrams, a packet is placed into a buffer and send() returns after the packet has been placed into the buffer successfully. Some network stacks simply discard the packet if the buffer is full, I assume. With TCP, send() blocks until the packet has been placed into the buffer and an ack has been received. If the receiver can't keep up with the sending rate, it'll reduce the window to 0, so send() blocks until the receiver catches up. So if you send 2 requests in parallel over TCP, both threads would block on the send. So I don't think parallelization would make sense here, unless of course you're doing something else, such as serialization, lock acquisition etc...

You're probably right... When I sent the email I had only seen that DatagramSocket.send takes 0.4% of the get worker's total time (which is huge considering that 99.3% is spent waiting for the remote node's response). I didn't notice that it's synchronized and that the actual sending only takes half of that - sending the messages in parallel would create more lock contention on the socket *and* in JGroups.

On the other hand, we're sending the messages to two different nodes, on two different sockets, so sending in parallel *may* improve the response time in scenarios with few worker threads. It's certainly not worth making our heavy-load performance worse, but I thought it was worth mentioning.

Anyway, my primary motivation for this question was that I believe we could use GroupRequest instead of our FutureCollator to send our commands in parallel. They both do essentially the same thing, except FutureCollator has an extra indirection layer because it uses UnicastRequest. If there is any performance discrepancy at the moment between GroupRequest and FutureCollator+UnicastRequest, I don't think there is any fundamental reason why we can't fix GroupRequest to be just as efficient as FutureCollator+UnicastRequest.

I think I just saw one such fix: in RequestCorrelator.sendRequest, if anycasting is enabled then it's making a copy of the buffer for each target member. I don't think that is necessary at all; in fact I think it should reuse both the buffer and the headers and only change the destination address.

Cheers
Dan

On 2/4/12 3:43 PM, Manik Surtani wrote:

Is that a micro-optimisation? Do we know that socket.send() really is a hotspot?

On 1 Feb 2012, at 00:11, Dan Berindei wrote:

It's true, but then JGroups' GroupRequest does exactly the same thing... socket.send() takes some time too; I thought sending the request in parallel would mean calling socket.send() on a separate thread for each recipient.

Cheers
Dan

On Fri, Jan 27, 2012 at 6:41 PM, Manik Surtani ma...@jboss.org wrote:

Doesn't setBlockForResults(false) mean that we're not waiting on a response, and can proceed to the next message to the next recipient?

On 27 Jan 2012, at 16:34, Dan Berindei wrote:

Manik, Bela, I think we send the requests sequentially as well. In ReplicationTask.call:

    for (Address a : targets) {
        NotifyingFuture<Object> f = sendMessageWithFuture(constructMessage(buf, a), opts);
        futureCollator.watchFuture(f, a);
    }

In MessageDispatcher.sendMessageWithFuture:

    UnicastRequest<T> req = new UnicastRequest<T>(msg, corr, dest, options);
    req.setBlockForResults(false);
    req.execute();

Did we use to send each request on a separate thread?

Cheers
Dan

On Fri, Jan 27, 2012 at 1:21 PM, Bela Ban b...@redhat.com wrote:

yes.
On 1/27/12 12:13 PM, Manik Surtani wrote:

On 25 Jan 2012, at 09:42, Bela Ban wrote:

No, parallel unicasts will be faster, as an anycast to A,B,C sends the unicasts sequentially.

Is this still the case in JG 3.x?

--
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat

--
Manik Surtani
ma...@jboss.org
twitter.com/maniksurtani
Lead, Infinispan
http://www.infinispan.org
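A minimal sketch of the send-then-collate pattern discussed in this thread, assuming the JGroups 3.x MessageDispatcher API (sendMessageWithFuture returns a NotifyingFuture without blocking); the method shape and the plain HashMap collator are illustrative stand-ins, not Infinispan's actual FutureCollator:

    import java.util.*;
    import org.jgroups.Address;
    import org.jgroups.Message;
    import org.jgroups.blocks.MessageDispatcher;
    import org.jgroups.blocks.RequestOptions;
    import org.jgroups.util.NotifyingFuture;

    public class AnycastSketch {
        /** Dispatch one unicast per target without blocking, then wait for all replies. */
        static Map<Address, Object> anycast(MessageDispatcher disp, byte[] buf,
                                            Collection<Address> targets,
                                            RequestOptions opts) throws Exception {
            Map<Address, NotifyingFuture<Object>> futures =
                new HashMap<Address, NotifyingFuture<Object>>();
            for (Address target : targets) {
                Message msg = new Message(target, null, buf);
                // returns immediately; the request is correlated, not awaited
                futures.put(target, disp.<Object>sendMessageWithFuture(msg, opts));
            }
            Map<Address, Object> responses = new HashMap<Address, Object>();
            for (Map.Entry<Address, NotifyingFuture<Object>> e : futures.entrySet())
                responses.put(e.getKey(), e.getValue().get()); // block only after all sends
            return responses;
        }
    }

Note that the sends themselves still happen sequentially on the caller's thread - only the waiting is overlapped - which is exactly the property the thread is debating.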
Re: [infinispan-dev] again: no physical address
On 2/4/12 5:53 PM, Manik Surtani wrote:

On 2 Feb 2012, at 07:53, Bela Ban wrote:

I can also reproduce it by now, in JGroups: I simply create 12 members in a loop... Don't need the bombastic Transactional test.

Yup; Transactional was made to benchmark and profile 2-phase transactions - as the name suggests! ;) - in Infinispan. Not JGroups.

Looking into it.

So this is what was addressed in 3.0.5?

Yes, there are 2 changes which reduce the chance of missing IP addresses. The workaround is to set PING.num_initial_members to 12.

--
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat
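For reference, a sketch of applying that workaround programmatically, assuming JGroups 3.x's generic Protocol.setValue property setter (setting num_initial_members in the XML stack definition works just as well):

    import org.jgroups.JChannel;
    import org.jgroups.protocols.PING;

    public class PingWorkaround {
        public static void main(String[] args) throws Exception {
            JChannel ch = new JChannel("udp.xml"); // any stack that includes PING
            PING ping = (PING) ch.getProtocolStack().findProtocol(PING.class);
            ping.setValue("num_initial_members", 12); // wait for up to 12 discovery rsps
            ch.connect("demo-cluster");               // discovery runs on connect
        }
    }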
Re: [infinispan-dev] DIST.retrieveFromRemoteSource
On 2/5/12 9:44 AM, Dan Berindei wrote:

On Sat, Feb 4, 2012 at 5:23 PM, Bela Ban b...@redhat.com wrote:

No, Socket.send() is not a hotspot in the Java code. On the networking level, with UDP datagrams, a packet is placed into a buffer and send() returns after the packet has been placed into the buffer successfully. Some network stacks simply discard the packet if the buffer is full, I assume. With TCP, send() blocks until the packet has been placed into the buffer and an ack has been received. If the receiver can't keep up with the sending rate, it'll reduce the window to 0, so send() blocks until the receiver catches up. So if you send 2 requests in parallel over TCP, both threads would block on the send. So I don't think parallelization would make sense here, unless of course you're doing something else, such as serialization, lock acquisition etc...

You're probably right... When I sent the email I had only seen that DatagramSocket.send takes 0.4% of the get worker's total time (which is huge considering that 99.3% is spent waiting for the remote node's response). I didn't notice that it's synchronized and that the actual sending only takes half of that - sending the messages in parallel would create more lock contention on the socket *and* in JGroups.

On the other hand, we're sending the messages to two different nodes, on two different sockets, so sending in parallel *may* improve the response time in scenarios with few worker threads. It's certainly not worth making our heavy-load performance worse, but I thought it was worth mentioning.

If you think this makes a difference, why don't you make a temporary code change and measure its effect on performance ?

Anyway, my primary motivation for this question was that I believe we could use GroupRequest instead of our FutureCollator to send our commands in parallel. They both do essentially the same thing, except FutureCollator has an extra indirection layer because it uses UnicastRequest.

Ah, ok. Yes, I think that 1 GroupRequest of 2 might be more efficient than 2 UnicastRequests for sending an 'anycast' message to 2 members. Perhaps the marshalling is done only once (I'd have to confirm that though) and we're only creating 1 data structure and adding it to a hashmap (request/response correlation)... Certainly worth giving a try...

If there is any performance discrepancy at the moment between GroupRequest and FutureCollator+UnicastRequest, I don't think there is any fundamental reason why we can't fix GroupRequest to be just as efficient as FutureCollator+UnicastRequest.

You're assuming that FutureCollator+UnicastRequest is faster than GroupRequest; what are you basing your assumption on ? As I outlined above, I'd rather assume the opposite, although I don't know FutureCollator. Maybe we should have a chat next week to go through the code that sends an anycast to 2 members, wdyt ?

I think I just saw one such fix: in RequestCorrelator.sendRequest, if anycasting is enabled then it's making a copy of the buffer for each target member. I don't think that is necessary at all; in fact I think it should reuse both the buffer and the headers and only change the destination address.

No, this *is* necessary, I don't just make copies because I think this is fun !! :-) Remember that a message might be retransmitted, so it is placed into a retransmit buffer. If M1 has destination A and M2 has destination B, and we send M1 first (to A), then change M1's destination to B, and send it, everything is fine.
However, if we later get a retransmit request from B, we'd resend the message to A instead ! This is just 1 example; modification of headers is another one. Note that the copy does *not* copy the buffer (payload) itself, but only references it, so this is fast. Of course, nobody is supposed to modify the contents of the buffer itself...

--
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat
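A sketch of the per-target copy Bela describes, using the JGroups 3.x Message API: each anycast target gets its own Message object, so retransmission entries and header changes stay independent, while the payload buffer is shared by reference:

    import java.util.List;
    import org.jgroups.Address;
    import org.jgroups.JChannel;
    import org.jgroups.Message;

    public class AnycastCopySketch {
        /** One Message per target; copy(true) shares the payload by reference. */
        static void send(JChannel ch, byte[] payload, List<Address> targets) throws Exception {
            Message template = new Message(null, null, payload);
            for (Address target : targets) {
                Message copy = template.copy(true); // new metadata/headers, same buffer
                copy.setDest(target);               // only the destination differs
                ch.send(copy);                      // each copy retransmits independently
            }
        }
    }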
Re: [infinispan-dev] Proposal: ISPN-1394 Manual rehashing in 5.2
On Sat, Feb 4, 2012 at 4:49 PM, Manik Surtani ma...@jboss.org wrote:

On 1 Feb 2012, at 12:23, Dan Berindei wrote:

Bela, you're right, this is essentially what we talked about in Lisbon: https://community.jboss.org/wiki/AsymmetricCachesAndManualRehashingDesign

For joins I actually started working on a policy of coalescing joins that happen one after the other in a short time interval. The current implementation is very primitive, as I shifted focus to stability, but it does coalesce joins 1 second after another join started (or while that join is still running).

I don't quite agree with Sanne's assessment that it's fine for getCache() to block for 5 minutes until the administrator allows the new node to join. We should modify startCaches() instead to signal to the coordinator that we are ready to receive data for one or all of the defined caches, and wait with a customizable time limit until the caches have properly joined the cluster.

The getCache() timeout should not be increased at all. Instead I would propose that getCache() returns a functional cache immediately, even if the cache didn't receive any data, and it works solely as an L1 cache until the administrator allows it to join. I'd even make it possible to designate a cache as an L1-only cache, so it's never an owner for any key.

I presume this would be encoded in the Address? That would make sense for a node permanently designated as an L1 node. But then how would this work for a node temporarily acting as L1 only, until it has been allowed to join? Change the Address instance on the fly? A delegating Address? :/

Nope, not in the Address. Since we now have virtual cache views, every node has to explicitly request to join each cache. It would be quite easy to add an L1-only flag to the join request, and nodes with that flag would never be included in the proper cache view. They would however get a copy of the cache view on join and on every cache view installation, so they could update their CH and send requests to the proper owners.

Nodes acting temporarily as L1-only would send a normal join request, but they would also receive a copy of the cache view, and getCache() would return immediately instead of waiting for the node to receive state. When the coordinator finally installs a cache view that includes them, they will perform the initial state transfer as they do now.

For leaves, the main problem is that every node has to compute the same primary owner for a key, at all times. So we need a 2PC cache view installation immediately after any leave to ensure that every node determines the primary owner in the same way - we can't coalesce or postpone leaves.

Yes, manual rehashing would probably just be for joins. Controlled shutdown in itself is manual, and crashes, well, need to be dealt with immediately IMO.

We could extend the policy for crashes as well, by adding a minNumOwners setting and only triggering an automatic rehash when there is a segment on the hash wheel with <= minNumOwners owners. We would have a properly installed CH, that guarantees at some point in the past each key had numOwners owners, and a filter on top of it that removes any leavers from the result of DM.locate(), DM.getPrimaryLocation() etc. It would probably undo our recent optimizations around locate and getPrimaryLocation, so it's slowing the normal case (without any leavers) in order to make the exceptional case (organized shutdown of a part of the cluster) faster.
The question is how big the cluster has to get before the exceptional case becomes common enough that it's worth optimizing for...

For 5.2 I will try to decouple the cache view installation from the state transfer, so in theory we will be able to coalesce/postpone the state transfer for leaves as well (https://issues.jboss.org/browse/ISPN-1827). I kind of need it for non-blocking state transfer, because with the current implementation a leave forces us to cancel any state transfer in progress and restart with the updated cache view - a state transfer rollback will be very expensive with NBST.

Erik does raise a valid point - with TACH, if we bring up a node with a different siteId, then it will be an owner for all the keys in the cache. That node probably isn't provisioned to hold all the keys, so it would very likely run out of memory or evict much of the data. I guess that makes it a 5.2 issue?

Yes. Shutting down a site should be possible even with what we have now - just insert a DISCARD protocol in the JGroups stack of all the nodes that are shutting down, and when FD finally times out on the nodes in the surviving datacenter they won't have any state transfer to do (although it may cause a few failed state transfer attempts). We could make it simpler though.

Cheers
Dan

On Tue, Jan 31, 2012 at 6:21 PM, Erik Salter an1...@hotmail.com wrote:

...such as bringing up a backup data center.
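A minimal sketch of the "filter on top of the installed CH" idea above; ConsistentHash, locate() and the class shape are simplified assumptions for illustration, not Infinispan's actual DistributionManager API:

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;

    public class LeaverFilteringCH {
        interface Address {}
        interface ConsistentHash {
            List<Address> locate(Object key, int numOwners);
        }

        private final ConsistentHash installedCH;     // from the last installed cache view
        private volatile Collection<Address> leavers; // crashed/left since that view

        LeaverFilteringCH(ConsistentHash installedCH, Collection<Address> leavers) {
            this.installedCH = installedCH;
            this.leavers = leavers;
        }

        /** Owners per the installed CH, minus leavers - no CH reinstallation needed. */
        List<Address> locate(Object key, int numOwners) {
            List<Address> owners = new ArrayList<Address>(installedCH.locate(key, numOwners));
            owners.removeAll(leavers); // the per-lookup cost Dan is worried about
            return owners;
        }
    }

Every lookup now pays for a copy and a removeAll, which is the "slowing the normal case" trade-off mentioned above.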
Re: [infinispan-dev] DIST.retrieveFromRemoteSource
On Sun, Feb 5, 2012 at 12:24 PM, Bela Ban b...@redhat.com wrote:

On 2/5/12 9:44 AM, Dan Berindei wrote:

On Sat, Feb 4, 2012 at 5:23 PM, Bela Ban b...@redhat.com wrote:

No, Socket.send() is not a hotspot in the Java code. On the networking level, with UDP datagrams, a packet is placed into a buffer and send() returns after the packet has been placed into the buffer successfully. Some network stacks simply discard the packet if the buffer is full, I assume. With TCP, send() blocks until the packet has been placed into the buffer and an ack has been received. If the receiver can't keep up with the sending rate, it'll reduce the window to 0, so send() blocks until the receiver catches up. So if you send 2 requests in parallel over TCP, both threads would block on the send. So I don't think parallelization would make sense here, unless of course you're doing something else, such as serialization, lock acquisition etc...

You're probably right... When I sent the email I had only seen that DatagramSocket.send takes 0.4% of the get worker's total time (which is huge considering that 99.3% is spent waiting for the remote node's response). I didn't notice that it's synchronized and that the actual sending only takes half of that - sending the messages in parallel would create more lock contention on the socket *and* in JGroups.

On the other hand, we're sending the messages to two different nodes, on two different sockets, so sending in parallel *may* improve the response time in scenarios with few worker threads. It's certainly not worth making our heavy-load performance worse, but I thought it was worth mentioning.

If you think this makes a difference, why don't you make a temporary code change and measure its effect on performance ?

I will give it a try, it's just that writing emails is a bit easier ;-)

Anyway, my primary motivation for this question was that I believe we could use GroupRequest instead of our FutureCollator to send our commands in parallel. They both do essentially the same thing, except FutureCollator has an extra indirection layer because it uses UnicastRequest.

Ah, ok. Yes, I think that 1 GroupRequest of 2 might be more efficient than 2 UnicastRequests for sending an 'anycast' message to 2 members. Perhaps the marshalling is done only once (I'd have to confirm that though) and we're only creating 1 data structure and adding it to a hashmap (request/response correlation)... Certainly worth giving a try...

We are only serializing once with the FutureCollator approach, and we don't copy the buffer either.

If there is any performance discrepancy at the moment between GroupRequest and FutureCollator+UnicastRequest, I don't think there is any fundamental reason why we can't fix GroupRequest to be just as efficient as FutureCollator+UnicastRequest.

You're assuming that FutureCollator+UnicastRequest is faster than GroupRequest; what are you basing your assumption on ? As I outlined above, I'd rather assume the opposite, although I don't know FutureCollator.

I did say *if*...

Maybe we should have a chat next week to go through the code that sends an anycast to 2 members, wdyt ?

Sounds good, we should also talk about how best to implement staggered get requests.

I think I just saw one such fix: in RequestCorrelator.sendRequest, if anycasting is enabled then it's making a copy of the buffer for each target member. I don't think that is necessary at all; in fact I think it should reuse both the buffer and the headers and only change the destination address.
No, this *is* necessary, I don't just make copies because I think this is fun !! :-) Remember that a message might be retransmitted, so it is placed into a retransmit buffer. If M1 has destination A and M2 has destination B, and we send M1 first (to A), then change M1's destination to B, and send it, everything is fine. However, if we later get a retransmit request from B, we'd resend the message to A instead ! This is just 1 example; modification of headers is another one. Note that the copy does *not* copy the buffer (payload) itself, but only references it, so this is fast. Of course, nobody is supposed to modify the contents of the buffer itself...

I wasn't clear enough, but I didn't mean we should reuse the entire Message object. I meant we should copy the Message but not the buffer or the headers. I see now that protocols may be adding new headers, so it wouldn't be safe to reuse the headers collection.

I think this line in RequestCorrelator.sendRequest (RequestCorrelator.java:152) means that the contents of the buffer are copied into the new message, not just the buffer reference:

    Message copy=msg.copy(true);

Cheers
Dan
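Staggered gets, mentioned above as a follow-up topic, could look roughly like this: ask the first owner, and only fan the request out to the second owner if the first doesn't answer within a short stagger delay. Everything here (sendGet, the polling loop) is an illustrative assumption, not Infinispan or JGroups API:

    import java.util.List;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public abstract class StaggeredGetSketch {
        interface Address {}

        /** Hypothetical async helper: sends a GET to one owner, returns a future reply. */
        protected abstract Future<Object> sendGet(Address owner, Object key);

        public Object staggeredGet(Object key, List<Address> owners, long staggerMillis)
                throws Exception {
            Future<Object> first = sendGet(owners.get(0), key);
            try {
                return first.get(staggerMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException slow) {
                // First owner is slow: involve the second owner, first reply wins.
                Future<Object> second = sendGet(owners.get(1), key);
                while (true) {
                    if (first.isDone())  return first.get();
                    if (second.isDone()) return second.get();
                    Thread.sleep(1); // crude wait; a real impl would use completion callbacks
                }
            }
        }
    }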
Re: [infinispan-dev] DIST.retrieveFromRemoteSource
On Sun, Feb 5, 2012 at 5:56 PM, Bela Ban b...@redhat.com wrote:

On 2/5/12 4:40 PM, Dan Berindei wrote:

Remember that a message might be retransmitted, so it is placed into a retransmit buffer. If M1 has destination A and M2 has destination B, and we send M1 first (to A), then change M1's destination to B, and send it, everything is fine. However, if we later get a retransmit request from B, we'd resend the message to A instead ! This is just 1 example; modification of headers is another one. Note that the copy does *not* copy the buffer (payload) itself, but only references it, so this is fast. Of course, nobody is supposed to modify the contents of the buffer itself...

I wasn't clear enough, but I didn't mean we should reuse the entire Message object. I meant we should copy the Message but not the buffer or the headers. I see now that protocols may be adding new headers, so it wouldn't be safe to reuse the headers collection.

I think this line in RequestCorrelator.sendRequest (RequestCorrelator.java:152) means that the contents of the buffer are copied into the new message, not just the buffer reference:

    Message copy=msg.copy(true);

No, this does *not* copy the buffer, but simply references the same buffer.

Aha, I thought copy_buffer == true meant "copy the contents" and copy_buffer == false meant "share the contents". I see copy_buffer == true actually means "copy the reference, share the contents", and copy_buffer == false means "don't copy anything".

I will modify our CommandAwareRpcDispatcher to use GroupRequest and see how they compare, then we can continue this discussion with the results.

Cheers
Dan
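To spell out those semantics, a small sketch against the JGroups 3.x Message API; the asserts encode my reading of the thread (run with -ea to enable them):

    import org.jgroups.Message;

    public class CopySemantics {
        public static void main(String[] args) {
            byte[] payload = {1, 2, 3};
            Message m1 = new Message(null, null, payload);
            Message m2 = m1.copy(true);  // copy_buffer == true: share the payload by reference
            Message m3 = m1.copy(false); // copy_buffer == false: the copy carries no payload
            assert m2.getRawBuffer() == m1.getRawBuffer(); // same array, no deep copy
            assert m3.getRawBuffer() == null;
        }
    }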
Re: [infinispan-dev] Proposal: ISPN-1394 Manual rehashing in 5.2
On 5 Feb 2012, at 16:24, Dan Berindei wrote:

On Sat, Feb 4, 2012 at 4:49 PM, Manik Surtani ma...@jboss.org wrote:

On 1 Feb 2012, at 12:23, Dan Berindei wrote:

Bela, you're right, this is essentially what we talked about in Lisbon: https://community.jboss.org/wiki/AsymmetricCachesAndManualRehashingDesign

For joins I actually started working on a policy of coalescing joins that happen one after the other in a short time interval. The current implementation is very primitive, as I shifted focus to stability, but it does coalesce joins 1 second after another join started (or while that join is still running).

I don't quite agree with Sanne's assessment that it's fine for getCache() to block for 5 minutes until the administrator allows the new node to join. We should modify startCaches() instead to signal to the coordinator that we are ready to receive data for one or all of the defined caches, and wait with a customizable time limit until the caches have properly joined the cluster.

The getCache() timeout should not be increased at all. Instead I would propose that getCache() returns a functional cache immediately, even if the cache didn't receive any data, and it works solely as an L1 cache until the administrator allows it to join. I'd even make it possible to designate a cache as an L1-only cache, so it's never an owner for any key.

I presume this would be encoded in the Address? That would make sense for a node permanently designated as an L1 node. But then how would this work for a node temporarily acting as L1 only, until it has been allowed to join? Change the Address instance on the fly? A delegating Address? :/

Nope, not in the Address. Since we now have virtual cache views, every node has to explicitly request to join each cache. It would be quite easy to add an L1-only flag to the join request, and nodes with that flag would never be included in the proper cache view. They would however get a copy of the cache view on join and on every cache view installation, so they could update their CH and send requests to the proper owners.

Nodes acting temporarily as L1-only would send a normal join request, but they would also receive a copy of the cache view, and getCache() would return immediately instead of waiting for the node to receive state. When the coordinator finally installs a cache view that includes them, they will perform the initial state transfer as they do now.

Makes sense.

For leaves, the main problem is that every node has to compute the same primary owner for a key, at all times. So we need a 2PC cache view installation immediately after any leave to ensure that every node determines the primary owner in the same way - we can't coalesce or postpone leaves.

Yes, manual rehashing would probably just be for joins. Controlled shutdown in itself is manual, and crashes, well, need to be dealt with immediately IMO.

We could extend the policy for crashes as well, by adding a minNumOwners setting and only triggering an automatic rehash when there is a segment on the hash wheel with <= minNumOwners owners. We would have a properly installed CH, that guarantees at some point in the past each key had numOwners owners, and a filter on top of it that removes any leavers from the result of DM.locate(), DM.getPrimaryLocation() etc. It would probably undo our recent optimizations around locate and getPrimaryLocation, so it's slowing the normal case (without any leavers) in order to make the exceptional case (organized shutdown of a part of the cluster) faster.
The question is how big the cluster has to get before the exceptional case becomes common enough that it's worth optimizing for…

Well, we could just leave the crashed node in the owners list. When any caller attempts to do something with this list, i.e., perform an RPC, don't we check again at that time and remove leavers from an RPC recipient list?

Cheers
Manik

--
Manik Surtani
ma...@jboss.org
twitter.com/maniksurtani
Lead, Infinispan
http://www.infinispan.org
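For concreteness, the L1-only join flag proposed earlier in this thread might take a shape like this; JoinRequest and its fields are hypothetical, since the thread doesn't show the actual virtual-cache-view classes:

    import java.io.Serializable;

    /** Hypothetical join request for a virtual cache view. */
    public class JoinRequest implements Serializable {
        final String cacheName;
        final String joinerAddress; // the joining node, simplified to a String here
        final boolean l1Only;       // true: receive cache views for routing, never own keys

        public JoinRequest(String cacheName, String joinerAddress, boolean l1Only) {
            this.cacheName = cacheName;
            this.joinerAddress = joinerAddress;
            this.l1Only = l1Only;
        }
    }

The coordinator would skip l1Only joiners when computing the next cache view, but still broadcast each installed view to them so they can route requests to the proper owners.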
Re: [infinispan-dev] Proposal: ISPN-1394 Manual rehashing in 5.2
On Mon, Feb 6, 2012 at 3:30 AM, Manik Surtani ma...@jboss.org wrote:

On 5 Feb 2012, at 16:24, Dan Berindei wrote:

On Sat, Feb 4, 2012 at 4:49 PM, Manik Surtani ma...@jboss.org wrote:

On 1 Feb 2012, at 12:23, Dan Berindei wrote:

Bela, you're right, this is essentially what we talked about in Lisbon: https://community.jboss.org/wiki/AsymmetricCachesAndManualRehashingDesign

For joins I actually started working on a policy of coalescing joins that happen one after the other in a short time interval. The current implementation is very primitive, as I shifted focus to stability, but it does coalesce joins 1 second after another join started (or while that join is still running).

I don't quite agree with Sanne's assessment that it's fine for getCache() to block for 5 minutes until the administrator allows the new node to join. We should modify startCaches() instead to signal to the coordinator that we are ready to receive data for one or all of the defined caches, and wait with a customizable time limit until the caches have properly joined the cluster.

The getCache() timeout should not be increased at all. Instead I would propose that getCache() returns a functional cache immediately, even if the cache didn't receive any data, and it works solely as an L1 cache until the administrator allows it to join. I'd even make it possible to designate a cache as an L1-only cache, so it's never an owner for any key.

I presume this would be encoded in the Address? That would make sense for a node permanently designated as an L1 node. But then how would this work for a node temporarily acting as L1 only, until it has been allowed to join? Change the Address instance on the fly? A delegating Address? :/

Nope, not in the Address. Since we now have virtual cache views, every node has to explicitly request to join each cache. It would be quite easy to add an L1-only flag to the join request, and nodes with that flag would never be included in the proper cache view. They would however get a copy of the cache view on join and on every cache view installation, so they could update their CH and send requests to the proper owners.

Nodes acting temporarily as L1-only would send a normal join request, but they would also receive a copy of the cache view, and getCache() would return immediately instead of waiting for the node to receive state. When the coordinator finally installs a cache view that includes them, they will perform the initial state transfer as they do now.

Makes sense.

For leaves, the main problem is that every node has to compute the same primary owner for a key, at all times. So we need a 2PC cache view installation immediately after any leave to ensure that every node determines the primary owner in the same way - we can't coalesce or postpone leaves.

Yes, manual rehashing would probably just be for joins. Controlled shutdown in itself is manual, and crashes, well, need to be dealt with immediately IMO.

We could extend the policy for crashes as well, by adding a minNumOwners setting and only triggering an automatic rehash when there is a segment on the hash wheel with <= minNumOwners owners. We would have a properly installed CH, that guarantees at some point in the past each key had numOwners owners, and a filter on top of it that removes any leavers from the result of DM.locate(), DM.getPrimaryLocation() etc.
It would probably undo our recent optimizations around locate and getPrimaryLocation, so it's slowing the normal case (without any leavers) in order to make the exceptional case (organized shutdown of a part of the cluster) faster. The question is how big the cluster has to get before the exceptional case becomes common enough that it's worth optimizing for…

Well, we could just leave the crashed node in the owners list. When any caller attempts to do something with this list, i.e., perform an RPC, don't we check again at that time and remove leavers from an RPC recipient list?

We do check against the JGroups cluster membership, but we only remove leavers from the recipient list if the response mode is SYNCHRONOUS_IGNORE_LEAVERS. If the response mode is SYNCHRONOUS, we fail the RPC with a SuspectException. If the response mode is ASYNCHRONOUS, we don't care about the response anyway, so we don't do anything.

In SYNCHRONOUS mode, we then catch the SuspectExceptions in StateTransferLockInterceptor and retry the command after waiting for state transfer to end (even if there is no state transfer in progress, we know there will be one because of the SuspectException). This ensures that we always execute the command on the current primary owner - if we didn't fail PREPAREs when the primary owner is down, we could have multiple PREPAREs claiming to have locked the same key at the same time.

Anyway, we can't rely on the JGroups view membership, we need to check against the cache view membership. When a
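A sketch of that retry loop, with hedged names: invokeRemotely and waitForStateTransferToEnd stand in for Infinispan internals (the real logic lives in StateTransferLockInterceptor); only the shape of the control flow comes from the description above:

    public abstract class SuspectRetrySketch {
        /** Thrown when a SYNCHRONOUS RPC targets a node that has left the cluster. */
        static class SuspectException extends RuntimeException {}

        protected abstract Object invokeRemotely(Object command) throws SuspectException;
        protected abstract void waitForStateTransferToEnd() throws InterruptedException;

        /** Keep retrying so the command always runs against the current primary owner. */
        public Object invokeWithRetry(Object command) throws InterruptedException {
            while (true) {
                try {
                    return invokeRemotely(command); // fails fast if an owner is suspected
                } catch (SuspectException suspected) {
                    // A leave implies a new cache view is coming; wait for it, then
                    // retry against the owners in the updated view.
                    waitForStateTransferToEnd();
                }
            }
        }
    }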