Re: [infinispan-dev] DIST.retrieveFromRemoteSource

2012-02-05 Thread Dan Berindei
On Sat, Feb 4, 2012 at 5:23 PM, Bela Ban b...@redhat.com wrote:
 No, Socket.send() is not a hotspot in the Java code.

 On the networking level, with UDP datagrams, a packet is placed into a
 buffer and send() returns after the packet has been placed into the
 buffer successfully. Some network stacks simply discard the packet if
 the buffer is full, I assume.

 With TCP, send() blocks until the packet has been placed into the buffer
 and an ack has been received. If the receiver can't keep up with the
 sending rate, it'll reduce the window to 0, so send() blocks until the
 receiver catches up.

 So if you send 2 requests in parallel over TCP, both threads would
 block on the send.

 So I don't think parallelization would make sense here, unless of course
 you're doing something else, such as serialization, lock acquisition etc...


You're probably right...

When I sent the email I had only seen that DatagramSocket.send takes
0.4% of the get worker's total time (which is huge considering that
99.3% is spent waiting for the remote node's response). I didn't
notice that it's synchronized and the actual sending only takes half
of that - sending the messages in parallel would create more lock
contention on the socket *and* in JGroups.

On the other hand, we're sending the messages to two different nodes,
on two different sockets, so sending in parallel *may* improve the
response time in scenarios with few worker threads. Certainly not
worth making our heavy-load performance worse, but I thought it was
worth mentioning.
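
Just to make it concrete, this is roughly what I mean by "sending in
parallel" - a plain-JDK sketch, not Infinispan/JGroups code, and the
per-target sockets are an assumption - with one send() task per recipient:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelSendSketch {
        // One pre-addressed packet per target, each send() submitted to its own
        // thread. With a single shared socket the sends still serialize on the
        // socket lock, so any gain can only come from one socket per target.
        static void sendInParallel(final List<DatagramSocket> sockets,
                                   final List<DatagramPacket> packets) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(packets.size());
            try {
                List<Future<Void>> pending = new ArrayList<Future<Void>>();
                for (int i = 0; i < packets.size(); i++) {
                    final DatagramSocket socket = sockets.get(i);
                    final DatagramPacket packet = packets.get(i);
                    pending.add(pool.submit(new Callable<Void>() {
                        public Void call() throws Exception {
                            socket.send(packet);
                            return null;
                        }
                    }));
                }
                for (Future<Void> f : pending) {
                    f.get();   // wait until every send() has returned
                }
            } finally {
                pool.shutdown();
            }
        }
    }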


Anyway, my primary motivation for this question was that I believe we
could use GroupRequest instead of our FutureCollator to send our
commands in parallel. They both do essentially the same thing, except
FutureCollator has an extra indirection layer because it uses
UnicastRequest. If there is any performance discrepancy at the moment
between GroupRequest and FutureCollator+UnicastRequest, I don't think
there is any fundamental reason why we can't fix GroupRequest to be
just as efficient as FutureCollator+UnicastRequest.

I think I just saw one such fix: in RequestCorrelator.sendRequest, if
anycasting is enabled then it's making a copy of the buffer for each
target member. I don't think that is necessary at all, in fact I think
it should reuse both the buffer and the headers and only change the
destination address.

Cheers
Dan



 On 2/4/12 3:43 PM, Manik Surtani wrote:
 Is that a micro-optimisation?  Do we know that socket.send() really is a 
 hotspot?

 On 1 Feb 2012, at 00:11, Dan Berindei wrote:

 It's true, but then JGroups' GroupRequest does exactly the same thing...

 socket.send() takes some time too, I thought sending the request in
 parallel would mean calling socket.send() on a separate thread for
 each recipient.

 Cheers
 Dan


 On Fri, Jan 27, 2012 at 6:41 PM, Manik Surtani ma...@jboss.org wrote:
 Doesn't setBlockForResults(false) mean that we're not waiting on a 
 response, and can proceed to the next message to the next recipient?

 On 27 Jan 2012, at 16:34, Dan Berindei wrote:

 Manik, Bela, I think we send the requests sequentially as well. In
 ReplicationTask.call:

                for (Address a : targets) {
                   NotifyingFuture<Object> f = sendMessageWithFuture(constructMessage(buf, a), opts);
                   futureCollator.watchFuture(f, a);
                }


 In MessageDispatcher.sendMessageWithFuture:

         UnicastRequest<T> req=new UnicastRequest<T>(msg, corr, dest, options);
         req.setBlockForResults(false);
         req.execute();


 Did we use to send each request on a separate thread?


 Cheers
 Dan


 On Fri, Jan 27, 2012 at 1:21 PM, Bela Ban b...@redhat.com wrote:
 yes.

 On 1/27/12 12:13 PM, Manik Surtani wrote:

 On 25 Jan 2012, at 09:42, Bela Ban wrote:

 No, parallel unicasts will be faster, as an anycast to A,B,C sends the
 unicasts sequentially

 Is this still the case in JG 3.x?


 --
 Bela Ban
 Lead JGroups (http://www.jgroups.org)
 JBoss / Red Hat

 --
 Manik Surtani
 ma...@jboss.org
 twitter.com/maniksurtani

 Lead, Infinispan
 http://www.infinispan.org





 --
 Manik Surtani
 ma...@jboss.org
 twitter.com/maniksurtani

 Lead, Infinispan
 http://www.infinispan.org





Re: [infinispan-dev] again: no physical address

2012-02-05 Thread Bela Ban


On 2/4/12 5:53 PM, Manik Surtani wrote:

 On 2 Feb 2012, at 07:53, Bela Ban wrote:

 I can also reproduce it by now, in JGroups: I simply create 12 members
 in a loop...

 Don't need the bombastic Transactional test

 Yup; Transactional was made to benchmark and profile 2-phase transactions - 
 as the name suggests! ;) - in Infinispan.  Not JGroups.

 Looking into it.

 So this is what was addressed in 3.0.5?


Yes, there are 2 changes which reduce the chance of missing IP 
addresses. The workaround is to set PING.num_initial_members to 12


-- 
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat
___
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev


Re: [infinispan-dev] DIST.retrieveFromRemoteSource

2012-02-05 Thread Bela Ban


On 2/5/12 9:44 AM, Dan Berindei wrote:
 On Sat, Feb 4, 2012 at 5:23 PM, Bela Ban b...@redhat.com wrote:
 No, Socket.send() is not a hotspot in the Java code.

 On the networking level, with UDP datagrams, a packet is placed into a
 buffer and send() returns after the packet has been placed into the
 buffer successfully. Some network stacks simply discard the packet if
 the buffer is full, I assume.

 With TCP, send() blocks until the packet has been placed into the buffer
 and an ack has been received. If the receiver can't keep up with the
 sending rate, it'll reduce the window to 0, so send() blocks until the
 receiver catches up.

 So if you send 2 requests in parallel over TCP, both threads would
 block on the send.

 So I don't think parallelization would make sense here, unless of course
 you're doing something else, such as serialization, lock acquisition etc...


 You're probably right...

 When I sent the email I had only seen that DatagramSocket.send takes
 0.4% of the get worker's total time (which is huge considering that
 99.3% is spent waiting for the remote node's response). I didn't
 notice that it's synchronized and the actual sending only takes half
 of that - sending the messages in parallel would create more lock
 contention on the socket *and* in JGroups.

 On the other hand, we're sending the messages to two different nodes,
 on two different sockets, so sending in parallel *may* improve the
 response time in scenarios with few worker threads. Certainly not
 worth making our heavy-load performance worse, but I thought it was
 worth mentioning.


If you think this makes a difference, why don't you make a temporary 
code change and measure its effect on performance ?


 Anyway, my primary motivation for this question was that I believe we
 could use GroupRequest instead of our FutureCollator to send our
 commands in parallel. They both do essentially the same thing, except
 FutureCollator has an extra indirection layer because it uses
 UnicastRequest.


Ah, ok. Yes, I think that 1 GroupRequest with 2 targets might be more
efficient than 2 UnicastRequests for sending an 'anycast' message to 2 members.
Perhaps the marshalling is done only once (I'd have to confirm that
though), and we're only creating 1 data structure and adding it to a hashmap
(request/response correlation)... Certainly worth a try...


  If there is any performance discrepancy at the moment
 between GroupRequest and FutureCollator+UnicastRequest, I don't think
 there is any fundamental reason why we can't fix GroupRequest to be
 just as efficient as FutureCollator+UnicastRequest.


You're assuming that FutureCollator+UnicastRequest is faster than 
GroupRequest - what are you basing your assumption on ? As I outlined 
above, I'd rather assume the opposite, although I don't know FutureCollator.

Maybe we should have a chat next week to go through the code that sends 
an anycast to 2 members, wdyt ?


 I think I just saw one such fix: in RequestCorrelator.sendRequest, if
 anycasting is enabled then it's making a copy of the buffer for each
 target member. I don't think that is necessary at all, in fact I think
 it should reuse both the buffer and the headers and only change the
 destination address.

No, this *is* necessary, I don't just make copies because I think this 
is fun !! :-)

Remember that a message might be retransmitted, so it is placed into a 
retransmit buffer. If M1 has destination A and M2 has destination B, and 
we send M1 first (to A), then change M1's destination to B, and send it, 
everything is fine. However, if we later get a retransmit request from 
B, we'd resend the message to A instead ! This is just 1 example; 
modification of the headers is another.

Note that the copy does *not* copy the buffer (payload) itself, but only 
references it, so this is fast. Of course, nobody is supposed to modify 
the contents of the buffer itself...
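
To illustrate (a standalone sketch against the JGroups 3.x API as I 
remember it - double-check the exact constructor, but copy(true) and 
getRawBuffer() behave like this):

    import org.jgroups.Address;
    import org.jgroups.Message;
    import org.jgroups.util.UUID;

    public class MessageCopySketch {
        public static void main(String[] args) {
            Address a = UUID.randomUUID();
            Address b = UUID.randomUUID();
            byte[] payload = new byte[] { 1, 2, 3 };

            Message m1 = new Message(a, null, payload);
            // copy(true) copies the envelope (dest, headers) but only *references*
            // the payload, so the copy is cheap and the byte[] is shared.
            Message m2 = m1.copy(true);
            m2.setDest(b);

            System.out.println(m1.getRawBuffer() == m2.getRawBuffer()); // true: same byte[]
            System.out.println(m1.getDest() + " / " + m2.getDest());    // different destinations
        }
    }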

-- 
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat
___
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev


Re: [infinispan-dev] Proposal: ISPN-1394 Manual rehashing in 5.2

2012-02-05 Thread Dan Berindei
On Sat, Feb 4, 2012 at 4:49 PM, Manik Surtani ma...@jboss.org wrote:

 On 1 Feb 2012, at 12:23, Dan Berindei wrote:

 Bela, you're right, this is essentially what we talked about in Lisbon:
 https://community.jboss.org/wiki/AsymmetricCachesAndManualRehashingDesign

 For joins I actually started working on a policy of coalescing joins
 that happen one after the other in a short time interval. The current
 implementation is very primitive, as I shifted focus to stability, but
 it does coalesce joins 1 second after another join started (or while
 that join is still running).

 I don't quite agree with Sanne's assessment that it's fine for
 getCache() to block for 5 minutes until the administrator allows the
 new node to join. We should modify startCaches() instead to signal to
 the coordinator that we are ready to receive data for one or all of
 the defined caches, and wait with a customizable time limit until the
 caches have properly joined the cluster.

 The getCache() timeout should not be increased at all. Instead I would
 propose that getCache() returns a functional cache immediately, even
 if the cache didn't receive any data, and it works solely as an L1
 cache until the administrator allows it to join. I'd even make it
 possible to designate a cache as an L1-only cache, so it's never an
 owner for any key.

 I presume this would be encoded in the Address?  That would make sense for a 
 node permanently designated as an L1 node.  But then how would this work for 
 a node temporarily acting as L1 only, until it has been allowed to join?  
 Change the Address instance on the fly?  A delegating Address?  :/


Nope, not in the Address. Since we now have virtual cache views, every
node has to explicitly request to join each cache. It would be quite
easy to add an L1-only flag to the join request, and nodes with that
flag would never be included in the proper cache view. They would
however get a copy of the cache view on join and on every cache view
installation so they could update their CH and send requests to the
proper owners.

Nodes acting temporarily as L1-only would send a normal join request,
but they would also receive a copy of the cache view and getCache()
would return immediately instead of waiting for the node to receive
state. When the coordinator finally installs a cache view that
includes them, they will perform the initial state transfer as they do
now.
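
Roughly what I have in mind for the join request - purely illustrative,
none of these names exist in the codebase today:

    // Hypothetical shape of the join request for a virtual cache view: an l1Only
    // node gets the cache view (so it can locate the real owners) but is never
    // included in the owners list when the coordinator computes the CH.
    public class CacheJoinRequest {
        private final String cacheName;
        private final boolean l1Only;

        public CacheJoinRequest(String cacheName, boolean l1Only) {
            this.cacheName = cacheName;
            this.l1Only = l1Only;
        }

        public String getCacheName() { return cacheName; }
        public boolean isL1Only()    { return l1Only; }
    }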


 For leaves, the main problem is that every node has to compute the
 same primary owner for a key, at all times. So we need a 2PC cache
 view installation immediately after any leave to ensure that every
 node determines the primary owner in the same way - we can't coalesce
 or postpone leaves.

 Yes, manual rehashing would probably just be for joins.  Controlled shutdown 
 in itself is manual, and crashes, well, need to be dealt with immediately IMO.


We could extend the policy for crashes as well, by adding a
minNumOwners setting and only triggering an automatic rehash when
there is a segment on the hash wheel with <= minNumOwners owners.

We would have a properly installed CH that guarantees at some point
in the past each key had numOwners owners, and a filter on top of it
that removes any leavers from the result of DM.locate(),
DM.getPrimaryLocation() etc.
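
Something along these lines - LeaverFilteringLocator is a made-up helper,
only DistributionManager.locate() is the real API:

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;

    import org.infinispan.distribution.DistributionManager;
    import org.infinispan.remoting.transport.Address;

    public class LeaverFilteringLocator {
        private final DistributionManager dm;
        // Updated on every view change; leavers are simply absent from this set.
        private volatile Collection<Address> clusterMembers;

        public LeaverFilteringLocator(DistributionManager dm, Collection<Address> members) {
            this.dm = dm;
            this.clusterMembers = members;
        }

        public void updateMembers(Collection<Address> members) {
            this.clusterMembers = members;
        }

        // The installed CH is untouched; we only drop leavers from the answer.
        public List<Address> locate(Object key) {
            List<Address> owners = new ArrayList<Address>(dm.locate(key));
            owners.retainAll(clusterMembers);
            return owners;
        }
    }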

It would probably undo our recent optimizations around locate and
getPrimaryLocation, so it would slow down the normal case (without any
leavers) in order to make the exceptional case (organized shutdown of
a part of the cluster) faster. The question is how big the cluster has
to get before the exceptional case becomes common enough that it's
worth optimizing for...



 For 5.2 I will try to decouple the cache view installation from the
 state transfer, so in theory we will be able to coalesce/postpone the
 state transfer for leaves as well
 (https://issues.jboss.org/browse/ISPN-1827). I kind of need it for
 non-blocking state transfer, because with the current implementation a
 leave forces us to cancel any state transfer in progress and restart
 with the updated cache view - a state transfer rollback will be very
 expensive with NBST.


 Erik does raise a valid point - with TACH, if we bring up a node with
 a different siteId, then it will be an owner for all the keys in the
 cache. That node probably isn't provisioned to hold all the keys, so
 it would very likely run out of memory or evict much of the data. I
 guess that makes it a 5.2 issue?

 Yes.

 Shutting down a site should be possible even with what we have now -
 just insert a DISCARD protocol in the JGroups stack of all the nodes
 that are shutting down, and when FD finally times out on the nodes in
 the surviving datacenter they won't have any state transfer to do
 (although it may cause a few failed state transfer attempts). We could
 make it simpler though.


 Cheers
 Dan


 On Tue, Jan 31, 2012 at 6:21 PM, Erik Salter an1...@hotmail.com wrote:
 ...such as bringing up a backup data center.

 -Original Message-
 

Re: [infinispan-dev] DIST.retrieveFromRemoteSource

2012-02-05 Thread Dan Berindei
On Sun, Feb 5, 2012 at 12:24 PM, Bela Ban b...@redhat.com wrote:


 On 2/5/12 9:44 AM, Dan Berindei wrote:
 On Sat, Feb 4, 2012 at 5:23 PM, Bela Ban b...@redhat.com wrote:
 No, Socket.send() is not a hotspot in the Java code.

 On the networking level, with UDP datagrams, a packet is placed into a
 buffer and send() returns after the packet has been placed into the
 buffer successfully. Some network stacks simply discard the packet if
 the buffer is full, I assume.

 With TCP, send() blocks until the packet has been placed into the buffer
 and an ack has been received. If the receiver can't keep up with the
 sending rate, it'll reduce the window to 0, so send() blocks until the
 receiver catches up.

 So if you send 2 requests in parallel over TCP, both threads would
 block on the send.

 So I don't think parallelization would make sense here, unless of course
 you're doing something else, such as serialization, lock acquisition etc...


 You're probably right...

 When I sent the email I had only seen that DatagramSocket.send takes
 0.4% of the get worker's total time (which is huge considering that
 99.3% is spent waiting for the remote node's response). I didn't
 notice that it's synchronized and the actual sending only takes half
 of that - sending the messages in parallel would create more lock
 contention on the socket *and* in JGroups.

 On the other hand, we're sending the messages to two different nodes,
 on two different sockets, so sending in parallel *may* improve the
 response time in scenarios with few worker threads. Certainly not
 worth making our heavy-load performance worse, but I thought it was
 worth mentioning.


 If you think this makes a difference, why don't you make a temporary
 code change and measure its effect on performance ?


I will give it a try, it's just that writing emails is a bit easier ;-)


 Anyway, my primary motivation for this question was that I believe we
 could use GroupRequest instead of our FutureCollator to send our
 commands in parallel. They both do essentially the same thing, except
 FutureCollator has an extra indirection layer because it uses
 UnicastRequest.


 Ah, ok. Yes, I think that 1 GroupRequest with 2 targets might be more
 efficient than 2 UnicastRequests for sending an 'anycast' message to 2 members.
 Perhaps the marshalling is done only once (I'd have to confirm that
 though), and we're only creating 1 data structure and adding it to a hashmap
 (request/response correlation)... Certainly worth a try...


We are only serializing once with the FutureCollator approach, and we
don't copy the buffer either.


  If there is any performance discrepancy at the moment
 between GroupRequest and FutureCollator+UnicastRequest, I don't think
 there is any fundamental reason why we can't fix GroupRequest to be
 just as efficient as FutureCollator+UnicastRequest.


 You're assuming that FutureCollator+UnicastRequest is faster than
 GroupRequest - what are you basing your assumption on ? As I outlined
 above, I'd rather assume the opposite, although I don't know FutureCollator.


I did say *if*...

 Maybe we should have a chat next week to go through the code that sends
 an anycast to 2 members, wdyt ?


Sounds good, we should also talk about how best to implement staggered
get requests.


 I think I just saw one such fix: in RequestCorrelator.sendRequest, if
 anycasting is enabled then it's making a copy of the buffer for each
 target member. I don't think that is necessary at all, in fact I think
 it should reuse both the buffer and the headers and only change the
 destination address.

 No, this *is* necessary, I don't just make copies because I think this
 is fun !! :-)

 Remember that a message might be retransmitted, so it is placed into a
 retransmit buffer. If M1 has destination A and M2 has destination B, and
 we send M1 first (to A), then change M1's destination to B, and send it,
 everything is fine. However, if we later get a retransmit request from
 B, we'd resend the message to A instead ! This is just 1 example;
 modification of the headers is another.

 Note that the copy does *not* copy the buffer (payload) itself, but only
 references it, so this is fast. Of course, nobody is supposed to modify
 the contents of the buffer itself...


I wasn't clear enough, but I didn't mean we should reuse the entire
Message object. I meant we should copy the Message but not the buffer
or the headers. I see now that protocols may be adding new headers, so
it wouldn't be safe to reuse the headers collection.

I think this line in
RequestCorrelator.sendRequest(RequestCorrelator.java:152) means that
the contents of the buffer are copied into the new message, not just the
buffer reference:

Message copy=msg.copy(true);


Cheers
Dan

___
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev


Re: [infinispan-dev] DIST.retrieveFromRemoteSource

2012-02-05 Thread Dan Berindei
On Sun, Feb 5, 2012 at 5:56 PM, Bela Ban b...@redhat.com wrote:


 On 2/5/12 4:40 PM, Dan Berindei wrote:


 Remember that a message might be retransmitted, so it is placed into a
 retransmit buffer. If M1 has destination A and M2 has destination B, and
 we send M1 first (to A), then change M1's destination to B, and send it,
 everything is fine. However, if we later get a retransmit request from
 B, we'd resend the message to A instead ! This is just 1 example;
 modification of the headers is another.

 Note that the copy does *not* copy the buffer (payload) itself, but only
 references it, so this is fast. Of course, nobody is supposed to modify
 the contents of the buffer itself...


 I wasn't clear enough, but I didn't mean we should reuse the entire
 Message object. I meant we should copy the Message but not the buffer
 or the headers. I see now that protocols may be adding new headers, so
 it wouldn't be safe to reuse the headers collection.

 I think this line in
 RequestCorrelator.sendRequest(RequestCorrelator.java:152) means that
 the contents of the buffer are copied into the new message, not just the
 buffer reference:

                  Message copy=msg.copy(true);


 No, this does *not* copy the buffer, but simply references the same buffer.


Aha, I thought copy_buffer == true meant copy the contents and
copy_buffer == false meant share the contents. I see copy_buffer ==
true actually means copy the reference and share the contents, while
copy_buffer == false means the copy doesn't reference the buffer at all.

I will modify our CommandAwareRpcDispatcher to use GroupRequest and
see how they compare, then we can continue this discussion with the
results.
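
Roughly what I plan to try - a sketch against the JGroups 3.x API as I
understand it (castMessage() builds a single GroupRequest internally),
so the method names are worth double-checking:

    import java.util.Collection;

    import org.jgroups.Address;
    import org.jgroups.Message;
    import org.jgroups.blocks.MessageDispatcher;
    import org.jgroups.blocks.RequestOptions;
    import org.jgroups.blocks.ResponseMode;
    import org.jgroups.util.RspList;

    public class AnycastSketch {
        // One call, one GroupRequest, one marshalled buffer - instead of building
        // a UnicastRequest per target and collating the futures ourselves.
        static RspList<Object> anycast(MessageDispatcher dispatcher,
                                       Collection<Address> targets,
                                       byte[] payload, long timeout) throws Exception {
            Message msg = new Message(null, null, payload);   // dest stays null, targets come from the call
            RequestOptions opts = new RequestOptions(ResponseMode.GET_ALL, timeout)
                    .setAnycasting(true);
            RspList<Object> rsps = dispatcher.castMessage(targets, msg, opts);
            return rsps;
        }
    }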

Cheers
Dan

___
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev


Re: [infinispan-dev] Proposal: ISPN-1394 Manual rehashing in 5.2

2012-02-05 Thread Manik Surtani

On 5 Feb 2012, at 16:24, Dan Berindei wrote:

 On Sat, Feb 4, 2012 at 4:49 PM, Manik Surtani ma...@jboss.org wrote:
 
 On 1 Feb 2012, at 12:23, Dan Berindei wrote:
 
 Bela, you're right, this is essentially what we talked about in Lisbon:
 https://community.jboss.org/wiki/AsymmetricCachesAndManualRehashingDesign
 
 For joins I actually started working on a policy of coalescing joins
 that happen one after the other in a short time interval. The current
 implementation is very primitive, as I shifted focus to stability, but
 it does coalesce joins 1 second after another join started (or while
 that join is still running).
 
 I don't quite agree with Sanne's assessment that it's fine for
 getCache() to block for 5 minutes until the administrator allows the
 new node to join. We should modify startCaches() instead to signal to
 the coordinator that we are ready to receive data for one or all of
 the defined caches, and wait with a customizable time limit until the
 caches have properly joined the cluster.
 
 The getCache() timeout should not be increased at all. Instead I would
 propose that getCache() returns a functional cache immediately, even
 if the cache didn't receive any data, and it works solely as an L1
 cache until the administrator allows it to join. I'd even make it
 possible to designate a cache as an L1-only cache, so it's never an
 owner for any key.
 
 I presume this would be encoded in the Address?  That would make sense for a 
 node permanently designated as an L1 node.  But then how would this work for 
 a node temporarily acting as L1 only, until it has been allowed to join?  
 Change the Address instance on the fly?  A delegating Address?  :/
 
 
 Nope, not in the Address. Since we now have virtual cache views, every
 node has to explicitly request to join each cache. It would be quite
 easy to add an L1-only flag to the join request, and nodes with that
 flag would never be included in the proper cache view. They would
 however get a copy of the cache view on join and on every cache view
 installation so they could update their CH and send requests to the
 proper owners.
 
 Nodes acting temporarily as L1-only would send a normal join request,
 but they would also receive a copy of the cache view and getCache()
 would return immediately instead of waiting for the node to receive
 state. When the coordinator finally installs a cache view that
 includes them, they will perform the initial state transfer as they do
 now.

Makes sense.

 
 
 For leaves, the main problem is that every node has to compute the
 same primary owner for a key, at all times. So we need a 2PC cache
 view installation immediately after any leave to ensure that every
 node determines the primary owner in the same way - we can't coalesce
 or postpone leaves.
 
 Yes, manual rehashing would probably just be for joins.  Controlled shutdown 
 in itself is manual, and crashes, well, need to be dealt with immediately 
 IMO.
 
 
 We could extend the policy for crashes as well, by adding a
 minNumOwners setting and only triggering an automatic rehash when
 there is a segment on the hash wheel with <= minNumOwners owners.
 
 We would have a properly installed CH that guarantees at some point
 in the past each key had numOwners owners, and a filter on top of it
 that removes any leavers from the result of DM.locate(),
 DM.getPrimaryLocation() etc.
 
 It would probably undo our recent optimizations around locate and
 getPrimaryLocation, so it would slow down the normal case (without any
 leavers) in order to make the exceptional case (organized shutdown of
 a part of the cluster) faster. The question is how big the cluster has
 to get before the exceptional case becomes common enough that it's
 worth optimizing for…

Well, we could just leave the crashed node in the owners list.  When any caller 
attempts to do something with this list, i.e., perform an RPC, don't we check 
again at that time and remove leavers from an RPC recipient list?

Cheers
Manik
--
Manik Surtani
ma...@jboss.org
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org




___
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev


Re: [infinispan-dev] Proposal: ISPN-1394 Manual rehashing in 5.2

2012-02-05 Thread Dan Berindei
On Mon, Feb 6, 2012 at 3:30 AM, Manik Surtani ma...@jboss.org wrote:

 On 5 Feb 2012, at 16:24, Dan Berindei wrote:

 On Sat, Feb 4, 2012 at 4:49 PM, Manik Surtani ma...@jboss.org wrote:

 On 1 Feb 2012, at 12:23, Dan Berindei wrote:

 Bela, you're right, this is essentially what we talked about in Lisbon:
 https://community.jboss.org/wiki/AsymmetricCachesAndManualRehashingDesign

 For joins I actually started working on a policy of coalescing joins
 that happen one after the other in a short time interval. The current
 implementation is very primitive, as I shifted focus to stability, but
 it does coalesce joins 1 second after another join started (or while
 that join is still running).

 I don't quite agree with Sanne's assessment that it's fine for
 getCache() to block for 5 minutes until the administrator allows the
 new node to join. We should modify startCaches() instead to signal to
 the coordinator that we are ready to receive data for one or all of
 the defined caches, and wait with a customizable time limit until the
 caches have properly joined the cluster.

 The getCache() timeout should not be increased at all. Instead I would
 propose that getCache() returns a functional cache immediately, even
 if the cache didn't receive any data, and it works solely as an L1
 cache until the administrator allows it to join. I'd even make it
 possible to designate a cache as an L1-only cache, so it's never an
 owner for any key.

 I presume this would be encoded in the Address?  That would make sense for 
 a node permanently designated as an L1 node.  But then how would this work 
 for a node temporarily acting as L1 only, until it has been allowed to 
 join?  Change the Address instance on the fly?  A delegating Address?  :/


 Nope, not in the Address. Since we now have virtual cache views, every
 node has to explicitly request to join each cache. It would be quite
 easy to add an L1-only flag to the join request, and nodes with that
 flag would never be included in the proper cache view. They would
 however get a copy of the cache view on join and on every cache view
 installation so they could update their CH and send requests to the
 proper owners.

 Nodes acting temporarily as L1-only would send a normal join request,
 but they would also receive a copy of the cache view and getCache()
 would return immediately instead of waiting for the node to receive
 state. When the coordinator finally installs a cache view that
 includes them, they will perform the initial state transfer as they do
 now.

 Makes sense.



 For leaves, the main problem is that every node has to compute the
 same primary owner for a key, at all times. So we need a 2PC cache
 view installation immediately after any leave to ensure that every
 node determines the primary owner in the same way - we can't coalesce
 or postpone leaves.

 Yes, manual rehashing would probably just be for joins.  Controlled 
 shutdown in itself is manual, and crashes, well, need to be dealt with 
 immediately IMO.


 We could extend the policy for crashes as well, by adding a
 minNumOwners setting and only triggering an automatic rehash when
 there is a segment on the hash wheel with <= minNumOwners owners.

 We would have a properly installed CH that guarantees at some point
 in the past each key had numOwners owners, and a filter on top of it
 that removes any leavers from the result of DM.locate(),
 DM.getPrimaryLocation() etc.

 It would probably undo our recent optimizations around locate and
 getPrimaryLocation, so it would slow down the normal case (without any
 leavers) in order to make the exceptional case (organized shutdown of
 a part of the cluster) faster. The question is how big the cluster has
 to get before the exceptional case becomes common enough that it's
 worth optimizing for…

 Well, we could just leave the crashed node in the owners list.  When any 
 caller attempts to do something with this list, i.e., perform an RPC, don't 
 we check again at that time and remove leavers from an RPC recipient list?


We do check against the JGroups cluster membership, but we only remove
leavers from the recipient list if the response mode is
SYNCHRONOUS_IGNORE_LEAVERS. If the response mode is SYNCHRONOUS, we
fail the RPC with a SuspectException. If the response mode is
ASYNCHRONOUS, we don't care about the response anyway, so we don't do
anything.

In SYNCHRONOUS mode, we then catch the SuspectExceptions in
StateTransferLockInterceptor and we retry the command after
waiting for state transfer to end (even if there is no state transfer
in progress, we know there will be one because of the
SuspectException).

This ensures that we always execute the command on the current primary
owner - if we did not fail PREPAREs when the primary owner is down,
we could have multiple PREPAREs claiming to have locked the same key
at the same time.
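
In simplified form, the retry looks like this - not the actual
StateTransferLockInterceptor code, and waitForStateTransferToEnd() is a
placeholder name:

    import org.infinispan.remoting.transport.jgroups.SuspectException;

    public class RetryOnSuspectSketch {

        interface RemoteInvocation<T> {
            T invoke() throws Exception;
        }

        interface StateTransferGate {
            void waitForStateTransferToEnd() throws InterruptedException;
        }

        // Keep retrying the command until it runs against a cache view that
        // no longer contains the suspected node, i.e. against the current owners.
        static <T> T invokeWithRetry(RemoteInvocation<T> command, StateTransferGate gate)
                throws Exception {
            while (true) {
                try {
                    return command.invoke();
                } catch (SuspectException suspected) {
                    // A target left the cluster: a new cache view (and rehash) is
                    // on its way, so wait for it and retry on the new primary owner.
                    gate.waitForStateTransferToEnd();
                }
            }
        }
    }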

Anyway, we can't rely on the JGroups view membership; we need to check
against the cache view membership. When a