It looks like the core problem is an incoming RPC-1 which triggers 
another blocking RPC-2: the thread delivering RPC-1 is blocked waiting 
for the response from RPC-2, and can therefore not be used to serve 
other requests for the duration of RPC-2. If RPC-2 takes a while, e.g. 
because it waits to acquire a lock on the remote node, then it is clear 
that the thread pool will quickly reach its max size.
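
To make the exhaustion concrete, here is a minimal sketch in plain Java. It does not use JGroups: a fixed-size `ExecutorService` stands in for the OOB thread pool, and the class and method names (`BlockedPoolDemo`, `countTimedOutHandlers`) are made up for illustration. Every handler blocks on the response to its RPC-2, while the responses can only be delivered by a thread from the same pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

class BlockedPoolDemo {
    /**
     * Starts `handlers` RPC-1 handlers on a pool of `poolSize` threads.
     * Each handler blocks until its RPC-2 response arrives or the timeout
     * expires. Returns the number of handlers that timed out.
     */
    static int countTimedOutHandlers(int handlers, int poolSize, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize); // stand-in for the OOB pool
        AtomicInteger timedOut = new AtomicInteger();
        CountDownLatch handlersDone = new CountDownLatch(handlers);
        List<CompletableFuture<String>> responses = new ArrayList<>();

        for (int i = 0; i < handlers; i++) {
            CompletableFuture<String> rsp = new CompletableFuture<>();
            responses.add(rsp);
            pool.execute(() -> {
                try {
                    // The thread delivering RPC-1 blocks here, waiting for RPC-2.
                    rsp.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    timedOut.incrementAndGet();
                } catch (InterruptedException | ExecutionException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    handlersDone.countDown();
                }
            });
        }

        // The responses have "arrived", but delivering them needs a pool thread too.
        pool.execute(() -> responses.forEach(r -> r.complete("ok")));

        try {
            handlersDone.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return timedOut.get();
    }
}
```

With 2 handlers on a pool of 2 threads, `countTimedOutHandlers(2, 2, 200)` returns 2: the delivery task sits in the queue until both handlers have given up. With one spare thread, `countTimedOutHandlers(2, 3, 200)` returns 0.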

A simple solution would be to prevent invoking blocking RPCs *from 
within* a received RPC. Let's take a look at an example:
- A invokes a blocking PUT-1 on B
- B forwards the request as blocking PUT-2 to C and D
- When PUT-2 returns and B gets the responses from C and D (or the first 
one to respond, don't know exactly how this is implemented), it sends 
the response back to A (PUT-1 terminates now at A)

We could change this to the following:
- A invokes a blocking PUT-1 on B
- B receives PUT-1. Instead of invoking a blocking PUT-2 on C and D, it 
does the following:
      - B invokes PUT-2 and gets a future
      - B adds itself as a FutureListener, and it also stores the 
address of the original sender (A)
      - When the FutureListener is invoked, B sends back the result as a 
response to A
- Whenever a member leaves the cluster, the futures waiting on that 
member are cancelled and removed from the hashmaps that track them
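
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: it uses the JDK's `CompletableFuture` as a stand-in for a JGroups future with a `FutureListener`, the transport is simulated by calling `onResponse` directly, the class and method names (`NonBlockingForwarder`, `onPut1`, `invokePut2`, `onMemberLeft`) are hypothetical, and it waits for all backup owners rather than the first one to respond.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class NonBlockingForwarder {
    // Pending PUT-2 futures, keyed by the member whose response they wait for.
    private final Map<String, Set<CompletableFuture<String>>> pendingByMember =
            new ConcurrentHashMap<>();

    // Stand-in for "send response to original sender"; records "sender:result".
    final List<String> responsesSent = new CopyOnWriteArrayList<>();

    /** Stand-in for a non-blocking PUT-2: registers and returns a future. */
    CompletableFuture<String> invokePut2(String target) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pendingByMember.computeIfAbsent(target, k -> ConcurrentHashMap.newKeySet())
                       .add(future);
        return future;
    }

    /** Handles PUT-1 on the incoming thread and returns without blocking. */
    void onPut1(String originalSender, List<String> backupOwners) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String owner : backupOwners) {
            futures.add(invokePut2(owner));
        }
        // The callback plays the role of the FutureListener; it captures the
        // address of the original sender so the response can be routed back.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .whenComplete((ignored, error) ->
                        responsesSent.add(originalSender + ":" + (error == null ? "OK" : "FAILED")));
    }

    /** Simulated transport callback: a PUT-2 response from `member` arrived. */
    void onResponse(String member) {
        Set<CompletableFuture<String>> futures = pendingByMember.remove(member);
        if (futures != null) {
            futures.forEach(f -> f.complete("OK"));
        }
    }

    /** View change: cancel and remove the futures waiting on the departed member. */
    void onMemberLeft(String member) {
        Set<CompletableFuture<String>> futures = pendingByMember.remove(member);
        if (futures != null) {
            futures.forEach(f -> f.cancel(false));
        }
    }
}
```

The point of the sketch is that `onPut1` returns right away, so the thread that delivered PUT-1 is free again while PUT-2 is in flight. The response to A is sent from whichever thread delivers the last PUT-2 response (or the view change). The sketch ignores races between registering a future and a concurrent response or view change, which a real implementation would have to handle.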

This could probably be done differently (e.g. by sending asynchronous 
messages and implementing a finite state machine), but the core of the 
solution is the same; namely to avoid having an incoming thread block on 
a sync RPC.

Thoughts?

On 2/1/13 9:04 AM, Radim Vansa wrote:
> Hi guys,
> after dealing with the large cluster for a while I find the way we use 
> OOB threads in synchronous configuration non-robust.
> Imagine a situation where a node which is not an owner of the key calls PUT. 
> Then an RPC is sent to the primary owner of that key, which reroutes the 
> request to all other owners and, after these reply, replies back.
> There are two problems:
> 1) If we simultaneously send X requests from non-owners to the primary owner, 
> where X is the OOB TP size, all the OOB threads are waiting for the responses and 
> there is no thread to process the OOB response and release the thread.
> 2) Node A is primary owner of keyA, non-primary owner of keyB and B is 
> primary of keyB and non-primary of keyA. We get many requests for both keyA 
> and keyB from other nodes, therefore all OOB threads on both nodes call an 
> RPC on the non-primary owner, but there's no one who could process the request.
> While we wait for the requests to timeout, the nodes with depleted OOB 
> threadpools start suspecting all other nodes because they can't receive 
> heartbeats etc...
> You can say "increase your OOB tp size", but that's not always an option, I 
> have currently set it to 1000 threads and it's not enough. In the end, I will 
> always be limited by RAM and something tells me that even nodes with few gigs 
> of RAM should be able to form a huge cluster. We use 160 hotrod worker 
> threads in JDG, that means that 160 * clusterSize = 10240 (64 nodes in my 
> cluster) parallel requests can be executed, and if 10% targets the same node 
> with 1000 OOB threads, it gets stuck. It's about scaling and robustness.
> Not that I'd have any good solution, but I'd really like to start a 
> discussion.
> Thinking about it a bit, the problem is that a blocking call (calling an RPC on 
> the primary owner from a message handler) can block non-blocking calls (such as an RPC 
> response or a command that never sends any more messages). Therefore, having a 
> flag on the message saying "this won't send another message" could let the message be 
> executed in a different thread pool, which will never be deadlocked. In fact, 
> the pools could share the threads, but the non-blocking one would always have a 
> few threads spare.
> It's a bad solution, as keeping track of which messages could block on the other 
> node is really, really hard (we can be sure only in case of RPC responses), 
> especially when locks are involved. I would welcome anything better.

Bela Ban, JGroups lead (

infinispan-dev mailing list