It looks like the core problem is an incoming RPC-1 that triggers
another blocking RPC-2: the thread delivering RPC-1 is blocked waiting
for the response to RPC-2, and therefore cannot be used to serve other
requests for the duration of RPC-2. If RPC-2 takes a while, e.g. because
it has to wait to acquire a lock on the remote node, then the thread
pool will quickly reach its max size, leaving no thread free to process
further requests or responses.
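
To make the starvation concrete, here is a minimal simulation. It is a
sketch, not JGroups code: a fixed-size pool stands in for the OOB thread
pool, and each RPC-1 handler blocks on an RPC-2 whose response has to be
delivered by a thread from that same pool. All class and method names are
made up for illustration, and the latch only models the requests arriving
at the same time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical simulation; none of these names are JGroups or Infinispan APIs.
class NestedRpcStarvation {

    /**
     * Runs `requests` simultaneous RPC-1 handlers on a pool of `poolSize` threads.
     * Returns true if every handler received its RPC-2 response within `waitMs`.
     */
    static boolean allCompleted(int poolSize, int requests, long waitMs) {
        ExecutorService oobPool = Executors.newFixedThreadPool(poolSize);
        // Handlers wait here until as many as can run concurrently have started.
        CountDownLatch allStarted = new CountDownLatch(Math.min(poolSize, requests));
        CountDownLatch done = new CountDownLatch(requests);
        List<CompletableFuture<String>> responses = new ArrayList<>();

        for (int i = 0; i < requests; i++) {
            CompletableFuture<String> rpc2Response = new CompletableFuture<>();
            responses.add(rpc2Response);
            oobPool.execute(() -> {
                allStarted.countDown();
                awaitQuietly(allStarted);
                // "Send" RPC-2: its response needs a thread from the same pool.
                oobPool.execute(() -> rpc2Response.complete("ok"));
                rpc2Response.join(); // the RPC-1 thread blocks here
                done.countDown();
            });
        }

        boolean finished;
        try {
            finished = done.await(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            finished = false;
        }
        // Stand-in for the RPC timeout: unblock stuck handlers so the pool can stop.
        responses.forEach(f -> f.complete("timed out"));
        oobPool.shutdown();
        return finished;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("3 requests, 4 threads: " + allCompleted(4, 3, 2000));
        System.out.println("4 requests, 4 threads: " + allCompleted(4, 4, 300));
    }
}
```

With one spare thread the responses get through; once the number of
simultaneous requests equals the pool size, every thread sits in join()
and the queued responses are never processed until the simulated timeout.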

A simple solution would be to prevent invoking blocking RPCs *from
within* a received RPC. Let's take a look at an example:

- A invokes a blocking PUT-1 on B
- B forwards the request as blocking PUT-2 to C and D
- When PUT-2 returns and B gets the responses from C and D (or the
  first one to respond, don't know exactly how this is implemented),
  it sends the response back to A (PUT-1 terminates now at A)

We could change this to the following:

- A invokes a blocking PUT-1 on B
- B receives PUT-1. Instead of invoking a blocking PUT-2 on C and D,
  it does the following:
  - B invokes PUT-2 and gets a future
  - B adds itself as a FutureListener, and it also stores the address
    of the original sender (A)
  - When the FutureListener is invoked, B sends back the result as a
    response to A
- Whenever a member leaves the cluster, the corresponding futures are
  cancelled and removed from the hashmaps

This could probably be done differently (e.g. by sending asynchronous
messages and implementing a finite state machine), but the core of the
solution is the same, namely to avoid having an incoming thread block
on a sync RPC.

Thoughts?

On 2/1/13 9:04 AM, Radim Vansa wrote:
> Hi guys,
>
> after dealing with the large cluster for a while I find the way how we use
> OOB threads in synchronous configuration non-robust.
> Imagine a situation where node which is not an owner of the key calls PUT.
> Then an RPC is called to the primary owner of that key, which reroutes the
> request to all other owners and after these reply, it replies back.
> There are two problems:
> 1) If we do simultaneously X requests from non-owners to the primary owner
> where X is OOB TP size, all the OOB threads are waiting for the responses and
> there is no thread to process the OOB response and release the thread.
> 2) Node A is primary owner of keyA, non-primary owner of keyB and B is
> primary of keyB and non-primary of keyA.
> We got many requests for both keyA
> and keyB from other nodes, therefore, all OOB threads from both nodes call
> RPC to the non-primary owner but there's no one who could process the request.
>
> While we wait for the requests to time out, the nodes with depleted OOB
> threadpools start suspecting all other nodes because they can't receive
> heartbeats etc...
>
> You can say "increase your OOB tp size", but that's not always an option, I
> have currently set it to 1000 threads and it's not enough. In the end, I will
> be always limited by RAM and something tells me that even nodes with few gigs
> of RAM should be able to form a huge cluster. We use 160 hotrod worker
> threads in JDG, that means that 160 * clusterSize = 10240 (64 nodes in my
> cluster) parallel requests can be executed, and if 10% target the same node
> with 1000 OOB threads, it gets stuck. It's about scaling and robustness.
>
> Not that I'd have any good solution, but I'd really like to start a
> discussion.
> Thinking about it a bit, the problem is that blocking call (calling RPC on
> primary owner from message handler) can block non-blocking calls (such as RPC
> response or command that never sends any more messages). Therefore, having a
> flag on message "this won't send another message" could let the message be
> executed in different threadpool, which will be never deadlocked. In fact,
> the pools could share the threads but the non-blocking would have always a
> few threads spare.
> It's a bad solution as maintaining which message could block in the other
> node is really, really hard (we can be sure only in case of RPC responses),
> especially when some locks come. I will welcome anything better.

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)

_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev
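
P.S. A rough sketch of the future-plus-listener idea proposed at the top
of this mail. This is only an illustration under assumptions, not the
JGroups or Infinispan API: CompletableFuture.whenComplete stands in for
the future and its FutureListener, members are plain strings, and the
sendPut2 / sendResponse hooks as well as the class name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical sketch; none of these names are JGroups or Infinispan APIs.
class NonBlockingForwarder {
    // Pending PUT-2 futures per target member, so they can be cancelled when it leaves.
    private final Map<String, Set<CompletableFuture<String>>> pending = new ConcurrentHashMap<>();
    private final Function<String, CompletableFuture<String>> sendPut2; // target -> future response
    private final BiConsumer<String, String> sendResponse;              // (original sender, result)

    NonBlockingForwarder(Function<String, CompletableFuture<String>> sendPut2,
                         BiConsumer<String, String> sendResponse) {
        this.sendPut2 = sendPut2;
        this.sendResponse = sendResponse;
    }

    /** Runs on the incoming thread for PUT-1 and returns without waiting for PUT-2. */
    void onPut1(String originalSender, List<String> backupOwners) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String owner : backupOwners) {
            CompletableFuture<String> future = sendPut2.apply(owner);
            pending.computeIfAbsent(owner, k -> ConcurrentHashMap.newKeySet()).add(future);
            // Drop the bookkeeping entry once this PUT-2 completes, fails or is cancelled.
            future.whenComplete((result, error) -> untrack(owner, future));
            futures.add(future);
        }
        // The "listener": once all PUT-2s are finished, reply to the remembered sender.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0]))
            .whenComplete((ignored, error) ->
                sendResponse.accept(originalSender, error == null ? "OK" : "FAILED"));
    }

    /** View change: cancel and forget every future that was waiting on the departed member. */
    void onMemberLeft(String member) {
        Set<CompletableFuture<String>> futures = pending.remove(member);
        if (futures != null) {
            for (CompletableFuture<String> f : new ArrayList<>(futures)) {
                f.cancel(false);
            }
        }
    }

    int pendingCount() {
        return pending.values().stream().mapToInt(Set::size).sum();
    }

    private void untrack(String owner, CompletableFuture<String> future) {
        Set<CompletableFuture<String>> set = pending.get(owner);
        if (set != null) {
            set.remove(future);
        }
    }
}
```

The point is only that onPut1 returns immediately, so the thread that
delivered PUT-1 goes back to the pool while PUT-2 is in flight. This
variant waits for all backup owners; replying on the first response
would be a small change (anyOf instead of allOf).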