Re: [infinispan-dev] Threadpools in a large cluster

Pedro Ruivo Wed, 06 Feb 2013 11:29:46 -0800

Hi all,

Recently I came up with a solution that can help with the thread pool problem, motivated by the following:

In one of the first implementations of the Total Order based commit protocol (TO), I needed to move the PrepareCommand to another thread pool. In short, the TO protocol delivers the PrepareCommand in a deterministic order on all the nodes, using a single deliver thread. To ensure consistency, if it delivers two conflicting transactions, the second transaction must wait until the first one finishes. However, blocking the single deliver thread is not a good solution, because no other transaction can be validated while the thread is blocked, even if it does not conflict.

So, after creating a dependency graph (i.e. the second transaction knows that it must wait for the first transaction to finish), I move the PrepareCommand to another thread pool. Initially, I implemented a new command, called PrepareResponseCommand, that sends back the reply to the PrepareCommand. This solution has one disadvantage: I had to implement an ack collector in ISPN, while JGroups already offers me that with synchronous communication.
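
The dependency idea can be sketched roughly as follows. This is only an illustration, not the code from the Cloud-TM fork: the class and method names (DependencyChain, submit) are made up, and it assumes that two transactions conflict when they touch the same key. The deliver thread only records the dependencies and hands the work to a pool, so it never blocks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; not the ParallelTotalOrderManager from the fork.
class DependencyChain {
    // Last transaction that touched each key. Only the single deliver thread
    // accesses this map. Entries are never removed, to keep the sketch short.
    private final Map<String, CompletableFuture<Void>> lastByKey = new HashMap<>();
    private final ExecutorService pool;

    DependencyChain(ExecutorService pool) {
        this.pool = pool;
    }

    // Called by the deliver thread, in delivery order. Returns immediately.
    CompletableFuture<Void> submit(List<String> keys, Runnable validation) {
        List<CompletableFuture<Void>> deps = new ArrayList<>();
        for (String key : keys) {
            CompletableFuture<Void> previous = lastByKey.get(key);
            if (previous != null) {
                deps.add(previous);
            }
        }
        CompletableFuture<Void> done = CompletableFuture
                .allOf(deps.toArray(new CompletableFuture[0]))
                // A failed predecessor has still finished, so keep going.
                .<Void>handle((ignored, failure) -> null)
                .thenRunAsync(validation, pool);
        for (String key : keys) {
            lastByKey.put(key, done);
        }
        return done;
    }
}

class DependencyChainDemo {
    // T1 and T2 conflict on key "a"; T3 is independent and need not wait.
    static List<String> run() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        DependencyChain chain = new DependencyChain(pool);
        List<String> order = Collections.synchronizedList(new ArrayList<>());
        CompletableFuture<Void> t1 = chain.submit(Arrays.asList("a"), () -> {
            try {
                Thread.sleep(100); // simulate a slow validation
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            order.add("T1");
        });
        CompletableFuture<Void> t2 = chain.submit(Arrays.asList("a"), () -> order.add("T2"));
        CompletableFuture<Void> t3 = chain.submit(Arrays.asList("b"), () -> order.add("T3"));
        CompletableFuture.allOf(t1, t2, t3).join();
        pool.shutdown();
        return order;
    }

    public static void main(String[] args) {
        System.out.println(run()); // T1 always appears before T2
    }
}
```

T3 can run while T1 is still being validated, which is exactly what a blocked deliver thread would prevent.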

Recently (2 or 3 months ago) I implemented a simple modification in JGroups. In general terms, it allows other threads to reply to an RPC request (such as the PrepareCommand). In the previous scenario, I replaced the PrepareResponseCommand and the ack collector implementation with a synchronous RPC invocation. I've used this solution for other issues in the Cloud-TM's ISPN fork.

This solution is quite simple to implement and may help you to move the commands to ISPN internal thread pools. The modifications I've made are the following:

1) I added a new interface (see [1]) that is passed to the application instead of the Message object (see [4]). This interface contains the Message and has a method that allows the application to send the reply to that particular request.

2) I added a new object in [4] that means: this return value is not the reply to the RPC request. This is the value I return when I want to release the thread, because ISPN must return some object from the handle() method. Of course, I know that ISPN will invoke sendReply() somewhere else; otherwise, I will get a TimeoutException on the sender side.

3) I also changed the RequestCorrelator implementation to support the previous modifications (see [2] and [3]).
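
To make the three modifications concrete, here is a self-contained model of the mechanism. It does not use JGroups at all: MessageRequest, RequestHandler and the Correlator class below are simplified stand-ins, and the real signatures in [1]-[4] may differ. Only the names sendReply(), handle() and DO_NOT_REPLY are taken from this description; payload(), Correlator and DeferredReplyDemo are invented for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Stand-in for the interface in [1]: wraps the request and lets any thread reply.
interface MessageRequest {
    String payload();

    void sendReply(Object reply);
}

// Stand-in for [4], with the marker meaning "this return value is not the reply".
interface RequestHandler {
    Object DO_NOT_REPLY = new Object();

    Object handle(MessageRequest request);
}

// Plays the role of the receiving side of the RequestCorrelator ([2], [3]).
class Correlator {
    private final RequestHandler handler;

    Correlator(RequestHandler handler) {
        this.handler = handler;
    }

    // The returned future completes with whatever is sent back to the caller.
    CompletableFuture<Object> receive(String payload) {
        CompletableFuture<Object> reply = new CompletableFuture<>();
        MessageRequest request = new MessageRequest() {
            @Override
            public String payload() {
                return payload;
            }

            @Override
            public void sendReply(Object value) {
                reply.complete(value);
            }
        };
        Object result = handler.handle(request);
        if (result != RequestHandler.DO_NOT_REPLY) {
            reply.complete(result); // normal case: the return value is the reply
        }
        return reply; // otherwise some other thread must call sendReply()
    }
}

class DeferredReplyDemo {
    static Object roundTrip(String payload) throws Exception {
        ExecutorService workers = Executors.newSingleThreadExecutor();
        try {
            RequestHandler handler = request -> {
                if (request.payload().startsWith("prepare")) {
                    // Hand the command to another pool and release this thread.
                    workers.execute(() -> request.sendReply("prepared:" + request.payload()));
                    return RequestHandler.DO_NOT_REPLY;
                }
                return "ok:" + request.payload();
            };
            // The timeout models the TimeoutException the sender would see
            // if sendReply() were never invoked.
            return new Correlator(handler).receive(payload).get(1, TimeUnit.SECONDS);
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("ping"));        // prints "ok:ping"
        System.out.println(roundTrip("prepare-tx1")); // prints "prepared:prepare-tx1"
    }
}
```

From the sender's point of view both calls are ordinary synchronous RPCs; only the receiver knows that the second reply came from a different thread.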

In the Cloud-TM's ISPN fork I added a reference to [1] in the BaseCacheRpcCommand and added the method sendReply() [5]. In addition, I have the following use cases working perfectly with this:


1) Total Order

The scenario described at the beginning. The ParallelTotalOrderManager returns the DO_NOT_REPLY object when it receives a remote PrepareCommand (see [6] line 77). When the PrepareCommand is finally processed by the rest of the interceptor chain, it invokes PrepareCommand.sendReply() (see [6] line 230).


2) GMU remote get

GMU ensures the SERIALIZABLE isolation level, and a remote get must ensure that the node processing the request has a minimum version available to ensure data consistency. The problem with our initial implementation in a large cluster is that the number of remote gets is very high and all the OOB threads end up blocked on this condition.

I've done the same thing with the ClusteredRemoteGet, as you can see in [7], lines 93 and 105.


3) GMU CommitCommand

In GMU, the CommitCommand cannot be processed in an arbitrary order. If T1 is serialized before T2, the commit command of T1 must be processed before the commit command of T2, even if the transactions do not have conflicts. This creates the same problem as above, and the same solution was adopted.

I know that you have discussed some solutions, and I would like to know your opinion about what I've described.


If you have questions, please let me know.

Cheers,
Pedro

[1] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/MessageRequest.java
[2] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L463
[3] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L495
[4] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestHandler.java
[5] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/commands/remote/BaseRpcCommand.java#L75
[6] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/transaction/totalorder/ParallelTotalOrderManager.java
[7] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/commands/remote/GMUClusteredGetCommand.java



On 2/3/13 11:35 AM, Bela Ban wrote:

If you send me the details, I'll take a look. I'm pretty busy with
message batching, so I can't promise next week, but soon...

On 2/1/13 11:08 AM, Pedro Ruivo wrote:

Hi,

I had a similar problem when I tried GMU[1] in a "large" cluster (40 VMs),
because the remote gets and the commit messages (I'm talking about ISPN
commands) must wait for some conditions before being processed.

I solved this problem by adding a feature in JGroups[2] that allows the
request to be moved to another thread, releasing the OOB thread. The
other thread will send the reply to the JGroups Request. Of course, I'm
only moving commands that I know can block.

I can go into more detail if you want =)

Cheers,
Pedro

[1] http://www.gsd.inesc-id.pt/~romanop/files/papers/icdcs12.pdf
[2] I would like to talk with Bela about this, because it makes my life
easier to support total order in ISPN. I'll try to send an email this
weekend =)

On 01-02-2013 08:04, Radim Vansa wrote:

Hi guys,

after dealing with the large cluster for a while, I find the way we use OOB 
threads in a synchronous configuration not robust.
Imagine a situation where a node which is not an owner of the key calls PUT. Then 
an RPC is sent to the primary owner of that key, which reroutes the 
request to all other owners and, after these reply, replies back.
There are two problems:
1) If we simultaneously send X requests from non-owners to the primary owner, where 
X is the OOB TP size, all the OOB threads are waiting for the responses and there 
is no thread left to process an OOB response and release a thread.
2) Node A is primary owner of keyA and non-primary owner of keyB, and B is primary 
owner of keyB and non-primary owner of keyA. We get many requests for both keyA and keyB 
from other nodes; therefore, all OOB threads on both nodes call an RPC on the 
non-primary owner, but there's no one who could process the request.
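
Problem 1 can be reproduced in isolation with a plain Java thread pool. The sketch below is not Infinispan or JGroups code: a fixed-size executor stands in for the OOB pool, and PoolStarvationDemo and run are made-up names. Each of the X requests needs a nested "reply" task to run on the same pool, so every one of them gives up only when its timeout expires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class PoolStarvationDemo {
    // Runs poolSize concurrent requests on a pool of poolSize threads and
    // returns how many of them timed out waiting for their nested reply.
    static int run(int poolSize, long timeoutMillis) throws Exception {
        ExecutorService oob = Executors.newFixedThreadPool(poolSize);
        CountDownLatch allStarted = new CountDownLatch(poolSize);
        CountDownLatch allGaveUp = new CountDownLatch(poolSize);
        List<Future<Boolean>> requests = new ArrayList<>();
        for (int i = 0; i < poolSize; i++) {
            requests.add(oob.submit(() -> {
                allStarted.countDown();
                allStarted.await(); // from here on, every pool thread is busy
                // The reply needs a thread from the same pool, so it is queued.
                Future<String> reply = oob.submit(() -> "reply");
                boolean timedOut;
                try {
                    reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    timedOut = false;
                } catch (TimeoutException e) {
                    timedOut = true;
                }
                // Hold the thread until every request is done waiting, so the
                // outcome does not depend on scheduling.
                allGaveUp.countDown();
                allGaveUp.await();
                return timedOut;
            }));
        }
        int timedOut = 0;
        for (Future<Boolean> request : requests) {
            if (request.get()) {
                timedOut++;
            }
        }
        oob.shutdown();
        return timedOut;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timed out: " + run(4, 200) + " of 4");
    }
}
```

With a pool one thread larger than the number of concurrent requests the replies get through, which is why reserving a few threads for non-blocking work is attractive.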

While we wait for the requests to timeout, the nodes with depleted OOB 
threadpools start suspecting all other nodes because they can't receive 
heartbeats etc...

You can say "increase your OOB tp size", but that's not always an option; I 
currently have it set to 1000 threads and it's not enough. In the end, I will always be 
limited by RAM, and something tells me that even nodes with a few gigs of RAM should be able 
to form a huge cluster. We use 160 hotrod worker threads in JDG, which means that 160 * 
clusterSize = 10240 (64 nodes in my cluster) parallel requests can be executed, and if 
10% of them target the same node with 1000 OOB threads, it gets stuck. It's about scaling and 
robustness.

Not that I have a good solution, but I'd really like to start a discussion.
Thinking about it a bit, the problem is that a blocking call (calling an RPC on the primary owner 
from a message handler) can block non-blocking calls (such as an RPC response or a command that 
never sends any more messages). Therefore, a flag on the message saying "this won't send 
another message" could let the message be executed in a different threadpool, which 
would never deadlock. In fact, the pools could share threads, but the non-blocking 
one would always have a few threads to spare.
It's a bad solution, as tracking which messages could block on the other node 
is really, really hard (we can be sure only in the case of RPC responses), 
especially when locks are involved. I will welcome anything better.

Radim


-----------------------------------------------------------
Radim Vansa
Quality Assurance Engineer
JBoss Datagrid
tel. +420532294559 ext. 62559

Red Hat Czech, s.r.o.
Brno, Purkyňova 99/71, PSČ 612 45
Czech Republic


_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev
