Hi Pedro, this is almost exactly what I wanted to implement!
Question: in RequestCorrelator.handleRequest(), you create a MessageRequestImpl and pass it to the RequestHandler:

    protected void handleRequest(Message req, Header hdr) {
        Object retval;
        boolean threwException = false;
        MessageRequest messageRequest = new MessageRequestImpl(req, hdr);
        try {
            retval = request_handler.handle(messageRequest);
        }
        catch(Throwable t) {
            retval = t;
            threwException = true;
        }
        messageRequest.sendReply(retval, threwException); // <-- should be moved up, or called only if threwException == true
    }

The request handler then (possibly) dispatches the request to a thread pool and calls MessageRequestImpl.sendReply() when done. However, you also call MessageRequest.sendReply() before returning from handleRequest(). I think this is an error: sendReply() should be moved up into the catch clause, or be called only if threwException is true, so that we send a reply on behalf of the RequestHandler if and only if it threw an exception (e.g. before it dispatched the request to a thread pool). Otherwise we'd send a reply *twice*!

A few changes I have in mind (I need to think about them more):
- I want to leave the existing RequestHandler interface in place, so current implementations continue to work.
- There will be a new AsyncRequestHandler interface (possibly extending RequestHandler, so an implementation can decide to implement both). The RequestCorrelator needs to have either request_handler or async_request_handler set. If the former is set, the logic is unchanged; if the latter is set, I'll invoke the async dispatching code.
- AsyncRequestHandler will look similar to the following:

      void handle(Message request, Handback hb, boolean requires_response) throws Throwable;

- Handback is an interface, and its impl contains header information (e.g.
request ID).
- Handback has a sendReply(Object reply, boolean is_exception) method which sends a response (or exception) back to the caller.
- When requires_response is false, the AsyncRequestHandler doesn't need to invoke sendReply().
- Message batching: the above interfaces need to take message batching into account, e.g. the ability to handle multiple requests concurrently (if they don't need to be executed sequentially).

Thoughts?

On 2/6/13 8:29 PM, Pedro Ruivo wrote:
> Hi all,
>
> Recently I came up with a solution that can help with the thread pool
> problem, motivated by the following:
>
> In one of the first implementations of the Total Order based commit
> protocol (TO), I had the requirement to move the PrepareCommand to
> another thread pool. In short, the TO protocol delivers the
> PrepareCommand in a deterministic order on all the nodes, by a single
> deliver thread. To ensure consistency, if it delivers two conflicting
> transactions, the second transaction must wait until the first
> transaction finishes. However, blocking the single deliver thread is
> not a good solution, because no more transactions can be validated,
> even if they don't conflict, while the thread is blocked.
>
> So, after creating a dependency graph (i.e. the second transaction
> knows that it must wait for the first transaction to finish), I move
> the PrepareCommand to another thread pool. Initially, I implemented a
> new command, called PrepareResponseCommand, that sends back the reply
> of the PrepareCommand. This solution has one disadvantage: I had to
> implement an ack collector in ISPN, while JGroups already offers me
> that with synchronous communication.
>
> Recently (2 or 3 months ago) I implemented a simple modification in
> JGroups. In a more generic approach, it allows other threads to reply
> to an RPC request (such as the PrepareCommand). In the previous
> scenario, I replaced the PrepareResponseCommand and the ack collector
> implementation with a synchronous RPC invocation.
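As a rough illustration of the AsyncRequestHandler / Handback pair proposed above (this is only a sketch, not the actual JGroups API: String stands in for Message, and the executor plumbing is purely hypothetical):

```java
import java.util.concurrent.*;

// Sketch of the proposed API (String stands in for org.jgroups.Message
// to keep the example self-contained).
interface Handback {
    // Sends a reply (or an exception) back to the original caller.
    void sendReply(Object reply, boolean is_exception);
}

interface AsyncRequestHandler {
    // When requires_response is false, the handler need not call sendReply().
    void handle(String request, Handback hb, boolean requires_response) throws Throwable;
}

public class AsyncDispatchSketch {
    // Illustrative handler: hands the request to a pool and replies when done,
    // so the calling (e.g. OOB) thread is released immediately.
    public static Object dispatchAndAwait(String request) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CompletableFuture<Object> replyFuture = new CompletableFuture<>();
        Handback hb = (reply, isException) -> replyFuture.complete(reply);

        AsyncRequestHandler handler = (req, handback, requiresResponse) ->
                pool.submit(() -> handback.sendReply("echo:" + req, false));

        try {
            handler.handle(request, hb, true); // returns at once; reply arrives later
        } catch (Throwable t) {
            replyFuture.completeExceptionally(t);
        }
        Object reply = replyFuture.get(5, TimeUnit.SECONDS);
        pool.shutdown();
        return reply;
    }
}
```

The point of the shape is that handle() returns immediately, and the reply can be produced by any thread via the handback.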
> I've used this solution for other issues in the Cloud-TM ISPN fork.
>
> This solution is quite simple to implement and may help you to move
> the commands to ISPN internal thread pools. The modifications I've
> made are the following:
>
> 1) I added a new interface (see [1]) that is passed to the application
> instead of the Message object (see [4]). This interface contains the
> Message and has a method that allows the application to send the reply
> to that particular request.
> 2) I added a new object in [4] with the meaning: "this return value is
> not the reply to the RPC request". This is the value I return when I
> want to release the thread, because ISPN must return some object from
> the handle() method. Of course, ISPN will invoke sendReply() somewhere
> else later; otherwise, I will get a TimeoutException on the sender side.
> 3) I also changed the RequestCorrelator implementation to support the
> previous modifications (see [2] and [3]).
>
> In the Cloud-TM ISPN fork I added a reference in the
> BaseCacheRpcCommand to [1] and added the method sendReply() [5]. In
> addition, I have the following use cases working perfectly with this:
>
> 1) Total Order
>
> The scenario described at the beginning. The ParallelTotalOrderManager
> returns the DO_NOT_REPLY object when it receives a remote
> PrepareCommand (see [6] line 77). When the PrepareCommand has finally
> been processed by the rest of the interceptor chain, it invokes
> PrepareCommand.sendReply() (see [6] line 230).
>
> 2) GMU remote get
>
> GMU ensures the SERIALIZABLE isolation level, and remote gets must
> ensure that the node processing the request has a minimum version
> available to guarantee data consistency. The problem in our initial
> implementation on large clusters is that the number of remote gets is
> very high and all the OOB threads are being blocked on this condition.
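If I read modifications (1) and (2) correctly, the pattern boils down to something like the sketch below: the handler queues the real work and returns a special marker to release the deliver thread, and a worker later calls sendReply() on the request object. All types here are simplified stand-ins for the classes linked in [1] and [4], not the real code:

```java
import java.util.concurrent.*;

// Simplified stand-in for the MessageRequest interface from [1]:
// wraps the request and lets any thread send the reply later.
interface MessageRequest {
    String payload();
    void sendReply(Object reply, boolean is_exception);
}

public class DeferredReplySketch {
    // Stand-in for the special return value from [4]: "this is not the reply".
    public static final Object DO_NOT_REPLY = new Object();

    // What the application's handle() might do: queue the real work and
    // release the (single) deliver thread immediately.
    public static Object handle(MessageRequest req, ExecutorService pool) {
        pool.submit(() -> req.sendReply("processed:" + req.payload(), false));
        return DO_NOT_REPLY; // the correlator must NOT send this as the reply
    }
}
```

The correlator side then simply skips sending a reply when it sees the marker, avoiding the double-reply problem mentioned at the top of this mail.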
>
> Same thing I've done with the ClusteredRemoteGet, as you can see in
> [7], lines 93 and 105.
>
> 3) GMU CommitCommand
>
> In GMU, the CommitCommand cannot be processed in an arbitrary order.
> If T1 is serialized before T2, the commit command of T1 must be
> processed before the commit command of T2, even if the transactions do
> not conflict. This creates the same problem as above, and the same
> solution was adopted.
>
> I know that you have discussed some solutions and I would like to know
> your opinion about what I've described.
>
> If you have questions, please let me know.
>
> Cheers,
> Pedro
>
> [1] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/MessageRequest.java
> [2] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L463
> [3] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L495
> [4] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestHandler.java
> [5] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/commands/remote/BaseRpcCommand.java#L75
> [6] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/transaction/totalorder/ParallelTotalOrderManager.java
> [7] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/commands/remote/GMUClusteredGetCommand.java
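The ordering requirement in use case (3), where T1's commit must be processed before T2's without blocking the deliver thread, could be approximated by chaining each commit onto its predecessor's completion. This is purely illustrative and not the GMU implementation; all names below are hypothetical:

```java
import java.util.*;
import java.util.concurrent.*;

public class OrderedCommitSketch {
    // Completion of the previously delivered commit; each new commit is
    // chained onto it, so commits execute in delivery (serialization)
    // order without ever blocking the deliver thread.
    private CompletableFuture<Void> previous = CompletableFuture.completedFuture(null);
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    public final List<String> committed = Collections.synchronizedList(new ArrayList<>());

    // Called by the deliver thread; returns immediately.
    public void deliverCommit(String tx) {
        previous = previous.thenRunAsync(() -> committed.add(tx), pool);
    }

    // Waits until all chained commits have been processed.
    public void await() throws Exception {
        previous.get(5, TimeUnit.SECONDS);
        pool.shutdown();
    }
}
```

The reply for each commit would then be sent from the pool thread via the deferred-reply mechanism described above, once its turn in the chain comes.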
>
> On 2/3/13 11:35 AM, Bela Ban wrote:
>> If you send me the details, I'll take a look. I'm pretty busy with
>> message batching, so I can't promise next week, but soon...
>>
>> On 2/1/13 11:08 AM, Pedro Ruivo wrote:
>>> Hi,
>>>
>>> I had a similar problem when I tried GMU [1] in a "large" cluster
>>> (40 VMs), because the remote gets and the commit messages (I'm
>>> talking about ISPN commands) must wait for some conditions before
>>> being processed.
>>>
>>> I solved this problem by adding a feature to JGroups [2] that allows
>>> the request to be moved to another thread, releasing the OOB thread.
>>> The other thread will send the reply to the JGroups request. Of
>>> course, I'm only moving commands that I know can block.
>>>
>>> I can go into some detail if you want =)
>>>
>>> Cheers,
>>> Pedro
>>>
>>> [1] http://www.gsd.inesc-id.pt/~romanop/files/papers/icdcs12.pdf
>>> [2] I would like to talk with Bela about this, because it makes my
>>> life easier to support total order in ISPN. I'll try to send an email
>>> this weekend =)
>>>
>>> On 01-02-2013 08:04, Radim Vansa wrote:
>>>> Hi guys,
>>>>
>>>> After dealing with a large cluster for a while, I find the way we
>>>> use OOB threads in a synchronous configuration non-robust.
>>>> Imagine a situation where a node which is not an owner of a key
>>>> calls PUT. Then an RPC is sent to the primary owner of that key,
>>>> which reroutes the request to all other owners and, after these
>>>> reply, replies back. There are two problems:
>>>> 1) If we make X simultaneous requests from non-owners to the primary
>>>> owner, where X is the OOB TP size, all the OOB threads are waiting
>>>> for the responses and there is no thread left to process an OOB
>>>> response and release a thread.
>>>> 2) Node A is primary owner of keyA and non-primary owner of keyB,
>>>> and B is primary of keyB and non-primary of keyA.
>>>> We got many requests for both keyA and keyB from other nodes;
>>>> therefore, all OOB threads on both nodes call an RPC to the
>>>> non-primary owner, but there's no one left who could process the
>>>> request.
>>>>
>>>> While we wait for the requests to time out, the nodes with depleted
>>>> OOB thread pools start suspecting all other nodes because they can't
>>>> receive heartbeats etc...
>>>>
>>>> You can say "increase your OOB tp size", but that's not always an
>>>> option; I have currently set it to 1000 threads and it's not enough.
>>>> In the end, I will always be limited by RAM, and something tells me
>>>> that even nodes with a few gigs of RAM should be able to form a huge
>>>> cluster. We use 160 HotRod worker threads in JDG; that means that
>>>> 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests
>>>> can be executed, and if 10% target the same node with 1000 OOB
>>>> threads, it gets stuck. It's about scaling and robustness.
>>>>
>>>> Not that I have any good solution, but I'd really like to start a
>>>> discussion.
>>>> Thinking about it a bit, the problem is that a blocking call
>>>> (calling an RPC on the primary owner from the message handler) can
>>>> block non-blocking calls (such as an RPC response or a command that
>>>> never sends any more messages). Therefore, having a flag on the
>>>> message saying "this won't send another message" could let the
>>>> message be executed in a different thread pool, which would never be
>>>> deadlocked. In fact, the pools could share the threads, but the
>>>> non-blocking one would always have a few threads spare.
>>>> It's a bad solution, as maintaining which messages could block on
>>>> the other node is really, really hard (we can be sure only in the
>>>> case of RPC responses), especially when locks come into play. I will
>>>> welcome anything better.
>>>>
>>>> Radim
>>>>
>>>> -----------------------------------------------------------
>>>> Radim Vansa
>>>> Quality Assurance Engineer
>>>> JBoss Datagrid
>>>> tel. +420532294559 ext.
>>>> 62559
>>>>
>>>> Red Hat Czech, s.r.o.
>>>> Brno, Purkyňova 99/71, PSČ 612 45
>>>> Czech Republic
>>>>
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev@lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev

--
Bela Ban, JGroups lead (http://www.jgroups.org)