Very interesting. However, I presume this would be something for Infinispan 6.0? Any thoughts on backward compat?
On 7 Feb 2013, at 04:53, Bela Ban <b...@redhat.com> wrote:

> Hi Pedro,
>
> this is almost exactly what I wanted to implement!
>
> Question:
> - In RequestCorrelator.handleRequest():
>
>     protected void handleRequest(Message req, Header hdr) {
>         Object retval;
>         boolean threwException = false;
>         MessageRequest messageRequest = new MessageRequestImpl(req, hdr);
>         try {
>             retval = request_handler.handle(messageRequest);
>         }
>         catch(Throwable t) {
>             retval = t;
>             threwException = true;
>         }
>         // <-- should be moved up, or called only if threwException == true
>         messageRequest.sendReply(retval, threwException);
>     }
>
> you create a MessageRequestImpl and pass it to the RequestHandler. The
> request handler then dispatches the request (possibly) to a thread pool
> and calls MessageRequestImpl.sendReply() when done.
>
> However, you also call MessageRequest.sendReply() before returning from
> handleRequest(). I think this is an error: MessageRequest.sendReply()
> should be moved up inside the catch clause, or be called only if
> threwException is true, so that we send a reply on behalf of the
> RequestHandler if and only if it threw an exception (e.g. before it
> dispatches the request to a thread pool). Otherwise, we'd send a reply
> *twice*!
>
> A few changes I have in mind (I need to think about it more):
>
> - I want to leave the existing RequestHandler interface in place, so
>   current implementations continue to work
> - There will be a new AsyncRequestHandler interface (possibly extending
>   RequestHandler, so an implementation can decide to implement both).
>   The RequestCorrelator needs to have either request_handler or
>   async_request_handler set. If the former is set, the logic is
>   unchanged.
> If the latter is set, I'll invoke the async dispatching code.
>
> - AsyncRequestHandler will look similar to the following:
>   void handle(Message request, Handback hb, boolean requires_response) throws Throwable;
> - Handback is an interface, and its impl contains header information
>   (e.g. request ID)
> - Handback has a sendReply(Object reply, boolean is_exception) method
>   which sends a response (or exception) back to the caller
> - When requires_response is false, the AsyncRequestHandler doesn't need
>   to invoke sendReply()
>
> - Message batching
> - The above interfaces need to take message batching into account, e.g.
>   the ability to handle multiple requests concurrently (if they don't
>   need to be executed sequentially)
>
> Thoughts?
>
> On 2/6/13 8:29 PM, Pedro Ruivo wrote:
>> Hi all,
>>
>> Recently I came up with a solution that can help with the thread pool
>> problem, motivated by the following:
>>
>> In one of the first implementations of the Total Order based commit
>> protocol (TO), I had the requirement to move the PrepareCommand to
>> another thread pool. In short, the TO protocol delivers the
>> PrepareCommand in a deterministic order on all nodes, via a single
>> deliver thread. To ensure consistency, if it delivers two conflicting
>> transactions, the second transaction must wait until the first
>> transaction finishes. However, blocking the single deliver thread is
>> not a good solution, because while the thread is blocked no more
>> transactions can be validated, even if they don't conflict.
>>
>> So, after creating a dependency graph (i.e. the second transaction
>> knows that it must wait for the first transaction to finish), I move
>> the PrepareCommand to another thread pool. Initially, I implemented a
>> new command, called PrepareResponseCommand, that sends back the reply
>> of the PrepareCommand.
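Bela's proposed AsyncRequestHandler/Handback API could be sketched roughly as follows. Only the two signatures quoted above come from the proposal; the placeholder Message class, the executor, and the demo wiring are assumptions made purely for illustration:

```java
import java.util.concurrent.*;

// Placeholder for org.jgroups.Message; only here to make the sketch self-contained.
class Message {
    final Object payload;
    Message(Object payload) { this.payload = payload; }
}

// Carries the request ID/header info in a real impl; here just a callback.
interface Handback {
    // Sends a response (or an exception) back to the original caller.
    void sendReply(Object reply, boolean is_exception);
}

interface AsyncRequestHandler {
    // May hand the request off to a thread pool and return immediately;
    // the pool thread later calls hb.sendReply(). When requires_response
    // is false, sendReply() need not be invoked at all.
    void handle(Message request, Handback hb, boolean requires_response) throws Throwable;
}

public class AsyncDispatchDemo {
    static Object roundTrip(String payload) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletableFuture<Object> replyFuture = new CompletableFuture<>();

        // A real Handback would send the reply over the wire using the
        // stored request ID; here we just complete a future.
        Handback hb = (reply, isException) -> replyFuture.complete(reply);

        // Handler that frees the delivery thread by moving work to the pool.
        AsyncRequestHandler handler = (request, handback, requiresResponse) ->
            pool.execute(() -> handback.sendReply("echo: " + request.payload, false));

        try {
            handler.handle(new Message(payload), hb, true);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
        Object reply = replyFuture.get(5, TimeUnit.SECONDS);
        pool.shutdown();
        return reply;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello")); // prints "echo: hello"
    }
}
```

The point of the design is that handleRequest() never blocks: the delivery (OOB) thread returns as soon as the work is queued, and the reply is correlated later via the Handback.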
>> This solution has one disadvantage: I had to implement an ack
>> collector in ISPN, while JGroups already offers me that with a
>> synchronous communication.
>>
>> Recently (2 or 3 months ago) I implemented a simple modification in
>> JGroups. In a more generic approach, it allows other threads to reply
>> to an RPC request (such as the PrepareCommand). In the previous
>> scenario, I replaced the PrepareResponseCommand and the ack collector
>> implementation with a synchronous RPC invocation. I've used this
>> solution for other issues in the Cloud-TM ISPN fork.
>>
>> This solution is quite simple to implement and may help you to move
>> the commands to ISPN internal thread pools. The modifications I've
>> made are the following:
>>
>> 1) I added a new interface (see [1]) that is passed to the application
>> instead of the Message object (see [4]). This interface contains the
>> Message and has a method that allows the application to send the reply
>> to that particular request.
>> 2) I added a new object in [4] with the meaning: "this return value is
>> not the reply to the RPC request". This is the value I return when I
>> want to release the thread, because ISPN must return some object from
>> the handle() method. Of course, ISPN will invoke sendReply() somewhere
>> else later; otherwise I would get a TimeoutException on the sender
>> side.
>> 3) I also changed the RequestCorrelator implementation to support the
>> previous modifications (see [2] and [3]).
>>
>> In the Cloud-TM ISPN fork I added a reference in the
>> BaseCacheRpcCommand to [1] and added the method sendReply() [5]. In
>> addition, I have the following use cases working perfectly with this:
>>
>> 1) Total Order
>>
>> The scenario described at the beginning. The ParallelTotalOrderManager
>> returns the DO_NOT_REPLY object when it receives a remote
>> PrepareCommand (see [6], line 77).
>> When the PrepareCommand is finally processed by the rest of the
>> interceptor chain, it invokes PrepareCommand.sendReply() (see [6],
>> line 230).
>>
>> 2) GMU remote get
>>
>> GMU ensures the SERIALIZABLE isolation level, and remote gets must
>> ensure that the node processing the request has a minimum version
>> available to guarantee data consistency. The problem in our initial
>> implementation on a large cluster is that the number of remote gets
>> is very high, and all the OOB threads end up blocked on this
>> condition.
>>
>> I did the same thing with the ClusteredRemoteGet, as you can see in
>> [7], lines 93 and 105.
>>
>> 3) GMU CommitCommand
>>
>> In GMU, the CommitCommand cannot be processed in an arbitrary order.
>> If T1 is serialized before T2, the commit command of T1 must be
>> processed before the commit command of T2, even if the transactions
>> do not conflict. This creates the same problem as above, and the same
>> solution was adopted.
>>
>> I know that you have discussed some solutions, and I would like to
>> know your opinion about what I've described.
>>
>> If you have questions, please let me know.
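The handler-side pattern Pedro describes — returning a marker object so the correlator releases the delivery thread, with another thread calling sendReply() later — might look roughly like this. The names MessageRequest, sendReply() and DO_NOT_REPLY follow [1] and [4], but every type below is a self-contained placeholder, not the actual JGroups code:

```java
import java.util.concurrent.*;

// Sketch of Pedro's modification: handle() may return a marker meaning
// "this is not the reply; another thread will call sendReply() later".
public class DoNotReplyDemo {
    // Marker object, per change (2) in the mail above.
    static final Object DO_NOT_REPLY = new Object();

    // Placeholder for the interface in [1]: wraps the Message and lets
    // any thread reply to this particular request.
    interface MessageRequest {
        void sendReply(Object reply, boolean is_exception);
    }

    // Application-side handler: releases the OOB/deliver thread at once;
    // the pool thread sends the real reply when the blocking work is done.
    static Object handle(MessageRequest req, ExecutorService pool) {
        pool.execute(() -> req.sendReply("committed", false));
        return DO_NOT_REPLY; // the correlator must NOT forward this value
    }

    static Object demo() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CompletableFuture<Object> wire = new CompletableFuture<>();
        Object ret = handle((reply, isEx) -> wire.complete(reply), pool);
        // Correlator-side check: only real return values go on the wire.
        Object onWire = (ret == DO_NOT_REPLY) ? wire.get(5, TimeUnit.SECONDS) : ret;
        pool.shutdown();
        return onWire;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints "committed"
    }
}
```

As the mail notes, the contract is that whoever returns DO_NOT_REPLY takes on the obligation to call sendReply() eventually; otherwise the sender times out.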
>>
>> Cheers,
>> Pedro
>>
>> [1] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/MessageRequest.java
>> [2] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L463
>> [3] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L495
>> [4] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestHandler.java
>> [5] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/commands/remote/BaseRpcCommand.java#L75
>> [6] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/transaction/totalorder/ParallelTotalOrderManager.java
>> [7] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/commands/remote/GMUClusteredGetCommand.java
>>
>> On 2/3/13 11:35 AM, Bela Ban wrote:
>>> If you send me the details, I'll take a look. I'm pretty busy with
>>> message batching, so I can't promise next week, but soon...
>>>
>>> On 2/1/13 11:08 AM, Pedro Ruivo wrote:
>>>> Hi,
>>>>
>>>> I had a similar problem when I tried GMU [1] on a "large" cluster
>>>> (40 VMs), because the remote gets and the commit messages (I'm
>>>> talking about ISPN commands) must wait for some conditions before
>>>> being processed.
>>>>
>>>> I solved this problem by adding a feature in JGroups [2] that
>>>> allows the request to be moved to another thread, releasing the OOB
>>>> thread. The other thread then sends the reply to the JGroups
>>>> request. Of course, I only move commands that I know can block.
>>>>
>>>> I can go into more detail if you want =)
>>>>
>>>> Cheers,
>>>> Pedro
>>>>
>>>> [1] http://www.gsd.inesc-id.pt/~romanop/files/papers/icdcs12.pdf
>>>> [2] I would like to talk with Bela about this, because it makes my
>>>> life easier to support total order in ISPN. I'll try to send an
>>>> email this weekend =)
>>>>
>>>> On 01-02-2013 08:04, Radim Vansa wrote:
>>>>> Hi guys,
>>>>>
>>>>> after dealing with a large cluster for a while, I find the way we
>>>>> use OOB threads in a synchronous configuration non-robust.
>>>>> Imagine a situation where a node which is not an owner of the key
>>>>> calls PUT. An RPC is then made to the primary owner of that key,
>>>>> which reroutes the request to all other owners and, after these
>>>>> reply, replies back.
>>>>> There are two problems:
>>>>> 1) If we make X simultaneous requests from non-owners to the
>>>>> primary owner, where X is the OOB TP size, all the OOB threads are
>>>>> waiting for the responses and there is no thread left to process
>>>>> the OOB response and release a thread.
>>>>> 2) Node A is primary owner of keyA and non-primary owner of keyB,
>>>>> and B is primary owner of keyB and non-primary owner of keyA. We
>>>>> get many requests for both keyA and keyB from other nodes;
>>>>> therefore, all OOB threads on both nodes call RPCs to the
>>>>> non-primary owner, but there is no one left who could process the
>>>>> requests.
>>>>>
>>>>> While we wait for the requests to time out, the nodes with
>>>>> depleted OOB thread pools start suspecting all other nodes because
>>>>> they can't receive heartbeats etc...
>>>>>
>>>>> You can say "increase your OOB TP size", but that's not always an
>>>>> option; I currently have it set to 1000 threads and it's not
>>>>> enough. In the end, I will always be limited by RAM, and something
>>>>> tells me that even nodes with a few gigs of RAM should be able to
>>>>> form a huge cluster. We use 160 HotRod worker threads in JDG; that
>>>>> means 160 * clusterSize = 10240 (64 nodes in my cluster) parallel
>>>>> requests can be executed, and if 10% target the same node with
>>>>> 1000 OOB threads, it gets stuck. It's about scaling and
>>>>> robustness.
>>>>>
>>>>> Not that I have any good solution, but I'd really like to start a
>>>>> discussion.
>>>>> Thinking about it a bit, the problem is that a blocking call
>>>>> (calling an RPC on the primary owner from the message handler) can
>>>>> block non-blocking calls (such as an RPC response or a command
>>>>> that never sends any more messages). Therefore, having a flag on
>>>>> the message saying "this won't send another message" could let the
>>>>> message be executed in a different thread pool, which would never
>>>>> deadlock. In fact, the pools could share the threads, but the
>>>>> non-blocking one would always have a few threads spare.
>>>>> It's a bad solution, as knowing which messages could block on the
>>>>> other node is really, really hard (we can be sure only in the case
>>>>> of RPC responses), especially when locks come into play. I will
>>>>> welcome anything better.
>>>>>
>>>>> Radim
>>>>>
>>>>> -----------------------------------------------------------
>>>>> Radim Vansa
>>>>> Quality Assurance Engineer
>>>>> JBoss Datagrid
>>>>> tel. +420532294559 ext. 62559
>>>>>
>>>>> Red Hat Czech, s.r.o.
>>>>> Brno, Purkyňova 99/71, PSČ 612 45
>>>>> Czech Republic
>>
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)

--
Manik Surtani
ma...@jboss.org
twitter.com/maniksurtani

Platform Architect, JBoss Data Grid
http://red.ht/data-grid

_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev
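Radim's flag idea from earlier in the thread — routing messages that are guaranteed not to send further messages into their own pool, so responses can always be processed even when the blocking pool is exhausted — could be sketched as follows. This is entirely illustrative: no such flag or dispatcher exists in JGroups, and the pool sizes and names are made up:

```java
import java.util.concurrent.*;

// Hypothetical two-pool dispatcher for Radim's "this won't send another
// message" flag: non-blocking work (e.g. RPC responses) gets dedicated
// threads, so exhausting the blocking pool cannot deadlock responses.
public class TwoPoolDispatcher {
    private final ExecutorService blockingPool = Executors.newFixedThreadPool(8);
    private final ExecutorService nonBlockingPool = Executors.newFixedThreadPool(2);

    void dispatch(Runnable handler, boolean wontSendMoreMessages) {
        // Flagged messages go to the reserved pool; everything else may
        // block on further RPCs, so it runs on the ordinary pool.
        (wontSendMoreMessages ? nonBlockingPool : blockingPool).execute(handler);
    }

    void shutdown() {
        blockingPool.shutdown();
        nonBlockingPool.shutdown();
    }

    static String demo() throws Exception {
        TwoPoolDispatcher d = new TwoPoolDispatcher();
        CompletableFuture<String> done = new CompletableFuture<>();
        // An RPC response never sends more messages, so it is flagged.
        d.dispatch(() -> done.complete("response handled"), true);
        String result = done.get(5, TimeUnit.SECONDS);
        d.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints "response handled"
    }
}
```

As Radim notes, the hard part is not the dispatching but knowing at the sender which messages are truly non-blocking; only RPC responses are safe to flag unconditionally.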