Hi Pedro, this is almost exactly what I wanted to implement!
Question: in RequestCorrelator.handleRequest(), you create a MessageRequestImpl and pass it to the RequestHandler:

    protected void handleRequest(Message req, Header hdr) {
        Object retval;
        boolean threwException = false;
        MessageRequest messageRequest = new MessageRequestImpl(req, hdr);
        try {
            retval = request_handler.handle(messageRequest);
        }
        catch(Throwable t) {
            retval = t;
            threwException = true;
        }
        messageRequest.sendReply(retval, threwException); // <-- should be moved up, or called only if threwException == true
    }

The request handler then (possibly) dispatches the request to a thread pool and calls MessageRequestImpl.sendReply() when done. However, you also call MessageRequest.sendReply() before returning from handleRequest(). I think this is an error: sendReply() should be moved up into the catch clause, or be called only if threwException is true, so that we send a reply on behalf of the RequestHandler if and only if it threw an exception (e.g. before it dispatched the request to a thread pool). Otherwise we'd send a reply *twice*!

A few changes I have in mind (I need to think about them more):
- I want to leave the existing RequestHandler interface in place, so current implementations continue to work.
- There will be a new AsyncRequestHandler interface (possibly extending RequestHandler, so an implementation can decide to implement both). The RequestCorrelator needs to have either request_handler or async_request_handler set. If the former is set, the logic is unchanged; if the latter is set, I'll invoke the async dispatching code.
- AsyncRequestHandler will look similar to the following:

      void handle(Message request, Handback hb, boolean requires_response) throws Throwable;

- Handback is an interface, and its impl contains header information (e.g.
request ID).
- Handback has a sendReply(Object reply, boolean is_exception) method which sends a response (or exception) back to the caller.
- When requires_response is false, the AsyncRequestHandler doesn't need to invoke sendReply().
- Message batching: the above interfaces need to take message batching into account, e.g. the ability to handle multiple requests concurrently (if they don't need to be executed sequentially).

Thoughts?

On 2/6/13 8:29 PM, Pedro Ruivo wrote:
> Hi all,
>
> Recently I came up with a solution that can help with the thread pool
> problem, motivated by the following:
>
> In one of the first implementations of the Total Order based commit
> protocol (TO), I had the requirement to move the PrepareCommand to
> another thread pool. In short, the TO protocol delivers the
> PrepareCommand in a deterministic order on all the nodes, by a single
> deliver thread. To ensure consistency, if it delivers two conflicting
> transactions, the second transaction must wait until the first
> transaction finishes. However, blocking the single deliver thread is
> not a good solution, because no more transactions can be validated,
> even if they don't conflict, while the thread is blocked.
>
> So, after creating a dependency graph (i.e. the second transaction
> knows that it must wait for the first transaction to finish), I move
> the PrepareCommand to another thread pool. Initially, I implemented a
> new command, called PrepareResponseCommand, that sends back the reply
> of the PrepareCommand. This solution has one disadvantage: I had to
> implement an ack collector in ISPN, while JGroups already offers me
> that with synchronous communication.
>
> Recently (2 or 3 months ago) I implemented a simple modification in
> JGroups. In a more generic approach, it allows other threads to reply
> to an RPC request (such as the PrepareCommand). In the previous
> scenario, I replaced the PrepareResponseCommand and the ack collector
> implementation with a synchronous RPC invocation.
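As a rough illustration of the AsyncRequestHandler / Handback pair proposed above (this is only a sketch, not the actual JGroups API: String stands in for Message, and the executor plumbing is purely hypothetical):

```java
import java.util.concurrent.*;

// Sketch of the proposed API (String stands in for org.jgroups.Message
// to keep the example self-contained).
interface Handback {
    // Sends a reply (or an exception) back to the original caller.
    void sendReply(Object reply, boolean is_exception);
}

interface AsyncRequestHandler {
    // When requires_response is false, the handler need not call sendReply().
    void handle(String request, Handback hb, boolean requires_response) throws Throwable;
}

public class AsyncDispatchSketch {
    // Illustrative handler: hands the request to a pool and replies when done,
    // so the calling (e.g. OOB) thread is released immediately.
    public static Object dispatchAndAwait(String request) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CompletableFuture<Object> replyFuture = new CompletableFuture<>();
        Handback hb = (reply, isException) -> replyFuture.complete(reply);

        AsyncRequestHandler handler = (req, handback, requiresResponse) ->
                pool.submit(() -> handback.sendReply("echo:" + req, false));

        try {
            handler.handle(request, hb, true); // returns at once; reply arrives later
        } catch (Throwable t) {
            replyFuture.completeExceptionally(t);
        }
        Object reply = replyFuture.get(5, TimeUnit.SECONDS);
        pool.shutdown();
        return reply;
    }
}
```

The point of the shape is that handle() returns immediately, and the reply can be produced by any thread via the handback.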
> I've used this solution for other issues in the Cloud-TM ISPN fork.
>
> This solution is quite simple to implement and may help you to move
> the commands to ISPN internal thread pools. The modifications I've
> made are the following:
>
> 1) I added a new interface (see [1]) that is passed to the application
> instead of the Message object (see [4]). This interface contains the
> Message and has a method that allows the application to send the reply
> to that particular request.
> 2) I added a new object in [4] with the meaning: "this return value is
> not the reply to the RPC request". This is the value I return when I
> want to release the thread, because ISPN must return some object from
> the handle() method. Of course, ISPN will invoke sendReply() somewhere
> else later; otherwise, I will get a TimeoutException on the sender side.
> 3) I also changed the RequestCorrelator implementation to support the
> previous modifications (see [2] and [3]).
>
> In the Cloud-TM ISPN fork I added a reference in the
> BaseCacheRpcCommand to [1] and added the method sendReply() [5]. In
> addition, I have the following use cases working perfectly with this:
>
> 1) Total Order
>
> The scenario described at the beginning. The ParallelTotalOrderManager
> returns the DO_NOT_REPLY object when it receives a remote
> PrepareCommand (see [6] line 77). When the PrepareCommand has finally
> been processed by the rest of the interceptor chain, it invokes
> PrepareCommand.sendReply() (see [6] line 230).
>
> 2) GMU remote get
>
> GMU ensures the SERIALIZABLE isolation level, and remote gets must
> ensure that the node processing the request has a minimum version
> available to guarantee data consistency. The problem in our initial
> implementation on large clusters is that the number of remote gets is
> very high and all the OOB threads are being blocked on this condition.
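If I read modifications (1) and (2) correctly, the pattern boils down to something like the sketch below: the handler queues the real work and returns a special marker to release the deliver thread, and a worker later calls sendReply() on the request object. All types here are simplified stand-ins for the classes linked in [1] and [4], not the real code:

```java
import java.util.concurrent.*;

// Simplified stand-in for the MessageRequest interface from [1]:
// wraps the request and lets any thread send the reply later.
interface MessageRequest {
    String payload();
    void sendReply(Object reply, boolean is_exception);
}

public class DeferredReplySketch {
    // Stand-in for the special return value from [4]: "this is not the reply".
    public static final Object DO_NOT_REPLY = new Object();

    // What the application's handle() might do: queue the real work and
    // release the (single) deliver thread immediately.
    public static Object handle(MessageRequest req, ExecutorService pool) {
        pool.submit(() -> req.sendReply("processed:" + req.payload(), false));
        return DO_NOT_REPLY; // the correlator must NOT send this as the reply
    }
}
```

The correlator side then simply skips sending a reply when it sees the marker, avoiding the double-reply problem mentioned at the top of this mail.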
>
> Same thing I've done with the ClusteredRemoteGet, as you can see in
> [7], lines 93 and 105.
>
> 3) GMU CommitCommand
>
> In GMU, the CommitCommand cannot be processed in an arbitrary order.
> If T1 is serialized before T2, the commit command of T1 must be
> processed before the commit command of T2, even if the transactions do
> not conflict. This creates the same problem as above, and the same
> solution was adopted.
>
> I know that you have discussed some solutions and I would like to know
> your opinion about what I've described.
>
> If you have questions, please let me know.
>
> Cheers,
> Pedro
>
> [1] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/MessageRequest.java
> [2] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L463
> [3] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestCorrelator.java#L495
> [4] https://github.com/pruivo/JGroups/blob/t_cloudtm/src/org/jgroups/blocks/RequestHandler.java
> [5] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/commands/remote/BaseRpcCommand.java#L75
> [6] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/transaction/totalorder/ParallelTotalOrderManager.java
> [7] https://github.com/pruivo/infinispan/blob/cloudtm_v1/core/src/main/java/org/infinispan/commands/remote/GMUClusteredGetCommand.java
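The ordering requirement in use case (3), where T1's commit must be processed before T2's without blocking the deliver thread, could be approximated by chaining each commit onto its predecessor's completion. This is purely illustrative and not the GMU implementation; all names below are hypothetical:

```java
import java.util.*;
import java.util.concurrent.*;

public class OrderedCommitSketch {
    // Completion of the previously delivered commit; each new commit is
    // chained onto it, so commits execute in delivery (serialization)
    // order without ever blocking the deliver thread.
    private CompletableFuture<Void> previous = CompletableFuture.completedFuture(null);
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    public final List<String> committed = Collections.synchronizedList(new ArrayList<>());

    // Called by the deliver thread; returns immediately.
    public void deliverCommit(String tx) {
        previous = previous.thenRunAsync(() -> committed.add(tx), pool);
    }

    // Waits until all chained commits have been processed.
    public void await() throws Exception {
        previous.get(5, TimeUnit.SECONDS);
        pool.shutdown();
    }
}
```

The reply for each commit would then be sent from the pool thread via the deferred-reply mechanism described above, once its turn in the chain comes.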
>
> On 2/3/13 11:35 AM, Bela Ban wrote:
>> If you send me the details, I'll take a look. I'm pretty busy with
>> message batching, so I can't promise next week, but soon...
>>
>> On 2/1/13 11:08 AM, Pedro Ruivo wrote:
>>> Hi,
>>>
>>> I had a similar problem when I tried GMU [1] in a "large" cluster
>>> (40 VMs), because the remote gets and the commit messages (I'm
>>> talking about ISPN commands) must wait for some conditions before
>>> being processed.
>>>
>>> I solved this problem by adding a feature to JGroups [2] that allows
>>> the request to be moved to another thread, releasing the OOB thread.
>>> The other thread will send the reply to the JGroups request. Of
>>> course, I'm only moving commands that I know can block.
>>>
>>> I can go into some detail if you want =)
>>>
>>> Cheers,
>>> Pedro
>>>
>>> [1] http://www.gsd.inesc-id.pt/~romanop/files/papers/icdcs12.pdf
>>> [2] I would like to talk with Bela about this, because it makes my
>>> life easier to support total order in ISPN. I'll try to send an email
>>> this weekend =)
>>>
>>> On 01-02-2013 08:04, Radim Vansa wrote:
>>>> Hi guys,
>>>>
>>>> After dealing with a large cluster for a while, I find the way we
>>>> use OOB threads in a synchronous configuration non-robust.
>>>> Imagine a situation where a node which is not an owner of a key
>>>> calls PUT. Then an RPC is sent to the primary owner of that key,
>>>> which reroutes the request to all other owners and, after these
>>>> reply, replies back. There are two problems:
>>>> 1) If we make X simultaneous requests from non-owners to the primary
>>>> owner, where X is the OOB TP size, all the OOB threads are waiting
>>>> for the responses and there is no thread left to process an OOB
>>>> response and release a thread.
>>>> 2) Node A is primary owner of keyA and non-primary owner of keyB,
>>>> and B is primary of keyB and non-primary of keyA.
>>>> We got many requests for both keyA and keyB from other nodes;
>>>> therefore, all OOB threads on both nodes call an RPC to the
>>>> non-primary owner, but there's no one left who could process the
>>>> request.
>>>>
>>>> While we wait for the requests to time out, the nodes with depleted
>>>> OOB thread pools start suspecting all other nodes because they can't
>>>> receive heartbeats etc...
>>>>
>>>> You can say "increase your OOB tp size", but that's not always an
>>>> option; I have currently set it to 1000 threads and it's not enough.
>>>> In the end, I will always be limited by RAM, and something tells me
>>>> that even nodes with a few gigs of RAM should be able to form a huge
>>>> cluster. We use 160 HotRod worker threads in JDG; that means that
>>>> 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests
>>>> can be executed, and if 10% target the same node with 1000 OOB
>>>> threads, it gets stuck. It's about scaling and robustness.
>>>>
>>>> Not that I have any good solution, but I'd really like to start a
>>>> discussion.
>>>> Thinking about it a bit, the problem is that a blocking call
>>>> (calling an RPC on the primary owner from the message handler) can
>>>> block non-blocking calls (such as an RPC response or a command that
>>>> never sends any more messages). Therefore, having a flag on the
>>>> message saying "this won't send another message" could let the
>>>> message be executed in a different thread pool, which would never be
>>>> deadlocked. In fact, the pools could share the threads, but the
>>>> non-blocking one would always have a few threads spare.
>>>> It's a bad solution, as maintaining which messages could block on
>>>> the other node is really, really hard (we can be sure only in the
>>>> case of RPC responses), especially when locks come into play. I will
>>>> welcome anything better.
>>>>
>>>> Radim
>>>>
>>>> -----------------------------------------------------------
>>>> Radim Vansa
>>>> Quality Assurance Engineer
>>>> JBoss Datagrid
>>>> tel. +420532294559 ext.
>>>> 62559
>>>>
>>>> Red Hat Czech, s.r.o.
>>>> Brno, Purkyňova 99/71, PSČ 612 45
>>>> Czech Republic
>>>>
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> infinispan-dev@lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev

--
Bela Ban, JGroups lead (http://www.jgroups.org)