Re: [infinispan-dev] Threadpools in a large cluster
On 7 Feb 2013, at 11:36, Manik Surtani wrote: On 7 Feb 2013, at 11:29, Bela Ban b...@redhat.com wrote: I meant to say that this is up to you guys to decide in which Infinispan release this will be used, but it will be available in JGroups 3.3. What's the strategy/schedule for 6 and 5.3 anyway? 5.3 will be relatively quick, just an incremental release mainly targeting reviewing and putting in a few contribs on the queue. E.g., we're hoping for 5.3 to be feature-complete in no more than 2 months. 6 is where we get to break API - specifically with the new packaging, etc. But this seems like a pretty important feature/fix so it may make sense to get this into 5.3. Pedro - please correct me if I'm wrong. This feature (or the alternative Pedro has already implemented in his branch) is used by the TO code that we'll integrate in 5.3. So in order to have TO in ISPN we need these JGroups improvements in. Pedro - are there any other modifications in *your branch* of JGroups that require integration into JGroups upstream from a TO perspective? Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org) ___ infinispan-dev mailing list infinispan-dev@lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev
Re: [infinispan-dev] Threadpools in a large cluster
On 2/8/13 12:26 PM, Mircea Markus wrote: On 7 Feb 2013, at 11:36, Manik Surtani wrote: On 7 Feb 2013, at 11:29, Bela Ban b...@redhat.com wrote: I meant to say that this is up to you guys to decide in which Infinispan release this will be used, but it will be available in JGroups 3.3. What's the strategy/schedule for 6 and 5.3 anyway? 5.3 will be relatively quick, just an incremental release mainly targeting reviewing and putting in a few contribs on the queue. E.g., we're hoping for 5.3 to be feature-complete in no more than 2 months. 6 is where we get to break API - specifically with the new packaging, etc. But this seems like a pretty important feature/fix so it may make sense to get this into 5.3. Pedro - please correct me if I'm wrong. This feature (or the alternative Pedro has already implemented in his branch) is used by the TO code that we'll integrate in 5.3. So in order to have TO in ISPN we need these JGroups improvements in. yes, I'll need it. otherwise, I have to create another RpcCommand to send back the response generated by the other thread Pedro - are there any other modifications in *your branch* of JGroups that require integration into JGroups upstream from a TO perspective? No, just this... Cheers Pedro Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org)
Re: [infinispan-dev] Threadpools in a large cluster
Hi Bela, Alpha1 does not exist in GitHub. Instead you have 3.3.0.Alpha2 =) So, can I start to use it? Thanks Pedro On 2/8/13 2:31 PM, Bela Ban wrote: On 2/8/13 3:23 PM, Mircea Markus wrote: Hi Bela, Do you think we can have this particular improvement in JGroups by 18 Feb? That week Pedro, Dan and I are going to start integrating TO protocols in ISPN and this is a requirement for it. It is done and part of the alpha1 release... On 8 Feb 2013, at 12:41, Pedro Ruivo wrote: On 2/8/13 12:26 PM, Mircea Markus wrote: On 7 Feb 2013, at 11:36, Manik Surtani wrote: On 7 Feb 2013, at 11:29, Bela Ban b...@redhat.com wrote: I meant to say that this is up to you guys to decide in which Infinispan release this will be used, but it will be available in JGroups 3.3. What's the strategy/schedule for 6 and 5.3 anyway? 5.3 will be relatively quick, just an incremental release mainly targeting reviewing and putting in a few contribs on the queue. E.g., we're hoping for 5.3 to be feature-complete in no more than 2 months. 6 is where we get to break API - specifically with the new packaging, etc. But this seems like a pretty important feature/fix so it may make sense to get this into 5.3. Pedro - please correct me if I'm wrong. This feature (or the alternative Pedro has already implemented in his branch) is used by the TO code that we'll integrate in 5.3. So in order to have TO in ISPN we need these JGroups improvements in. yes, I'll need it. otherwise, I have to create another RpcCommand to send back the response generated by the other thread Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org)
Re: [infinispan-dev] Threadpools in a large cluster
On 8 Feb 2013, at 14:31, Bela Ban wrote: On 2/8/13 3:23 PM, Mircea Markus wrote: Hi Bela, Do you think we can have this particular improvement in JGroups by 18 Feb? That week Pedro, Dan and I are going to start integrating TO protocols in ISPN and this is a requirement for it. It is done and part of the alpha1 release... thanks Bela! Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org)
Re: [infinispan-dev] Threadpools in a large cluster
I've created a JIRA in the scope of Infinispan 5.3 to track the thread pool improvements discussed in this email thread: ISPN-2808 - Make Infinispan use its own thread pool for sending OOB messages in order to avoid thread deadlocks On 8 Feb 2013, at 14:44, Mircea Markus wrote: On 8 Feb 2013, at 14:31, Bela Ban wrote: On 2/8/13 3:23 PM, Mircea Markus wrote: Hi Bela, Do you think we can have this particular improvement in JGroups by 18 Feb? That week Pedro, Dan and I are going to start integrating TO protocols in ISPN and this is a requirement for it. It is done and part of the alpha1 release... thanks Bela! Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org) Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org)
Re: [infinispan-dev] Threadpools in a large cluster
yes, I mis-tagged it... :-( On 2/8/13 3:33 PM, Pedro Ruivo wrote: Hi Bela, Alpha1 does not exist in GitHub. Instead you have 3.3.0.Alpha2 =) So, can I start to use it? Thanks Pedro On 2/8/13 2:31 PM, Bela Ban wrote: On 2/8/13 3:23 PM, Mircea Markus wrote: Hi Bela, Do you think we can have this particular improvement in JGroups by 18 Feb? That week Pedro, Dan and I are going to start integrating TO protocols in ISPN and this is a requirement for it. It is done and part of the alpha1 release... On 8 Feb 2013, at 12:41, Pedro Ruivo wrote: On 2/8/13 12:26 PM, Mircea Markus wrote: On 7 Feb 2013, at 11:36, Manik Surtani wrote: On 7 Feb 2013, at 11:29, Bela Ban b...@redhat.com wrote: I meant to say that this is up to you guys to decide in which Infinispan release this will be used, but it will be available in JGroups 3.3. What's the strategy/schedule for 6 and 5.3 anyway? 5.3 will be relatively quick, just an incremental release mainly targeting reviewing and putting in a few contribs on the queue. E.g., we're hoping for 5.3 to be feature-complete in no more than 2 months. 6 is where we get to break API - specifically with the new packaging, etc. But this seems like a pretty important feature/fix so it may make sense to get this into 5.3. Pedro - please correct me if I'm wrong. This feature (or the alternative Pedro has already implemented in his branch) is used by the TO code that we'll integrate in 5.3. So in order to have TO in ISPN we need these JGroups improvements in. yes, I'll need it. 
otherwise, I have to create another RpcCommand to send back the response generated by the other thread Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org) -- Bela Ban, JGroups lead (http://www.jgroups.org)
Re: [infinispan-dev] Threadpools in a large cluster
excellent, and we'll update it with our findings from the London meeting On 2/8/13 5:19 PM, Mircea Markus wrote: I've created a JIRA in the scope of Infinispan 5.3 to track the thread pool improvements discussed in this email thread: ISPN-2808 - Make Infinispan use its own thread pool for sending OOB messages in order to avoid thread deadlocks On 8 Feb 2013, at 14:44, Mircea Markus wrote: On 8 Feb 2013, at 14:31, Bela Ban wrote: On 2/8/13 3:23 PM, Mircea Markus wrote: Hi Bela, Do you think we can have this particular improvement in JGroups by 18 Feb? That week Pedro, Dan and I are going to start integrating TO protocols in ISPN and this is a requirement for it. It is done and part of the alpha1 release... thanks Bela! Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org) Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org) -- Bela Ban, JGroups lead (http://www.jgroups.org)
Re: [infinispan-dev] Threadpools in a large cluster
On Thu, Feb 7, 2013 at 6:53 AM, Bela Ban b...@redhat.com wrote:

Hi Pedro, this is almost exactly what I wanted to implement! Question - in RequestCorrelator.handleRequest():

    protected void handleRequest(Message req, Header hdr) {
        Object retval;
        boolean threwException = false;
        MessageRequest messageRequest = new MessageRequestImpl(req, hdr);
        try {
            retval = request_handler.handle(messageRequest);
        }
        catch(Throwable t) {
            retval = t;
            threwException = true;
        }
        messageRequest.sendReply(retval, threwException); // <-- should be moved up, or called only if threwException == true
    }

you create a MessageRequestImpl and pass it to the RequestHandler. The request handler then dispatches the request (possibly) to a thread pool and calls MessageRequestImpl.sendReply() when done. However, you also call MessageRequest.sendReply() before returning from handleRequest(). I think this is an error, and MessageRequest.sendReply() should be moved up inside the catch clause, or be called only if threwException is true, so that we send a reply on behalf of the RequestHandler if and only if it threw an exception (e.g. before it dispatches the request to a thread pool). Otherwise, we'd send a reply *twice*!

A few changes I have in mind (need to think about it more):
- I want to leave the existing RequestHandler interface in place, so current implementations continue to work
- There will be a new AsyncRequestHandler interface (possibly extending RequestHandler, so an implementation can decide to implement both). The RequestCorrelator needs to have either request_handler or async_request_handler set. If the former is set, the logic is unchanged. If the latter is set I'll invoke the async dispatching code
- AsyncRequestHandler will look similar to the following: void handle(Message request, Handback hb, boolean requires_response) throws Throwable;
- Handback is an interface, and its impl contains header information (e.g.
request ID)
- Handback has a sendReply(Object reply, boolean is_exception) method which sends a response (or exception) back to the caller

+1 for a new interface. TBH I hadn't read the RequestCorrelator code, so I had assumed it was already asynchronous, and only RpcDispatcher was synchronous. I'm not so sure about the Handback name, how about calling it Response instead?

- When requires_response is false, the AsyncRequestHandler doesn't need to invoke sendReply()

I think this should be the other way around: when requires_response is true, the AsyncRequestHandler *can* invoke sendReply(), but is not required to (the call will just time out on the caller node); when requires_response is false, invoking sendReply() should throw an exception.

- Message batching
- The above interfaces need to take message batching into account, e.g. the ability to handle multiple requests concurrently (if they don't need to be executed sequentially)

You mean handle() is still going to be called once for each request, but a second handle() call won't necessarily wait for the first message's sendReply() call? Is this going to apply only to OOB messages, or to regular messages as well? I think I'd prefer it if it only applied to OOB messages, otherwise we'd have to implement our own ordering for regular/async commands.

Thoughts? On 2/6/13 8:29 PM, Pedro Ruivo wrote: Hi all, Recently I came up with a solution that can help with the thread pool problem, motivated by the following: In one of the first implementations of Total Order based commit protocol (TO), I had the requirement to move the PrepareCommand to another thread pool. In summary, the TO protocol delivers the PrepareCommand in a deterministic order in all the nodes, by a single deliver thread. To ensure consistency, if it delivers two conflicting transactions, the second transaction must wait until the first transaction finishes. 
However, blocking the single deliver thread is not a good solution, because no more transactions can be validated, even if they don't conflict, while the thread is blocked. So, after creating a dependency graph (i.e. the second transaction knows that it must wait for the first transaction to finish) I move the PrepareCommand to another thread pool. Initially, I implemented a new command, called PrepareResponseCommand, that sends back the reply of the PrepareCommand. This solution has one disadvantage: I had to implement an ack collector in ISPN, while JGroups already offers me that with a synchronous communication. Recently (2 or 3 months ago) I implemented a simple modification in JGroups. In a more generic approach, it allows other threads to reply to an RPC request (such as the PrepareCommand). In the previous scenario, I replaced the PrepareResponseCommand and the ack collector implementation with a synchronous RPC invocation. I've used this solution in other issues in the Cloud-TM's ISPN fork. This solution is quite simple to implement and may help you to move the commands to ISPN internal thread pools.
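The pattern Pedro describes, where the single deliver thread hands the request to a pool and some other thread answers the synchronous RPC later, can be sketched roughly as below. This is an illustration only: the `ReplySink`/`sendReply` names mirror the thread's terminology, and `DeliverThreadSketch` is a stand-in, not the actual JGroups or Cloud-TM code.

```java
import java.util.concurrent.*;

// Illustrative sketch: "sendReply" follows the naming in this thread,
// not the shipped JGroups API.
interface ReplySink {
    void sendReply(Object reply, boolean isException);
}

class DeliverThreadSketch {
    // daemon threads so a demo JVM can exit cleanly
    private final ExecutorService pool = Executors.newFixedThreadPool(2, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Called by the single deliver thread; returns immediately so the
    // next (non-conflicting) transaction can be validated. The reply to
    // the synchronous RPC is sent later, from a pool thread.
    void handle(String prepareCommand, ReplySink rsp) {
        pool.submit(() -> rsp.sendReply("validated:" + prepareCommand, false));
    }
}
```

The caller-side ack collection comes for free: because the RPC is still synchronous from the invoker's point of view, no extra PrepareResponseCommand or ack collector is needed.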
Re: [infinispan-dev] Threadpools in a large cluster
On 7 Feb 2013, at 10:30, Bela Ban b...@redhat.com wrote: No, this *could* be in Infinispan 5.3 (it will be in JGroups 3.3). A MessageDispatcher (RpcDispatcher) instance picks which dispatching mechanism it wants to use. I have RequestHandler (the default) and a sub-interface AsyncRequestHandler. MessageDispatcher (and subclass RpcDispatcher) will implement both (they already do implement handle(Message)). So now a user simply sets an attribute in MessageDispatcher to select async dispatching (sync is the default). What needs to be done from the Infinispan side is to override handle(Message,Response), and implement handling of requests in a thread pool. The current behavior (inherited from MessageDispatcher) will be to call handle(Message) which CommandAwareDispatcher already implements. The Infinispan side can be done in 10 minutes. However, the real work will be the dispatching of incoming requests to threads from the Infinispan thread pool, and the impl of the thread pool, which doesn't exist yet. I guess preserving ordering of requests will be the important part. This isn't particularly hard either ... If you have your own thread pool, sync RPCs can be sent without OOB, but the handle() method in CommandAwareDispatcher can decide, based on the mode (e.g. sync) whether to queue the request behind other requests, or whether to invoke it directly. I wanted to implement this quickly in JGroups so the hooks are in place for Infinispan to use them later, once a pool has been implemented. On 2/7/13 10:56 AM, Manik Surtani wrote: Very interesting. However I presume this would be something for Infinispan 6.0? Any thoughts on backward compat? On 7 Feb 2013, at 04:53, Bela Ban b...@redhat.com wrote: Hi Pedro, this is almost exactly what I wanted to implement ! 
Question - in RequestCorrelator.handleRequest():

    protected void handleRequest(Message req, Header hdr) {
        Object retval;
        boolean threwException = false;
        MessageRequest messageRequest = new MessageRequestImpl(req, hdr);
        try {
            retval = request_handler.handle(messageRequest);
        }
        catch(Throwable t) {
            retval = t;
            threwException = true;
        }
        messageRequest.sendReply(retval, threwException); // <-- should be moved up, or called only if threwException == true
    }

you create a MessageRequestImpl and pass it to the RequestHandler. The request handler then dispatches the request (possibly) to a thread pool and calls MessageRequestImpl.sendReply() when done. However, you also call MessageRequest.sendReply() before returning from handleRequest(). I think this is an error, and MessageRequest.sendReply() should be moved up inside the catch clause, or be called only if threwException is true, so that we send a reply on behalf of the RequestHandler if and only if it threw an exception (e.g. before it dispatches the request to a thread pool). Otherwise, we'd send a reply *twice*! A few changes I have in mind (need to think about it more): - I want to leave the existing RequestHandler interface in place, so current implementations continue to work - There will be a new AsyncRequestHandler interface (possibly extending RequestHandler, so an implementation can decide to implement both). The RequestCorrelator needs to have either request_handler or async_request_handler set. If the former is set, the logic is unchanged. If the latter is set I'll invoke the async dispatching code - AsyncRequestHandler will look similar to the following: void handle(Message request, Handback hb, boolean requires_response) throws Throwable; - Handback is an interface, and its impl contains header information (e.g. 
request ID) - Handback has a sendReply(Object reply, boolean is_exception) method which sends a response (or exception) back to the caller - When requires_response is false, the AsyncRequestHandler doesn't need to invoke sendReply() - Message batching - The above interfaces need to take message batching into account, e.g. the ability to handle multiple requests concurrently (if they don't need to be executed sequentially) Thoughts? On 2/6/13 8:29 PM, Pedro Ruivo wrote: Hi all, Recently I came up with a solution that can help with the thread pool problem, motivated by the following: In one of the first implementations of Total Order based commit protocol (TO), I had the requirement to move the PrepareCommand to another thread pool. In summary, the TO protocol delivers the PrepareCommand in a deterministic order in all the nodes, by a single deliver thread. To ensure consistency, if it delivers two conflicting transactions, the second transaction must wait until the first transaction finishes. However, blocking the single deliver thread is not a good solution, because no more transactions can be validated, even if they don't conflict, while the thread is blocked. So, after creating a dependency graph (i.e. the second transaction knows that it must wait for the first transaction to finish)
Re: [infinispan-dev] Threadpools in a large cluster
I meant to say that this is up to you guys to decide inwhich Infinispan release this will be used, but it will be available in JGroups 3.3. What's the strategy/schedule for 6 and 5.3 anyway ? On 2/7/13 11:30 AM, Bela Ban wrote: No, this *could* be in Infinispan 5.3 (it will be in JGroups 3.3). A MessageDispatcher (RpcDispatcher) instance picks which dispatching mechanism it wants to use. I have RequestHandler (the default) and a sub-interface AsyncRequestHandler. MessageDispatcher (and subclass RpcDispatcher) will implement both (they already do implement handle(Message)). So now a user simply sets an attribute in MessageDispatcher to select async dispatching (sync is the default). What needs to be done from the Infinispan side is to override handle(Message,Response), and implement handling of requests in a thread pool. The current behavior (inherited from MessageDispatcher) will be to call handle(Message) which CommandAwareDispatcher already implements. The Infinispan side can be done in 10 minutes. However, the real work will be the dispatching of incoming requests to threads from the Infinispan thread pool, and the impl of the thread pool, which doesn't exist yet. I guess preserving ordering of requests will be the important part. If you have your own thread pool, sync RPCs can be sent without OOB, but the handle() method in CommandAwareDispatcher can decide, based on the mode (e.g. sync) whether to queue the request behind other requests, or whether to invoke it directly. I wanted to implement this quickly in JGroups so the hooks are in place for Infinispan to use them later, once a pool has been implemented. On 2/7/13 10:56 AM, Manik Surtani wrote: Very interesting. However I presume this would be something for Infinispan 6.0? Any thoughts on backward compat? On 7 Feb 2013, at 04:53, Bela Ban b...@redhat.com wrote: Hi Pedro, this is almost exactly what I wanted to implement ! 
Question - in RequestCorrelator.handleRequest():

    protected void handleRequest(Message req, Header hdr) {
        Object retval;
        boolean threwException = false;
        MessageRequest messageRequest = new MessageRequestImpl(req, hdr);
        try {
            retval = request_handler.handle(messageRequest);
        }
        catch(Throwable t) {
            retval = t;
            threwException = true;
        }
        messageRequest.sendReply(retval, threwException); // <-- should be moved up, or called only if threwException == true
    }

you create a MessageRequestImpl and pass it to the RequestHandler. The request handler then dispatches the request (possibly) to a thread pool and calls MessageRequestImpl.sendReply() when done. However, you also call MessageRequest.sendReply() before returning from handleRequest(). I think this is an error, and MessageRequest.sendReply() should be moved up inside the catch clause, or be called only if threwException is true, so that we send a reply on behalf of the RequestHandler if and only if it threw an exception (e.g. before it dispatches the request to a thread pool). Otherwise, we'd send a reply *twice*! A few changes I have in mind (need to think about it more): - I want to leave the existing RequestHandler interface in place, so current implementations continue to work - There will be a new AsyncRequestHandler interface (possibly extending RequestHandler, so an implementation can decide to implement both). The RequestCorrelator needs to have either request_handler or async_request_handler set. If the former is set, the logic is unchanged. If the latter is set I'll invoke the async dispatching code - AsyncRequestHandler will look similar to the following: void handle(Message request, Handback hb, boolean requires_response) throws Throwable; - Handback is an interface, and its impl contains header information (e.g. 
request ID) - Handback has a sendReply(Object reply, boolean is_exception) method which sends a response (or exception) back to the caller - When requires_response is false, the AsyncRequestHandler doesn't need to invoke sendReply() - Message batching - The above interfaces need to take message batching into account, e.g. the ability to handle multiple requests concurrently (if they don't need to be executed sequentially) Thoughts? On 2/6/13 8:29 PM, Pedro Ruivo wrote: Hi all, Recently I came up with a solution that can help with the thread pool problem, motivated by the following: In one of the first implementations of Total Order based commit protocol (TO), I had the requirement to move the PrepareCommand to another thread pool. In summary, the TO protocol delivers the PrepareCommand in a deterministic order in all the nodes, by a single deliver thread. To ensure consistency, if it delivers two conflicting transactions, the second transaction must wait until the first transaction finishes. However, blocking the single deliver thread is not a good solution, because no more transactions can be validated, even if they don't conflict, while the thread is blocked.
Re: [infinispan-dev] Threadpools in a large cluster
Hi Bela

On 2/7/13 4:53 AM, Bela Ban wrote: Hi Pedro, this is almost exactly what I wanted to implement! Question - in RequestCorrelator.handleRequest():

    protected void handleRequest(Message req, Header hdr) {
        Object retval;
        boolean threwException = false;
        MessageRequest messageRequest = new MessageRequestImpl(req, hdr);
        try {
            retval = request_handler.handle(messageRequest);
        }
        catch(Throwable t) {
            retval = t;
            threwException = true;
        }
        messageRequest.sendReply(retval, threwException); // <-- should be moved up, or called only if threwException == true
    }

you create a MessageRequestImpl and pass it to the RequestHandler. The request handler then dispatches the request (possibly) to a thread pool and calls MessageRequestImpl.sendReply() when done. However, you also call MessageRequest.sendReply() before returning from handleRequest(). I think this is an error, and MessageRequest.sendReply() should be moved up inside the catch clause, or be called only if threwException is true, so that we send a reply on behalf of the RequestHandler if and only if it threw an exception (e.g. before it dispatches the request to a thread pool). Otherwise, we'd send a reply *twice*!

In my defense, I was assuming if the application uses the sendReply() method, it must return a special return value: DO_NOT_REPLY (in the RequestHandler interface). This return value is automatically ignored:

    public final void sendReply(Object reply, boolean exceptionThrown) {
        if(!header.rsp_expected || reply == RequestHandler.DO_NOT_REPLY)
            return;

A few changes I have in mind (need to think about it more): - I want to leave the existing RequestHandler interface in place, so current implementations continue to work - There will be a new AsyncRequestHandler interface (possibly extending RequestHandler, so an implementation can decide to implement both). The RequestCorrelator needs to have either request_handler or async_request_handler set. If the former is set, the logic is unchanged. 
If the latter is set I'll invoke the async dispatching code

I'm not sure if it is a good idea to have AsyncRequestHandler extending the RequestHandler interface. If the application implements both methods (Object handle(Message) and void handle(Message, ...)), how do you know which method should be invoked?

- AsyncRequestHandler will look similar to the following: void handle(Message request, Handback hb, boolean requires_response) throws Throwable; - Handback is an interface, and its impl contains header information (e.g. request ID) - Handback has a sendReply(Object reply, boolean is_exception) method which sends a response (or exception) back to the caller - When requires_response is false, the AsyncRequestHandler doesn't need to invoke sendReply()

My 2 cents: I think the boolean requires_response should be in the Handback implementation, to avoid passing an extra argument.

- Message batching - The above interfaces need to take message batching into account, e.g. the ability to handle multiple requests concurrently (if they don't need to be executed sequentially) Thoughts? On 2/6/13 8:29 PM, Pedro Ruivo wrote: Hi all, Recently I came up with a solution that can help with the thread pool problem, motivated by the following: In one of the first implementations of Total Order based commit protocol (TO), I had the requirement to move the PrepareCommand to another thread pool. In summary, the TO protocol delivers the PrepareCommand in a deterministic order in all the nodes, by a single deliver thread. To ensure consistency, if it delivers two conflicting transactions, the second transaction must wait until the first transaction finishes. However, blocking the single deliver thread is not a good solution, because no more transactions can be validated, even if they don't conflict, while the thread is blocked. So, after creating a dependency graph (i.e. 
the second transaction knows that it must wait for the first transaction to finish) I move the PrepareCommand to another thread pool. Initially, I implemented a new command, called PrepareResponseCommand, that sends back the reply of the PrepareCommand. This solution has one disadvantage: I had to implement an ack collector in ISPN, while JGroups already offers me that with a synchronous communication. Recently (2 or 3 months ago) I implemented a simple modification in JGroups. In a more generic approach, it allows other threads to reply to an RPC request (such as the PrepareCommand). In the previous scenario, I replaced the PrepareResponseCommand and the ack collector implementation with a synchronous RPC invocation. I've used this solution in other issues in the Cloud-TM's ISPN fork. This solution is quite simple to implement and may help you to move the commands to ISPN internal thread pools. The modifications I've made are the following: 1) I added a new interface (see [1]) that is sent to the application instead of the Message
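The DO_NOT_REPLY guard Pedro describes, where the handler returns a sentinel so the correlator's unconditional sendReply() call becomes a no-op, can be sketched as follows. The sentinel name comes from the thread; the surrounding classes are stand-ins for illustration, not the actual JGroups RequestCorrelator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Stand-in illustrating the DO_NOT_REPLY guard discussed above; not
// the actual JGroups code.
class ReplyGuardSketch {
    static final Object DO_NOT_REPLY = new Object();  // identity-compared sentinel
    final List<Object> sentReplies = new ArrayList<>();

    void sendReply(Object reply, boolean exceptionThrown) {
        if (reply == DO_NOT_REPLY)
            return;                  // another thread owns the real reply
        sentReplies.add(reply);
    }

    // Mirrors handleRequest(): the handler's return value is always
    // forwarded to sendReply(), so the sentinel is what prevents a
    // premature (double) reply when a pool thread will reply later.
    void handleRequest(Function<String, Object> handler, String req) {
        Object retval;
        boolean threw = false;
        try {
            retval = handler.apply(req);
        } catch (Throwable t) {
            retval = t;
            threw = true;
        }
        sendReply(retval, threw);
    }
}
```

This also shows why Bela's "reply sent twice" concern and Pedro's design are compatible: the synchronous path still calls sendReply() unconditionally, but the sentinel makes that call a no-op for handlers that dispatch to a pool.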
Re: [infinispan-dev] Threadpools in a large cluster
On 2/7/13 12:29 PM, Pedro Ruivo wrote: Hi Bela On 2/7/13 4:53 AM, Bela Ban wrote: Hi Pedro, this is almost exactly what I wanted to implement! Question - in RequestCorrelator.handleRequest():

    protected void handleRequest(Message req, Header hdr) {
        Object retval;
        boolean threwException = false;
        MessageRequest messageRequest = new MessageRequestImpl(req, hdr);
        try {
            retval = request_handler.handle(messageRequest);
        }
        catch(Throwable t) {
            retval = t;
            threwException = true;
        }
        messageRequest.sendReply(retval, threwException); // <-- should be moved up, or called only if threwException == true
    }

you create a MessageRequestImpl and pass it to the RequestHandler. The request handler then dispatches the request (possibly) to a thread pool and calls MessageRequestImpl.sendReply() when done. However, you also call MessageRequest.sendReply() before returning from handleRequest(). I think this is an error, and MessageRequest.sendReply() should be moved up inside the catch clause, or be called only if threwException is true, so that we send a reply on behalf of the RequestHandler if and only if it threw an exception (e.g. before it dispatches the request to a thread pool). Otherwise, we'd send a reply *twice*! In my defense, I was assuming if the application uses the sendReply() method, it must return a special return value: DO_NOT_REPLY (in the RequestHandler interface). This return value is automatically ignored:

    public final void sendReply(Object reply, boolean exceptionThrown) {
        if(!header.rsp_expected || reply == RequestHandler.DO_NOT_REPLY)
            return;

OK

A few changes I have in mind (need to think about it more): - I want to leave the existing RequestHandler interface in place, so current implementations continue to work - There will be a new AsyncRequestHandler interface (possibly extending RequestHandler, so an implementation can decide to implement both). The RequestCorrelator needs to have either request_handler or async_request_handler set. 
If the former is set, the logic is unchanged. If the latter is set I'll invoke the async dispatching code

I'm not sure if it is a good idea to have AsyncRequestHandler extending the RequestHandler interface. If the application implements both methods (Object handle(Message) and void handle(Message, ...)), how do you know which method should be invoked?

The default would be to invoke the old handle(Message) method. The dispatching mechanism could be changed to use the async method by setting an attribute in MessageDispatcher (which in turn sets it in RequestCorrelator). How would you do this? Remember, we cannot change or remove handle(Message): subclasses of RpcDispatcher or MessageDispatcher, or impls of RequestHandler, are out there, and any change to handle(Message) would break them. Would you simply provide a separate AsyncRequestHandler interface, not extending RequestHandler? This would require RequestCorrelator and MessageDispatcher to have 2 refs instead of 1. With the current approach I can do an instanceof on the RequestHandler. I'd eventually like to merge RequestHandler and AsyncRequestHandler into one class, but this can be done in 4.0 at the earliest.

- AsyncRequestHandler will look similar to the following: void handle(Message request, Handback hb, boolean requires_response) throws Throwable; - Handback is an interface, and its impl contains header information (e.g. request ID) - Handback has a sendReply(Object reply, boolean is_exception) method which sends a response (or exception) back to the caller - When requires_response is false, the AsyncRequestHandler doesn't need to invoke sendReply()

My 2 cents: I think the boolean requires_response should be in the Handback implementation, to avoid passing an extra argument.

I actually removed the requires_response parameter. If no response is required, rsp in handle(Message req, Response rsp) will simply be null. 
-- Bela Ban, JGroups lead (http://www.jgroups.org) ___ infinispan-dev mailing list infinispan-dev@lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev
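The contract Bela and Pedro are converging on (handle() returns immediately, and a Response callback delivers the reply later from another thread) could be sketched roughly like this. This is a toy model: String stands in for org.jgroups.Message, and the interface names follow the emails, not the final JGroups 3.3 API.

```java
import java.util.concurrent.*;

// Toy model of the async dispatch contract discussed in this thread.
// String stands in for org.jgroups.Message; Response/AsyncRequestHandler
// and sendReply() mirror the names used in the emails, not the real API.
interface Response {
    void sendReply(Object reply, boolean isException);
}

interface AsyncRequestHandler {
    // rsp would be null when the caller does not expect a reply
    void handle(String request, Response rsp) throws Exception;
}

public class AsyncDispatchSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService appPool = Executors.newFixedThreadPool(2);
        CompletableFuture<Object> replyAtCaller = new CompletableFuture<>();

        // handle() returns immediately; the reply is sent later from an
        // application pool thread, freeing the JGroups (OOB) thread.
        AsyncRequestHandler handler = (request, rsp) ->
            appPool.submit(() -> rsp.sendReply("reply-to-" + request, false));

        Response rsp = (reply, isException) -> replyAtCaller.complete(reply);
        handler.handle("req-1", rsp); // does not block

        System.out.println(replyAtCaller.get(5, TimeUnit.SECONDS));
        appPool.shutdown();
    }
}
```

The key property is that the delivery thread only pays the cost of submitting a task; the reply path is owned by whichever thread finishes the work.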
Re: [infinispan-dev] Threadpools in a large cluster
On Thu, Feb 7, 2013 at 12:43 PM, Bela Ban b...@redhat.com wrote: On 2/7/13 11:09 AM, Dan Berindei wrote: A few changes I have in mind (need to think about it more): - I want to leave the existing RequestHandler interface in place, so current implementation continue to work - There will be a new AsyncRequestHandler interface (possibly extending RequestHandler, so an implementation can decide to implement both). The RequestCorrelator needs to have either request_handler or async_request_handler set. If the former is set, the logic is unchanged. If the latter is set I'll invoke the async dispatching code - AsyncRequestHandler will look similar to the following: void handle(Message request, Handback hb, boolean requires_response) throws Throwable; - Handback is an interface, and its impl contains header information (e.g. request ID) - Handback has a sendReply(Object reply, boolean is_exception) method which sends a response (or exception) back to the caller +1 for a new interface. TBH I hadn't read the RequestCorrelator code, so I had assumed it was already asynchronous, and only RpcDispatcher was synchronous. Nope, unfortunately not. I'm not so sure about the Handback name, how about calling it Response instead? It *is* actually called Response (can you read my mind?) :-) Nice :) - When requires_response is false, the AsyncRequestHandler doesn't need to invoke sendReply() I think this should be the other way around: when requires_response is true, the AsyncRequestHandler *can* invoke sendReply(), but is not required to (the call will just time out on the caller node); when requires_response is false, invoking sendReply() should throw an exception. The way I actually implemented it this morning is to omit the boolean parameter altogether: void handle(Message request, Response response) throws Exception; Response is null for async requests. Sounds good. - Message batching - The above interfaces need to take message batching into account, e.g. 
the ability to handle multiple requests concurrently (if they don't need to be executed sequentially) You mean handle() is still going to be called once for each request, but a second handle() call won't necessarily wait for the first message's sendReply() call? Yes. I was thinking of adding a second method to the interface, which takes a message batch as a parameter. However, we'd also have to pass in an array of Response objects and it looked a bit clumsy. Agree, it would look quite clumsy. Is this going to apply only to OOB messages, or to regular messages as well? I think I'd prefer it if it only applied to OOB messages, otherwise we'd have to implement our own ordering for regular/async commands. No, I think it'll apply to all messages. A simple implementation could dispatch OOB messages to the thread pool, as they don't need to be ordered. Regular messages could be added to a queue where they are processed sequentially by a *single* thread. Pedro does implement ordering based on transactions (see his prev email), and I think there are some other good use cases for regular messages. I think one thing that could be done for regular messages is to implement something like SCOPE (remember ?) for async RPCs: updates to different web sessions could be processed concurrently, only updates to the *same* session would have to be ordered. Yeah, I agree implementing the regular message ordering ourselves would give us a little more room for optimizations. But it would make our part more complicated, too. Well, not for Total Ordering, because Pedro already implemented it, but for our regular async scenarios we'd need to add a thread pool (we want to allow 2 threads from different sources to access different keys at the same time). This API is not set in stone, we can always change it.
Once I'm done with this and have batching II implemented, plus some other JIRAs, I'll ping you guys and we should have a meeting discussing - Async invocation API - Message batching (also in conjunction with the above) - Message bundling and OOB / DONT_BUNDLE; bundling of OOB messages -- Bela Ban, JGroups lead (http://www.jgroups.org) ___ infinispan-dev mailing list infinispan-dev@lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev ___ infinispan-dev mailing list infinispan-dev@lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev
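Bela's proposed split, OOB messages to a shared pool and regular messages to a single ordered thread, can be sketched as below. This is a toy model, not the actual JGroups dispatcher; the point is only that a single-threaded executor preserves FIFO delivery while the OOB pool runs unordered.

```java
import java.util.concurrent.*;

// Toy model of the dispatching policy described above: OOB messages go to
// a shared pool (no ordering guarantee), regular messages go to a
// single-threaded executor that preserves arrival order.
public class MessageDispatchSketch {
    private final ExecutorService oobPool = Executors.newFixedThreadPool(4);
    private final ExecutorService regularQueue = Executors.newSingleThreadExecutor();

    void dispatch(boolean oob, Runnable work) {
        (oob ? oobPool : regularQueue).execute(work);
    }

    public static void main(String[] args) throws Exception {
        MessageDispatchSketch d = new MessageDispatchSketch();
        ConcurrentLinkedQueue<Integer> order = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 5; i++) {
            final int n = i;
            d.dispatch(false, () -> order.add(n)); // regular: processed in order
        }
        d.regularQueue.shutdown();
        d.regularQueue.awaitTermination(5, TimeUnit.SECONDS);
        d.oobPool.shutdown();
        System.out.println(order); // FIFO order preserved for regular messages
    }
}
```

A finer-grained scheme (per-session or per-transaction ordering, like SCOPE) would replace the single executor with one serial queue per scope.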
Re: [infinispan-dev] Threadpools in a large cluster
On 2/7/13 1:38 PM, Dan Berindei wrote: No, I think it'll apply to all messages. A simple implementation could dispatch OOB messages to the thread pool, as they don't need to be ordered. Regular messages could be added to a queue where they are processed sequentially by a *single* thread. Pedro does implement ordering based on transactions (see his prev email), and I think there are some other good use cases for regular messages. I think one thing that could be done for regular messages is to implement something like SCOPE (remember ?) for async RPCs: updates to different web sessions could be processed concurrently, only updates to the *same* session would have to be ordered. Yeah, I agree implementing the regular message ordering ourselves would give us a little more room for optimizations. But it would make our part more complicated, too. Well, not for Total Ordering, because Pedro already implemented it, but for our regular async scenarios we'd need to add a thread pool (we want to allow 2 threads from different sources to access different keys at the same time). We don't need to implement this *now*, but let's agree on the API (feedback from Pedro and you guys), and add this in JGroups 3.3. You can start out with a very simple implementation, or no implementation at all (then you'd have the same behavior as now), but the point is that you can do this at your own pace, and don't depend on any changes in JGroups (once 3.3 is used by Infinispan). I agree that optimization of thread/request processing *while still preserving ordering where needed* might be a bit tricky, but the potential benefits are great. You could for example add request priorities by adding a parameter/flag to RPCs and dispatching them to a priority queue in your thread pool, without me having to change anything. -- Bela Ban, JGroups lead (http://www.jgroups.org) ___ infinispan-dev mailing list infinispan-dev@lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev
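The priority idea Bela mentions, dispatching RPCs into a priority queue in the application's own pool, might look like the following sketch. The Request class and the priority values are invented for illustration; a real implementation would feed such a queue from the request handler and drain it with worker threads.

```java
import java.util.concurrent.*;

// Sketch of priority-based request dispatch: requests carry a priority
// flag and are drained highest-first, so e.g. cluster-critical RPCs can
// jump ahead without any JGroups changes. Names are illustrative only.
public class PrioritySketch {
    static final class Request {
        final int priority; final String name;
        Request(int priority, String name) { this.priority = priority; this.name = name; }
    }

    public static void main(String[] args) throws Exception {
        // Comparator reversed so that a higher priority value drains first
        PriorityBlockingQueue<Request> queue = new PriorityBlockingQueue<>(
            16, (a, b) -> Integer.compare(b.priority, a.priority));
        queue.add(new Request(1, "get"));
        queue.add(new Request(5, "heartbeat"));
        queue.add(new Request(3, "put"));

        StringBuilder order = new StringBuilder();
        while (!queue.isEmpty())
            order.append(queue.take().name).append(' ');
        System.out.println(order.toString().trim());
    }
}
```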
Re: [infinispan-dev] Threadpools in a large cluster
On 2/7/13 12:36 PM, Manik Surtani wrote: On 7 Feb 2013, at 11:29, Bela Ban b...@redhat.com wrote: I meant to say that this is up to you guys to decide in which Infinispan release this will be used, but it will be available in JGroups 3.3. What's the strategy/schedule for 6 and 5.3 anyway ? 5.3 will be relatively quick, just an incremental release mainly targeting reviewing and putting in a few contribs on the queue. E.g., we're hoping for 5.3 to be feature-complete in no more than 2 months. OK, so let's target JGroups 3.3 to be included in Infinispan 5.3. Then you have a choice of doing this in 5.3 or waiting until 6. 6 is where we get to break API - specifically with the new packaging, etc. But this seems like a pretty important feature/fix so it may make sense to get this into 5.3. -- Bela Ban, JGroups lead (http://www.jgroups.org) ___ infinispan-dev mailing list infinispan-dev@lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev
Re: [infinispan-dev] Threadpools in a large cluster
On 1 Feb 2013, at 08:04, Radim Vansa wrote: Hi guys, after dealing with the large cluster for a while I find the way we use OOB threads in synchronous configuration non-robust. Imagine a situation where a node which is not an owner of the key calls PUT. Then an RPC is called to the primary owner of that key, which reroutes the request to all other owners and, after these reply, replies back. This delegation RPC pattern happens for non-transactional caches only. Do you have the same problem with transactional caches as well? There are two problems: 1) If we simultaneously do X requests from non-owners to the primary owner, where X is the OOB TP size, all the OOB threads are waiting for the responses and there is no thread to process the OOB response and release the thread. 2) Node A is primary owner of keyA and non-primary owner of keyB, and B is primary of keyB and non-primary of keyA. We get many requests for both keyA and keyB from other nodes; therefore, all OOB threads from both nodes call RPC to the non-primary owner, but there's no one who could process the request. While we wait for the requests to time out, the nodes with depleted OOB threadpools start suspecting all other nodes because they can't receive heartbeats etc... You can say "increase your OOB tp size", but that's not always an option; I have currently set it to 1000 threads and it's not enough. In the end, I will always be limited by RAM, and something tells me that even nodes with a few gigs of RAM should be able to form a huge cluster. We use 160 HotRod worker threads in JDG, which means that 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests can be executed, and if 10% target the same node with 1000 OOB threads, it gets stuck. It's about scaling and robustness. Not that I'd have any good solution, but I'd really like to start a discussion. 
Thinking about it a bit, the problem is that a blocking call (calling an RPC on the primary owner from the message handler) can block non-blocking calls (such as an RPC response, or a command that never sends any more messages). Therefore, having a flag on the message saying "this won't send another message" could let the message be executed in a different threadpool, which can never deadlock. In fact, the pools could share the threads, but the non-blocking one would always have a few threads spare. It's a bad solution, as maintaining which message could block on the other node is really, really hard (we can be sure only in the case of RPC responses), especially when some locks come into play. I will welcome anything better. Radim --- Radim Vansa Quality Assurance Engineer JBoss Datagrid tel. +420532294559 ext. 62559 Red Hat Czech, s.r.o. Brno, Purkyňova 99/71, PSČ 612 45 Czech Republic ___ infinispan-dev mailing list infinispan-dev@lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev Cheers, -- Mircea Markus Infinispan lead (www.infinispan.org) ___ infinispan-dev mailing list infinispan-dev@lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev
Re: [infinispan-dev] Threadpools in a large cluster
On 1 Feb 2013, at 09:54, Dan Berindei wrote: Yeah, I wouldn't call this a simple solution... The distribution/replication interceptors are quite high in the interceptor stack, so we'd have to save the state of the interceptor stack (basically the thread's stack) somehow and resume processing it on the thread receiving the responses. In a language that supports continuations that would be a piece of cake, but since we're in Java we'd have to completely change the way the interceptor stack works. Actually we do hold the lock on modified keys while the command is replicated to the other owners. But I think locking wouldn't be a problem: we already allow locks to be owned by transactions instead of threads, so it would just be a matter of creating a lite transaction for non-transactional caches. Obviously the TransactionSynchronizerInterceptor would have to go, but I see that as a positive thing ;) The TransactionSynchronizerInterceptor protected the CacheTransaction objects from multiple writes; we'd still need that because of the NBST forwarding. So yeah, it could work, but it would take a huge amount of effort and it's going to obfuscate the code. Plus, I'm not at all convinced that it's going to improve performance that much compared to a new thread pool. +1 Cheers Dan On Fri, Feb 1, 2013 at 10:59 AM, Radim Vansa rva...@redhat.com wrote: Yeah, that would work if it is possible to break the execution path into the FutureListener from the middle of the interceptor stack - I am really not sure about that, but as in the current design no locks should be held when an RPC is called, it may be possible. Let's see what someone more informed (Dan?) would think about that. 
Thanks, Bela Radim - Original Message - | From: Bela Ban b...@redhat.com | To: infinispan-dev@lists.jboss.org | Sent: Friday, February 1, 2013 9:39:43 AM | Subject: Re: [infinispan-dev] Threadpools in a large cluster | | It looks like the core problem is an incoming RPC-1 which triggers | another blocking RPC-2: the thread delivering RPC-1 is blocked | waiting | for the response from RPC-2, and can therefore not be used to serve | other requests for the duration of RPC-2. If RPC-2 takes a while, | e.g. | waiting to acquire a lock in the remote node, then it is clear that | the | thread pool will quickly exceed its max size. | | A simple solution would be to prevent invoking blocking RPCs *from | within* a received RPC. Let's take a look at an example: | - A invokes a blocking PUT-1 on B | - B forwards the request as blocking PUT-2 to C and D | - When PUT-2 returns and B gets the responses from C and D (or the | first | one to respond, don't know exactly how this is implemented), it sends | the response back to A (PUT-1 terminates now at A) | | We could change this to the following: | - A invokes a blocking PUT-1 on B | - B receives PUT-1. Instead of invoking a blocking PUT-2 on C and D, | it | does the following: | - B invokes PUT-2 and gets a future | - B adds itself as a FutureListener, and it also stores the | address of the original sender (A) | - When the FutureListener is invoked, B sends back the result | as a | response to A | - Whenever a member leaves the cluster, the corresponding futures are | cancelled and removed from the hashmaps | | This could probably be done differently (e.g. by sending asynchronous | messages and implementing a finite state machine), but the core of | the | solution is the same; namely to avoid having an incoming thread block | on | a sync RPC. | | Thoughts ? 
| | | | | On 2/1/13 9:04 AM, Radim Vansa wrote: | Hi guys, | | after dealing with the large cluster for a while I find the way how | we use OOB threads in synchronous configuration non-robust. | Imagine a situation where node which is not an owner of the key | calls PUT. Then the a RPC is called to the primary owner of that | key, which reroutes the request to all other owners and after | these reply, it replies back. | There are two problems: | 1) If we do simultanously X requests from non-owners to the primary | owner where X is OOB TP size, all the OOB threads are waiting for | the responses and there is no thread to process the OOB response | and release the thread. | 2) Node A is primary owner of keyA, non-primary owner of keyB and B | is primary of keyB and non-primary of keyA. We got many requests | for both keyA and keyB from other nodes, therefore, all OOB | threads from both nodes call RPC to the non-primary owner but | there's noone who could process the request. | | While we wait for the requests to timeout, the nodes with depleted | OOB threadpools start suspecting all other nodes because they | can't receive heartbeats etc... | | You can say increase your OOB tp size, but that's not always an | option, I have currently set it to 1000 threads and it's
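Bela's FutureListener scheme from the quoted proposal could be sketched with CompletableFuture as below. The handlePut method and the executor simulating the backup owners (C and D) are invented stand-ins for the real owner RPCs; the point is that B's delivery thread registers a callback instead of blocking on PUT-2.

```java
import java.util.concurrent.*;
import java.util.function.Consumer;

// Sketch of the non-blocking forwarding idea: instead of blocking the
// delivery thread on PUT-2, B registers a listener on the future and
// replies to the original sender A when the backup owners respond.
// handlePut and the executor are illustrative, not Infinispan code.
public class NonBlockingForwardSketch {
    static ExecutorService backupOwners = Executors.newFixedThreadPool(2);

    // B's handler for PUT-1: fires PUT-2 and returns immediately.
    static void handlePut(String key, Consumer<String> replyToA) {
        CompletableFuture
            .supplyAsync(() -> "ack-" + key, backupOwners) // PUT-2 to C and D
            .thenAccept(replyToA); // listener sends the response back to A
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> replyAtA = new CompletableFuture<>();
        handlePut("k1", replyAtA::complete); // returns without waiting
        System.out.println(replyAtA.get(5, TimeUnit.SECONDS));
        backupOwners.shutdown();
    }
}
```

Member-leave handling (cancelling the pending futures, as Bela notes) would correspond to completing these futures exceptionally when a view change removes the target.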
Re: [infinispan-dev] Threadpools in a large cluster
On Thu, Feb 7, 2013 at 8:05 PM, Mircea Markus mmar...@redhat.com wrote: On 1 Feb 2013, at 09:54, Dan Berindei wrote: Yeah, I wouldn't call this a simple solution... The distribution/replication interceptors are quite high in the interceptor stack, so we'd have to save the state of the interceptor stack (basically the thread's stack) somehow and resume processing it on the thread receiving the responses. In a language that supports continuations that would be a piece of cake, but since we're in Java we'd have to completely change the way the interceptor stack works. Actually we do hold the lock on modified keys while the command is replicated to the other owners. But think locking wouldn't be a problem: we already allow locks to be owned by transactions instead of threads, so it would just be a matter of creating a lite transaction for non-transactional caches. Obviously the TransactionSynchronizerInterceptor would have to go, but I see that as a positive thing ;) The TransactionSynchronizerInterceptor protected the CacheTransaction objects from multiple writes, we'd still need that because of the NBST forwarding. We wouldn't need it if access to the Collection members in CacheTransaction was properly synchronized. Perhaps hack is too strong a word, let's just say I'm seeing TransactionSynchronizerInterceptor as a temporary solution :) So yeah, it could work, but it would take a huge amount of effort and it's going to obfuscate the code. Plus, I'm not at all convinced that it's going to improve performance that much compared to a new thread pool. +1 Cheers Dan On Fri, Feb 1, 2013 at 10:59 AM, Radim Vansa rva...@redhat.com wrote: Yeah, that would work if it is possible to break execution path into the FutureListener from the middle of interceptor stack - I am really not sure about that but as in current design no locks should be held when a RPC is called, it may be possible. Let's see what someone more informed (Dan?) would think about that. 
Thanks, Bela Radim - Original Message - | From: Bela Ban b...@redhat.com | To: infinispan-dev@lists.jboss.org | Sent: Friday, February 1, 2013 9:39:43 AM | Subject: Re: [infinispan-dev] Threadpools in a large cluster | | It looks like the core problem is an incoming RPC-1 which triggers | another blocking RPC-2: the thread delivering RPC-1 is blocked | waiting | for the response from RPC-2, and can therefore not be used to serve | other requests for the duration of RPC-2. If RPC-2 takes a while, | e.g. | waiting to acquire a lock in the remote node, then it is clear that | the | thread pool will quickly exceed its max size. | | A simple solution would be to prevent invoking blocking RPCs *from | within* a received RPC. Let's take a look at an example: | - A invokes a blocking PUT-1 on B | - B forwards the request as blocking PUT-2 to C and D | - When PUT-2 returns and B gets the responses from C and D (or the | first | one to respond, don't know exactly how this is implemented), it sends | the response back to A (PUT-1 terminates now at A) | | We could change this to the following: | - A invokes a blocking PUT-1 on B | - B receives PUT-1. Instead of invoking a blocking PUT-2 on C and D, | it | does the following: | - B invokes PUT-2 and gets a future | - B adds itself as a FutureListener, and it also stores the | address of the original sender (A) | - When the FutureListener is invoked, B sends back the result | as a | response to A | - Whenever a member leaves the cluster, the corresponding futures are | cancelled and removed from the hashmaps | | This could probably be done differently (e.g. by sending asynchronous | messages and implementing a finite state machine), but the core of | the | solution is the same; namely to avoid having an incoming thread block | on | a sync RPC. | | Thoughts ? 
| | | | | On 2/1/13 9:04 AM, Radim Vansa wrote: | Hi guys, | | after dealing with the large cluster for a while I find the way how | we use OOB threads in synchronous configuration non-robust. | Imagine a situation where node which is not an owner of the key | calls PUT. Then the a RPC is called to the primary owner of that | key, which reroutes the request to all other owners and after | these reply, it replies back. | There are two problems: | 1) If we do simultanously X requests from non-owners to the primary | owner where X is OOB TP size, all the OOB threads are waiting for | the responses and there is no thread to process the OOB response | and release the thread. | 2) Node A is primary owner of keyA, non-primary owner of keyB and B | is primary of keyB and non-primary of keyA. We got many requests | for both keyA and keyB from other nodes, therefore, all OOB | threads from both nodes call RPC to the non-primary owner but | there's noone who could process the request. | | While we wait for the requests
Re: [infinispan-dev] Threadpools in a large cluster
Hi Pedro, this is almost exactly what I wanted to implement ! Question: - In RequestCorrelator.handleRequest(): protected void handleRequest(Message req, Header hdr) { Object retval; boolean threwException = false; MessageRequest messageRequest = new MessageRequestImpl(req, hdr); try { retval=request_handler.handle(messageRequest); } catch(Throwable t) { retval=t; threwException = true; } messageRequest.sendReply(retval, threwException);// -- should be moved up, or called only if threwException == true } , you create a MessageRequestImpl and pass it to the RequestHandler. The request handler then dispatches the request (possibly) to a thread pool and calls MessageRequestImpl.sendReply() when done. However, you also call MessageRequest.sendReply() before returning from handleRequest(). I think this is an error, and MessageRequest.sendReply() should be moved up inside the catch clause, or be called only if threwException is true, so that we send a reply on behalf of the RequestHandler if and only if it threw an exception (e.g. before it dispatches the request to a thread pool). Otherwise, we'd send a reply *twice* ! A few changes I have in mind (need to think about it more): - I want to leave the existing RequestHandler interface in place, so current implementation continue to work - There will be a new AsyncRequestHandler interface (possibly extending RequestHandler, so an implementation can decide to implement both). The RequestCorrelator needs to have either request_handler or async_request_handler set. If the former is set, the logic is unchanged. If the latter is set I'll invoke the async dispatching code - AsyncRequestHandler will look similar to the following: void handle(Message request, Handback hb, boolean requires_response) throws Throwable; - Handback is an interface, and its impl contains header information (e.g. 
request ID) - Handback has a sendReply(Object reply, boolean is_exception) method which sends a response (or exception) back to the caller - When requires_response is false, the AsyncRequestHandler doesn't need to invoke sendReply() - Message batching - The above interfaces need to take message batching into account, e.g. the ability to handle multiple requests concurrently (if they don't need to be executed sequentially) Thoughts ? On 2/6/13 8:29 PM, Pedro Ruivo wrote: Hi all, Recently I came up with a solution that can help with the thread pool problem, motivated by the following: In one of the first implementations of the Total Order based commit protocol (TO), I had the requirement to move the PrepareCommand to another thread pool. In short, the TO protocol delivers the PrepareCommand in a deterministic order in all the nodes, by a single deliver thread. To ensure consistency, if it delivers two conflicting transactions, the second transaction must wait until the first transaction finishes. However, blocking the single deliver thread is not a good solution, because no more transactions can be validated, even if they don't conflict, while the thread is blocked. So, after creating a dependency graph (i.e. the second transaction knows that it must wait for the first transaction to finish), I move the PrepareCommand to another thread pool. Initially, I implemented a new command, called PrepareResponseCommand, that sends back the reply of the PrepareCommand. This solution has one disadvantage: I had to implement an ack collector in ISPN, while JGroups already offers me that with synchronous communication. Recently (2 or 3 months ago) I implemented a simple modification in JGroups. In a more generic approach, it allows other threads to reply to an RPC request (such as the PrepareCommand). In the previous scenario, I replaced the PrepareResponseCommand and the ack collector implementation with a synchronous RPC invocation. 
I've used this solution in other issues in the Cloud-TM's ISPN fork. This solution is quite simple to implement and may help you to move the commands to ISPN internal thread pools. The modifications I've made are the following: 1) I added a new interface (see [1]) that is sent to the application instead of the Message object (see [4]). This interface contains the Message and it has a method to allow the application to send the reply to that particular request. 2) I added a new object in [4] meaning "this return value is not the reply to the RPC request". This is the value I return when I want to release the thread, because ISPN should return some object in the handle() method. Of course, I know that ISPN will invoke sendReply() in some other place; otherwise, I will get a TimeoutException on the sender side. 3) Also, I've changed the RequestCorrelator implementation to support the previous modifications (see [2] and [3]). In the Cloud-TM's ISPN fork I added a reference in the BaseCacheRpcCommand to [1] and added the method sendReply() [5]. In
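Pedro's DO_NOT_REPLY mechanism (release the delivery thread by returning a marker value, then send the real reply later) roughly corresponds to the sketch below. The names (MessageRequest, sendReply, the command string) are illustrative, not the exact Cloud-TM fork API.

```java
import java.util.concurrent.*;

// Sketch of the "marker return value" pattern: handle() returns
// DO_NOT_REPLY to release the JGroups delivery thread, and the real
// reply is sent later via sendReply() from an internal pool.
// All names here are illustrative, not the actual fork API.
public class DoNotReplySketch {
    static final Object DO_NOT_REPLY = new Object();
    static ExecutorService ispnPool = Executors.newSingleThreadExecutor();

    interface MessageRequest { void sendReply(Object reply, boolean exception); }

    // Correlator side: a return value equal to DO_NOT_REPLY is dropped,
    // so no reply is sent on behalf of the handler.
    static Object handle(String command, MessageRequest req) {
        ispnPool.submit(() -> req.sendReply("committed:" + command, false));
        return DO_NOT_REPLY; // delivery thread is released immediately
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Object> reply = new CompletableFuture<>();
        Object ret = handle("prepare-tx1", (r, ex) -> reply.complete(r));
        System.out.println(ret == DO_NOT_REPLY); // marker suppresses the reply
        System.out.println(reply.get(5, TimeUnit.SECONDS));
        ispnPool.shutdown();
    }
}
```

If the handler forgets to invoke sendReply() later, the caller's synchronous RPC simply times out, which matches the TimeoutException Pedro mentions.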
Re: [infinispan-dev] Threadpools in a large cluster
I agree that an application thread pool just pushes the issue of the OOB pool running out of threads elsewhere, but that is only one of the two problems Radim has. The other is that nodes get suspected and kicked out because heartbeat messages get blocked as well. Same thing with FC credit messages. By having a separate application pool, at least we guarantee that the cluster service messages get handled… On 3 Feb 2013, at 11:23, Bela Ban b...@redhat.com wrote: A new thread pool owned by Infinispan is certainly something desirable, as discussed in Palma, but I think it wouldn't solve the issue Radim ran into, namely threads being used despite the fact that they only wait for another blocking RPC to finish. If we made the JGroups thread return immediately by transferring control to an Infinispan thread, then we'd simply move the issue from the former to the latter pool. Eventually, the Infinispan pool would run out of threads. Coming back to the specific problem Radim ran into: the forwarding of a PUT doesn't hold any locks, so your argument below wouldn't hold. However, of course this is only one specific scenario, and you're probably right that we'd have to consider the more general case of a thread holding locks... All said, I believe it would still be worthwhile looking into a more non-blocking way of invoking RPCs, that doesn't occupy threads which essentially only wait on IO (network traffic)... A simple state machine approach could be the solution to this... On 2/1/13 10:54 AM, Dan Berindei wrote: Yeah, I wouldn't call this a simple solution... The distribution/replication interceptors are quite high in the interceptor stack, so we'd have to save the state of the interceptor stack (basically the thread's stack) somehow and resume processing it on the thread receiving the responses. In a language that supports continuations that would be a piece of cake, but since we're in Java we'd have to completely change the way the interceptor stack works. 
Actually we do hold the lock on modified keys while the command is replicated to the other owners. But think locking wouldn't be a problem: we already allow locks to be owned by transactions instead of threads, so it would just be a matter of creating a lite transaction for non-transactional caches. Obviously the TransactionSynchronizerInterceptor would have to go, but I see that as a positive thing ;) So yeah, it could work, but it would take a huge amount of effort and it's going to obfuscate the code. Plus, I'm not at all convinced that it's going to improve performance that much compared to a new thread pool. Cheers Dan On Fri, Feb 1, 2013 at 10:59 AM, Radim Vansa rva...@redhat.com mailto:rva...@redhat.com wrote: Yeah, that would work if it is possible to break execution path into the FutureListener from the middle of interceptor stack - I am really not sure about that but as in current design no locks should be held when a RPC is called, it may be possible. Let's see what someone more informed (Dan?) would think about that. Thanks, Bela Radim - Original Message - | From: Bela Ban b...@redhat.com mailto:b...@redhat.com | To: infinispan-dev@lists.jboss.org mailto:infinispan-dev@lists.jboss.org | Sent: Friday, February 1, 2013 9:39:43 AM | Subject: Re: [infinispan-dev] Threadpools in a large cluster | | It looks like the core problem is an incoming RPC-1 which triggers | another blocking RPC-2: the thread delivering RPC-1 is blocked | waiting | for the response from RPC-2, and can therefore not be used to serve | other requests for the duration of RPC-2. If RPC-2 takes a while, | e.g. | waiting to acquire a lock in the remote node, then it is clear that | the | thread pool will quickly exceed its max size. | | A simple solution would be to prevent invoking blocking RPCs *from | within* a received RPC. 
Let's take a look at an example: | - A invokes a blocking PUT-1 on B | - B forwards the request as blocking PUT-2 to C and D | - When PUT-2 returns and B gets the responses from C and D (or the | first | one to respond, don't know exactly how this is implemented), it sends | the response back to A (PUT-1 terminates now at A) | | We could change this to the following: | - A invokes a blocking PUT-1 on B | - B receives PUT-1. Instead of invoking a blocking PUT-2 on C and D, | it | does the following: | - B invokes PUT-2 and gets a future | - B adds itself as a FutureListener, and it also stores the | address of the original sender (A) | - When the FutureListener is invoked, B sends back the result | as a | response to A | - Whenever a member leaves the cluster, the corresponding futures are | cancelled and removed from the hashmaps
Re: [infinispan-dev] Threadpools in a large cluster
On 4 Feb 2013, at 07:46, Dan Berindei dan.berin...@gmail.com wrote: Switching to a state machine approach would require rethinking and rewriting all our interceptors, and I'm pretty sure the code would get more complex and harder to debug (to say nothing about interpreting the logs). Are you sure it's going to have that many benefits to make it worthwhile? I think it could be worthwhile, but I agree it is not at all trivial. So that's actually a hard call as a result - not so easy to prototype and benchmark. -- Manik Surtani ma...@jboss.org twitter.com/maniksurtani Platform Architect, JBoss Data Grid http://red.ht/data-grid ___ infinispan-dev mailing list infinispan-dev@lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev
Re: [infinispan-dev] Threadpools in a large cluster
If you send me the details, I'll take a look. I'm pretty busy with message batching, so I can't promise next week, but soon... On 2/1/13 11:08 AM, Pedro Ruivo wrote: Hi, I had a similar problem when I tried GMU[1] in a large cluster (40 VMs), because the remote gets and the commit messages (I'm talking about ISPN commands) must wait for some conditions before being processed. I solved this problem by adding a feature in JGroups[2] that allows the request to be moved to another thread, releasing the OOB thread. The other thread will send the reply of the JGroups Request. Of course, I'm only moving commands that I know can block. I can go into more detail if you want =) Cheers, Pedro [1] http://www.gsd.inesc-id.pt/~romanop/files/papers/icdcs12.pdf [2] I would like to talk with Bela about this, because it makes my life easier to support total order in ISPN. I'll try to send an email this weekend =) On 01-02-2013 08:04, Radim Vansa wrote: Hi guys, after dealing with the large cluster for a while I find the way we use OOB threads in synchronous configuration non-robust. Imagine a situation where a node which is not an owner of the key calls PUT. Then an RPC is called to the primary owner of that key, which reroutes the request to all other owners and, after these reply, replies back. There are two problems: 1) If we simultaneously do X requests from non-owners to the primary owner, where X is the OOB TP size, all the OOB threads are waiting for the responses and there is no thread to process the OOB response and release the thread. 2) Node A is primary owner of keyA and non-primary owner of keyB, and B is primary of keyB and non-primary of keyA. We get many requests for both keyA and keyB from other nodes; therefore, all OOB threads from both nodes call RPC to the non-primary owner, but there's no one who could process the request. 
While we wait for the requests to time out, the nodes with depleted OOB threadpools start suspecting all other nodes because they can't receive heartbeats etc... You can say "increase your OOB TP size", but that's not always an option; I have currently set it to 1000 threads and it's not enough. In the end, I will always be limited by RAM, and something tells me that even nodes with a few gigs of RAM should be able to form a huge cluster. We use 160 HotRod worker threads in JDG, which means that 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests can be executed, and if 10% of them target the same node with its 1000 OOB threads, it gets stuck. It's about scaling and robustness. Not that I'd have any good solution, but I'd really like to start a discussion. Thinking about it a bit, the problem is that a blocking call (calling an RPC on the primary owner from the message handler) can block non-blocking calls (such as an RPC response, or a command that never sends any more messages). Therefore, a flag on the message saying "this won't send another message" could let the message be executed in a different threadpool, which would never be deadlocked. In fact, the pools could share the threads, but the non-blocking pool would always have a few threads spare. It's a bad solution, as maintaining which messages could block on the other node is really, really hard (we can be sure only in the case of RPC responses), especially once locks come into play. I will welcome anything better. Radim --- Radim Vansa Quality Assurance Engineer JBoss Datagrid tel. +420532294559 ext. 62559 Red Hat Czech, s.r.o.
Brno, Purkyňova 99/71, PSČ 612 45 Czech Republic -- Bela Ban, JGroups lead (http://www.jgroups.org)
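A minimal sketch of the handoff Pedro describes: the handler detects a command that may block and moves it to a secondary pool, so the bounded OOB thread returns immediately and the reply is sent from the other thread. All names here are illustrative, not the actual JGroups API.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class OobHandoffSketch {
    // Secondary pool for commands known to potentially block (e.g. waiting
    // for locks or for delivery conditions, as in GMU/total order).
    private final ExecutorService blockingPool = Executors.newCachedThreadPool();

    // Called on an OOB thread when a request arrives. If the command may
    // block, hand it off so the OOB thread is released at once; the reply
    // is then sent from the secondary thread instead of the OOB thread.
    public void deliver(Callable<String> command, boolean mayBlock, Consumer<String> sendReply) {
        Runnable work = () -> {
            try {
                sendReply.accept(command.call());
            } catch (Exception e) {
                sendReply.accept("EXCEPTION: " + e.getMessage());
            }
        };
        if (mayBlock) {
            blockingPool.execute(work);   // OOB thread returns right away
        } else {
            work.run();                   // fast path: handle inline
        }
    }

    public static void main(String[] args) throws Exception {
        OobHandoffSketch h = new OobHandoffSketch();
        BlockingQueue<String> replies = new LinkedBlockingQueue<>();
        // A command that blocks briefly, standing in for a commit waiting on a condition.
        h.deliver(() -> { Thread.sleep(50); return "commit-ok"; }, true, replies::add);
        h.deliver(() -> "get-ok", false, replies::add);   // non-blocking command
        System.out.println(replies.take());
        System.out.println(replies.take());
        h.blockingPool.shutdown();
    }
}
```

Note that only commands the sender *knows* can block are moved; everything else keeps the cheap inline path, which is exactly the classification problem Radim points out below.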
Re: [infinispan-dev] Threadpools in a large cluster
It looks like the core problem is an incoming RPC-1 which triggers another blocking RPC-2: the thread delivering RPC-1 is blocked waiting for the response from RPC-2, and can therefore not be used to serve other requests for the duration of RPC-2. If RPC-2 takes a while, e.g. waiting to acquire a lock on the remote node, then it is clear that the thread pool will quickly exceed its max size. A simple solution would be to prevent invoking blocking RPCs *from within* a received RPC. Let's take a look at an example:
- A invokes a blocking PUT-1 on B
- B forwards the request as blocking PUT-2 to C and D
- When PUT-2 returns and B gets the responses from C and D (or the first one to respond, I don't know exactly how this is implemented), it sends the response back to A (PUT-1 terminates now at A)
We could change this to the following:
- A invokes a blocking PUT-1 on B
- B receives PUT-1. Instead of invoking a blocking PUT-2 on C and D, it does the following:
  - B invokes PUT-2 and gets a future
  - B adds itself as a FutureListener, and it also stores the address of the original sender (A)
  - When the FutureListener is invoked, B sends back the result as a response to A
- Whenever a member leaves the cluster, the corresponding futures are cancelled and removed from the hashmaps
This could probably be done differently (e.g. by sending asynchronous messages and implementing a finite state machine), but the core of the solution is the same; namely, to avoid having an incoming thread block on a sync RPC. Thoughts? On 2/1/13 9:04 AM, Radim Vansa wrote: Hi guys, after dealing with a large cluster for a while, I find the way we use OOB threads in synchronous configurations non-robust. Imagine a situation where a node which is not an owner of the key calls PUT. Then an RPC is made to the primary owner of that key, which reroutes the request to all other owners and, after they reply, replies back.
There are two problems: 1) If we simultaneously make X requests from non-owners to the primary owner, where X is the OOB TP size, all the OOB threads end up waiting for the responses and there is no thread left to process the OOB responses and release them. 2) Node A is the primary owner of keyA and a non-primary owner of keyB, and B is the primary of keyB and a non-primary of keyA. When we get many requests for both keyA and keyB from other nodes, all OOB threads on both nodes call RPCs to the non-primary owner, but there is no one left who could process the requests. While we wait for the requests to time out, the nodes with depleted OOB threadpools start suspecting all other nodes because they can't receive heartbeats etc... You can say "increase your OOB TP size", but that's not always an option; I have currently set it to 1000 threads and it's not enough. In the end, I will always be limited by RAM, and something tells me that even nodes with a few gigs of RAM should be able to form a huge cluster. We use 160 HotRod worker threads in JDG, which means that 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests can be executed, and if 10% of them target the same node with its 1000 OOB threads, it gets stuck. It's about scaling and robustness. Not that I'd have any good solution, but I'd really like to start a discussion. Thinking about it a bit, the problem is that a blocking call (calling an RPC on the primary owner from the message handler) can block non-blocking calls (such as an RPC response, or a command that never sends any more messages). Therefore, a flag on the message saying "this won't send another message" could let the message be executed in a different threadpool, which would never be deadlocked. In fact, the pools could share the threads, but the non-blocking pool would always have a few threads spare. It's a bad solution, as maintaining which messages could block on the other node is really, really hard (we can be sure only in the case of RPC responses), especially once locks come into play. I will welcome anything better.
-- Bela Ban, JGroups lead (http://www.jgroups.org)
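Bela's future-plus-listener idea can be sketched with `CompletableFuture` standing in for JGroups' future/FutureListener machinery (the method and parameter names below are illustrative, not real Infinispan or JGroups APIs): B forwards PUT-2 asynchronously, remembers the original sender A in the completion callback, and the delivery thread returns immediately.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiConsumer;

public class AsyncForwardSketch {
    private final ExecutorService downstream = Executors.newCachedThreadPool();

    // B receives PUT-1 from A. Instead of blocking on PUT-2, it obtains a
    // future, registers a completion callback that captures A's address,
    // and returns at once; the response to A is sent from the callback.
    public void onPut1(String sender, Callable<String> put2, BiConsumer<String, String> respond) {
        CompletableFuture
            .supplyAsync(() -> {
                try { return put2.call(); }
                catch (Exception e) { throw new CompletionException(e); }
            }, downstream)
            // The "FutureListener": fires when PUT-2 completes (or fails).
            .whenComplete((result, err) -> respond.accept(sender, err == null ? result : "FAILED"));
        // The delivery thread is free again here; no sync RPC was made on it.
    }

    public static void main(String[] args) throws Exception {
        AsyncForwardSketch b = new AsyncForwardSketch();
        BlockingQueue<String> wire = new LinkedBlockingQueue<>();
        b.onPut1("A", () -> "put-2-done", (dest, res) -> wire.add(dest + ":" + res));
        System.out.println(wire.take());  // A:put-2-done
        b.downstream.shutdown();
    }
}
```

Cancelling the pending futures when a member leaves, as Bela notes, maps to calling `cancel(true)` on the futures stored per member.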
Re: [infinispan-dev] Threadpools in a large cluster
Yeah, that would work if it is possible to break the execution path into the FutureListener from the middle of the interceptor stack - I am really not sure about that, but since in the current design no locks should be held when an RPC is called, it may be possible. Let's see what someone more informed (Dan?) would think about that. Thanks, Bela. Radim - Original Message - | From: Bela Ban b...@redhat.com | To: infinispan-dev@lists.jboss.org | Sent: Friday, February 1, 2013 9:39:43 AM | Subject: Re: [infinispan-dev] Threadpools in a large cluster | | It looks like the core problem is an incoming RPC-1 which triggers | another blocking RPC-2: the thread delivering RPC-1 is blocked | waiting | for the response from RPC-2, and can therefore not be used to serve | other requests for the duration of RPC-2. If RPC-2 takes a while, | e.g. | waiting to acquire a lock in the remote node, then it is clear that | the | thread pool will quickly exceed its max size. | | A simple solution would be to prevent invoking blocking RPCs *from | within* a received RPC. Let's take a look at an example: | - A invokes a blocking PUT-1 on B | - B forwards the request as blocking PUT-2 to C and D | - When PUT-2 returns and B gets the responses from C and D (or the | first | one to respond, don't know exactly how this is implemented), it sends | the response back to A (PUT-1 terminates now at A) | | We could change this to the following: | - A invokes a blocking PUT-1 on B | - B receives PUT-1. Instead of invoking a blocking PUT-2 on C and D, | it | does the following: | - B invokes PUT-2 and gets a future | - B adds itself as a FutureListener, and it also stores the | address of the original sender (A) | - When the FutureListener is invoked, B sends back the result | as a | response to A | - Whenever a member leaves the cluster, the corresponding futures are | cancelled and removed from the hashmaps | | This could probably be done differently (e.g.
by sending asynchronous | messages and implementing a finite state machine), but the core of | the | solution is the same; namely to avoid having an incoming thread block | on | a sync RPC. | | Thoughts ? | | | | | On 2/1/13 9:04 AM, Radim Vansa wrote: | Hi guys, | | after dealing with the large cluster for a while I find the way how | we use OOB threads in synchronous configuration non-robust. | Imagine a situation where node which is not an owner of the key | calls PUT. Then the a RPC is called to the primary owner of that | key, which reroutes the request to all other owners and after | these reply, it replies back. | There are two problems: | 1) If we do simultanously X requests from non-owners to the primary | owner where X is OOB TP size, all the OOB threads are waiting for | the responses and there is no thread to process the OOB response | and release the thread. | 2) Node A is primary owner of keyA, non-primary owner of keyB and B | is primary of keyB and non-primary of keyA. We got many requests | for both keyA and keyB from other nodes, therefore, all OOB | threads from both nodes call RPC to the non-primary owner but | there's noone who could process the request. | | While we wait for the requests to timeout, the nodes with depleted | OOB threadpools start suspecting all other nodes because they | can't receive heartbeats etc... | | You can say increase your OOB tp size, but that's not always an | option, I have currently set it to 1000 threads and it's not | enough. In the end, I will be always limited by RAM and something | tells me that even nodes with few gigs of RAM should be able to | form a huge cluster. We use 160 hotrod worker threads in JDG, that | means that 160 * clusterSize = 10240 (64 nodes in my cluster) | parallel requests can be executed, and if 10% targets the same | node with 1000 OOB threads, it stucks. It's about scaling and | robustness. | | Not that I'd have any good solution, but I'd really like to start a | discussion. 
| Thinking about it a bit, the problem is that blocking call (calling | RPC on primary owner from message handler) can block non-blocking | calls (such as RPC response or command that never sends any more | messages). Therefore, having a flag on message this won't send | another message could let the message be executed in different | threadpool, which will be never deadlocked. In fact, the pools | could share the threads but the non-blocking would have always a | few threads spare. | It's a bad solution as maintaining which message could block in the | other node is really, really hard (we can be sure only in case of | RPC responses), especially when some locks come. I will welcome | anything better. | | -- | Bela Ban, JGroups lead (http://www.jgroups.org) | | ___ | infinispan-dev mailing list | infinispan-dev@lists.jboss.org | https://lists.jboss.org/mailman/listinfo/infinispan-dev
Re: [infinispan-dev] Threadpools in a large cluster
Radim, do these problems happen with the HotRod server, or only with memcached? HotRod requests handled by non-owners should be very rare; instead, the vast majority should be handled by the primary owner directly. So if this happens with HotRod, we should focus on fixing the HotRod routing instead of focusing on how to handle a large number of requests from non-owners. That being said, even if a HotRod put request is handled by the primary owner, it generates (numOwners - 1) extra OOB requests. So if you have 160 HotRod worker threads per node, you can expect 4 * 160 OOB messages per node. Multiply that by 2, because responses are OOB as well, and you can get 1280 OOB messages before you even start reusing any HotRod worker thread. Have you tried decreasing the number of HotRod workers? The thing is, our OOB thread pool can't use queueing, because we'd get a queue full of commit commands while all the OOB threads are waiting on keys that those commit commands would unlock. As the OOB thread pool is full, we discard messages, which I suspect slows things down quite a bit (especially if it's a credit request/response message). So it may well be that a lower number of HotRod worker threads would perform better. On the other hand, why is increasing the number of OOB threads not a solution? With -Xss 512k, you can get 2000 threads with only 1 GB of virtual memory (the actual used memory is probably even less, unless you're using huge pages). AFAIK the Linux kernel doesn't break a sweat with 10 threads running, so having 2000 threads just hanging around, waiting for a response, shouldn't be such a problem. During the team meeting in Palma I did chat with Bela (or was it a break-out session?) about moving Infinispan's request processing to another thread pool.
That would leave the OOB thread pool free to receive response messages, FD heartbeats, credit requests/responses etc. The downside, I guess, is that each request would have to be passed to another thread, and the context switch may slow things down a bit. But since the new thread pool would be in Infinispan, we could even do tricks like executing a commit/rollback directly on the OOB thread. In the end, I just didn't feel that working on this was justified, considering the number of critical bugs we had. But maybe now's the time to start experimenting... On Fri, Feb 1, 2013 at 10:04 AM, Radim Vansa rva...@redhat.com wrote: Hi guys, after dealing with a large cluster for a while, I find the way we use OOB threads in synchronous configurations non-robust. Imagine a situation where a node which is not an owner of the key calls PUT. Then an RPC is made to the primary owner of that key, which reroutes the request to all other owners and, after they reply, replies back. There are two problems: 1) If we simultaneously make X requests from non-owners to the primary owner, where X is the OOB TP size, all the OOB threads end up waiting for the responses and there is no thread left to process the OOB responses and release them. 2) Node A is the primary owner of keyA and a non-primary owner of keyB, and B is the primary of keyB and a non-primary of keyA. When we get many requests for both keyA and keyB from other nodes, all OOB threads on both nodes call RPCs to the non-primary owner, but there is no one left who could process the requests. While we wait for the requests to time out, the nodes with depleted OOB threadpools start suspecting all other nodes because they can't receive heartbeats etc... You can say "increase your OOB TP size", but that's not always an option; I have currently set it to 1000 threads and it's not enough. In the end, I will always be limited by RAM, and something tells me that even nodes with a few gigs of RAM should be able to form a huge cluster.
We use 160 HotRod worker threads in JDG, which means that 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests can be executed, and if 10% of them target the same node with its 1000 OOB threads, it gets stuck. It's about scaling and robustness. Not that I'd have any good solution, but I'd really like to start a discussion. Thinking about it a bit, the problem is that a blocking call (calling an RPC on the primary owner from the message handler) can block non-blocking calls (such as an RPC response, or a command that never sends any more messages). Therefore, a flag on the message saying "this won't send another message" could let the message be executed in a different threadpool, which would never be deadlocked. In fact, the pools could share the threads, but the non-blocking pool would always have a few threads spare. It's a bad solution, as maintaining which messages could block on the other node is really, really hard (we can be sure only in the case of RPC responses), especially once locks come into play. I will welcome anything better. Radim --- Radim Vansa Quality Assurance Engineer JBoss Datagrid tel. +420532294559 ext. 62559 Red Hat Czech,
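Dan's two back-of-the-envelope calculations can be checked directly; the fan-out factor of 4 is taken from his numbers in the email, and the stack figure follows from `-Xss512k`:

```java
public class BackOfTheEnvelope {
    public static void main(String[] args) {
        // Message count: each of the 160 HotRod workers can have a put in
        // flight that fans out as OOB requests (factor 4, per the email),
        // and responses are OOB as well, hence the factor of 2.
        int hotRodWorkers = 160, fanOut = 4, requestPlusResponse = 2;
        System.out.println(hotRodWorkers * fanOut * requestPlusResponse); // 1280

        // Stack memory: with -Xss512k, 2000 threads reserve roughly 1 GB of
        // virtual memory (resident memory is typically lower).
        long stackKb = 512, threads = 2000;
        System.out.println(stackKb * threads / 1024 + " MB"); // 1000 MB
    }
}
```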
Re: [infinispan-dev] Threadpools in a large cluster
On 1 Feb 2013, at 09:39, Dan Berindei dan.berin...@gmail.com wrote: Radim, do these problems happen with the HotRod server, or only with memcached? HotRod requests handled by non-owners should be very rare; instead, the vast majority should be handled by the primary owner directly. So if this happens with HotRod, we should focus on fixing the HotRod routing instead of focusing on how to handle a large number of requests from non-owners. Well, even Hot Rod only optionally uses smart routing. Some client libraries don't have this capability. That being said, even if a HotRod put request is handled by the primary owner, it generates (numOwners - 1) extra OOB requests. So if you have 160 HotRod worker threads per node, you can expect 4 * 160 OOB messages per node. Multiply that by 2, because responses are OOB as well, and you can get 1280 OOB messages before you even start reusing any HotRod worker thread. Have you tried decreasing the number of HotRod workers? The thing is, our OOB thread pool can't use queueing, because we'd get a queue full of commit commands while all the OOB threads are waiting on keys that those commit commands would unlock. As the OOB thread pool is full, we discard messages, which I suspect slows things down quite a bit (especially if it's a credit request/response message). So it may well be that a lower number of HotRod worker threads would perform better. On the other hand, why is increasing the number of OOB threads not a solution? With -Xss 512k, you can get 2000 threads with only 1 GB of virtual memory (the actual used memory is probably even less, unless you're using huge pages). AFAIK the Linux kernel doesn't break a sweat with 10 threads running, so having 2000 threads just hanging around, waiting for a response, shouldn't be such a problem. During the team meeting in Palma I did chat with Bela (or was it a break-out session?) about moving Infinispan's request processing to another thread pool.
That would leave the OOB thread pool free to receive response messages, FD heartbeats, credit requests/responses etc. The downside, I guess, is that each request would have to be passed to another thread, and the context switch may slow things down a bit. But since the new thread pool would be in Infinispan, we could even do tricks like executing a commit/rollback directly on the OOB thread. Right. I always got the impression we were abusing the OOB pool. But in the end, I think it makes sense (in JGroups) to separate a service thread pool (for heartbeats, credits, etc.) and an application thread pool (what we'd use instead of OOB). This way you could even tune your service thread pool to have just, say, 2 threads, and the application thread pool to 1000 or whatever. In the end, I just didn't feel that working on this was justified, considering the number of critical bugs we had. But maybe now's the time to start experimenting… On Fri, Feb 1, 2013 at 10:04 AM, Radim Vansa rva...@redhat.com wrote: Hi guys, after dealing with a large cluster for a while, I find the way we use OOB threads in synchronous configurations non-robust. Imagine a situation where a node which is not an owner of the key calls PUT. Then an RPC is made to the primary owner of that key, which reroutes the request to all other owners and, after they reply, replies back. There are two problems: 1) If we simultaneously make X requests from non-owners to the primary owner, where X is the OOB TP size, all the OOB threads end up waiting for the responses and there is no thread left to process the OOB responses and release them. 2) Node A is the primary owner of keyA and a non-primary owner of keyB, and B is the primary of keyB and a non-primary of keyA. When we get many requests for both keyA and keyB from other nodes, all OOB threads on both nodes call RPCs to the non-primary owner, but there is no one left who could process the requests.
While we wait for the requests to time out, the nodes with depleted OOB threadpools start suspecting all other nodes because they can't receive heartbeats etc... You can say "increase your OOB TP size", but that's not always an option; I have currently set it to 1000 threads and it's not enough. In the end, I will always be limited by RAM, and something tells me that even nodes with a few gigs of RAM should be able to form a huge cluster. We use 160 HotRod worker threads in JDG, which means that 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests can be executed, and if 10% of them target the same node with its 1000 OOB threads, it gets stuck. It's about scaling and robustness. Not that I'd have any good solution, but I'd really like to start a discussion. Thinking about it a bit, the problem is that a blocking call (calling an RPC on the primary owner from the message handler) can block non-blocking calls (such as an RPC response, or a command that never sends any more messages).
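The pool split Manik describes can be sketched as a dispatcher that routes protocol traffic to a tiny dedicated pool and application requests to a large one, so application load can never starve failure detection or flow-control credits. The pool sizes, the message-type enum, and the class names below are all illustrative assumptions, not JGroups configuration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SplitPoolsSketch {
    enum MsgType { HEARTBEAT, CREDIT, APP_REQUEST }

    // Tiny pool reserved for protocol messages (heartbeats, credits),
    // matching Manik's "2 threads for the service pool" example.
    private final ExecutorService servicePool = Executors.newFixedThreadPool(2);
    // Large pool for Infinispan requests (what currently rides on OOB);
    // newFixedThreadPool creates its threads lazily, on demand.
    private final ExecutorService appPool = Executors.newFixedThreadPool(1000);

    void dispatch(MsgType type, Runnable task) {
        // Application requests may block; protocol messages never should,
        // so the two can safely run on separate pools.
        (type == MsgType.APP_REQUEST ? appPool : servicePool).execute(task);
    }

    public static void main(String[] args) throws Exception {
        SplitPoolsSketch p = new SplitPoolsSketch();
        CountDownLatch done = new CountDownLatch(2);
        p.dispatch(MsgType.HEARTBEAT, done::countDown);   // service pool
        p.dispatch(MsgType.APP_REQUEST, done::countDown); // application pool
        done.await();
        System.out.println("dispatched to both pools");
        p.servicePool.shutdown();
        p.appPool.shutdown();
    }
}
```

With this split, even if every application thread is stuck waiting on a remote lock, FD heartbeats still get a thread, which addresses the false-suspicion problem Radim reports.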
Re: [infinispan-dev] Threadpools in a large cluster
Hi, I had a similar problem when I tried GMU[1] in a large cluster (40 VMs), because the remote gets and the commit messages (I'm talking about ISPN commands) must wait for some conditions before being processed. I solved this problem by adding a feature in JGroups[2] that allows the request to be moved to another thread, releasing the OOB thread. The other thread will send the reply of the JGroups request. Of course, I'm only moving commands that I know can block. I can go into some detail if you want =) Cheers, Pedro [1] http://www.gsd.inesc-id.pt/~romanop/files/papers/icdcs12.pdf [2] I would like to talk with Bela about this, because it makes my life easier to support total order in ISPN. I'll try to send an email this weekend =) On 01-02-2013 08:04, Radim Vansa wrote: Hi guys, after dealing with a large cluster for a while, I find the way we use OOB threads in synchronous configurations non-robust. Imagine a situation where a node which is not an owner of the key calls PUT. Then an RPC is made to the primary owner of that key, which reroutes the request to all other owners and, after they reply, replies back. There are two problems: 1) If we simultaneously make X requests from non-owners to the primary owner, where X is the OOB TP size, all the OOB threads end up waiting for the responses and there is no thread left to process the OOB responses and release them. 2) Node A is the primary owner of keyA and a non-primary owner of keyB, and B is the primary of keyB and a non-primary of keyA. When we get many requests for both keyA and keyB from other nodes, all OOB threads on both nodes call RPCs to the non-primary owner, but there is no one left who could process the requests. While we wait for the requests to time out, the nodes with depleted OOB threadpools start suspecting all other nodes because they can't receive heartbeats etc... You can say "increase your OOB TP size", but that's not always an option; I have currently set it to 1000 threads and it's not enough.
In the end, I will always be limited by RAM, and something tells me that even nodes with a few gigs of RAM should be able to form a huge cluster. We use 160 HotRod worker threads in JDG, which means that 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests can be executed, and if 10% of them target the same node with its 1000 OOB threads, it gets stuck. It's about scaling and robustness. Not that I'd have any good solution, but I'd really like to start a discussion. Thinking about it a bit, the problem is that a blocking call (calling an RPC on the primary owner from the message handler) can block non-blocking calls (such as an RPC response, or a command that never sends any more messages). Therefore, a flag on the message saying "this won't send another message" could let the message be executed in a different threadpool, which would never be deadlocked. In fact, the pools could share the threads, but the non-blocking pool would always have a few threads spare. It's a bad solution, as maintaining which messages could block on the other node is really, really hard (we can be sure only in the case of RPC responses), especially once locks come into play. I will welcome anything better. Radim --- Radim Vansa Quality Assurance Engineer JBoss Datagrid tel. +420532294559 ext. 62559 Red Hat Czech, s.r.o. Brno, Purkyňova 99/71, PSČ 612 45 Czech Republic
Re: [infinispan-dev] Threadpools in a large cluster
On Fri, Feb 1, 2013 at 12:13 PM, Manik Surtani msurt...@redhat.com wrote: On 1 Feb 2013, at 09:39, Dan Berindei dan.berin...@gmail.com wrote: Radim, do these problems happen with the HotRod server, or only with memcached? HotRod requests handled by non-owners should be very rare; instead, the vast majority should be handled by the primary owner directly. So if this happens with HotRod, we should focus on fixing the HotRod routing instead of focusing on how to handle a large number of requests from non-owners. Well, even Hot Rod only optionally uses smart routing. Some client libraries don't have this capability. True, and I meant to say that with memcached it should be much worse, but at least in Radim's tests I hope smart routing is enabled. That being said, even if a HotRod put request is handled by the primary owner, it generates (numOwners - 1) extra OOB requests. So if you have 160 HotRod worker threads per node, you can expect 4 * 160 OOB messages per node. Multiply that by 2, because responses are OOB as well, and you can get 1280 OOB messages before you even start reusing any HotRod worker thread. Have you tried decreasing the number of HotRod workers? The thing is, our OOB thread pool can't use queueing, because we'd get a queue full of commit commands while all the OOB threads are waiting on keys that those commit commands would unlock. As the OOB thread pool is full, we discard messages, which I suspect slows things down quite a bit (especially if it's a credit request/response message). So it may well be that a lower number of HotRod worker threads would perform better. On the other hand, why is increasing the number of OOB threads not a solution? With -Xss 512k, you can get 2000 threads with only 1 GB of virtual memory (the actual used memory is probably even less, unless you're using huge pages).
AFAIK the Linux kernel doesn't break a sweat with 10 threads running, so having 2000 threads just hanging around, waiting for a response, shouldn't be such a problem. During the team meeting in Palma I did chat with Bela (or was it a break-out session?) about moving Infinispan's request processing to another thread pool. That would leave the OOB thread pool free to receive response messages, FD heartbeats, credit requests/responses etc. The downside, I guess, is that each request would have to be passed to another thread, and the context switch may slow things down a bit. But since the new thread pool would be in Infinispan, we could even do tricks like executing a commit/rollback directly on the OOB thread. Right. I always got the impression we were abusing the OOB pool. But in the end, I think it makes sense (in JGroups) to separate a service thread pool (for heartbeats, credits, etc.) and an application thread pool (what we'd use instead of OOB). This way you could even tune your service thread pool to have just, say, 2 threads, and the application thread pool to 1000 or whatever. A separate service pool would be good, but I think we could go further and treat ClusteredGet/Commit/Rollback commands the same way, because they can't block waiting for other commands to be processed. In the end, I just didn't feel that working on this was justified, considering the number of critical bugs we had. But maybe now's the time to start experimenting… On Fri, Feb 1, 2013 at 10:04 AM, Radim Vansa rva...@redhat.com wrote: Hi guys, after dealing with a large cluster for a while, I find the way we use OOB threads in synchronous configurations non-robust. Imagine a situation where a node which is not an owner of the key calls PUT. Then an RPC is made to the primary owner of that key, which reroutes the request to all other owners and, after they reply, replies back.
There are two problems: 1) If we simultaneously make X requests from non-owners to the primary owner, where X is the OOB TP size, all the OOB threads end up waiting for the responses and there is no thread left to process the OOB responses and release them. 2) Node A is the primary owner of keyA and a non-primary owner of keyB, and B is the primary of keyB and a non-primary of keyA. When we get many requests for both keyA and keyB from other nodes, all OOB threads on both nodes call RPCs to the non-primary owner, but there is no one left who could process the requests. While we wait for the requests to time out, the nodes with depleted OOB threadpools start suspecting all other nodes because they can't receive heartbeats etc... You can say "increase your OOB TP size", but that's not always an option; I have currently set it to 1000 threads and it's not enough. In the end, I will always be limited by RAM, and something tells me that even nodes with a few gigs of RAM should be able to form a huge cluster. We use 160 HotRod worker threads in JDG, which means that 160 * clusterSize = 10240 (64 nodes in my cluster) parallel requests can be executed, and if 10%