[ 
https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447553#comment-17447553
 ] 

Anthony Baker commented on GEODE-9764:
--------------------------------------

I would argue that for certain messages like replication of values, timeouts 
alone are insufficient.  To maintain consistency, we have to replicate the 
change or revert it. I think that implies the need for timeouts as well as 
failure detection improvements.

> Request-Response Messaging Should Time Out
> ------------------------------------------
>
>                 Key: GEODE-9764
>                 URL: https://issues.apache.org/jira/browse/GEODE-9764
>             Project: Geode
>          Issue Type: Improvement
>          Components: messaging
>            Reporter: Bill Burcham
>            Assignee: Bill Burcham
>            Priority: Major
>
> There is a weakness in the P2P/DirectChannel messaging architecture, in that 
> it never gives up on a request (in a request-response scenario). As a result 
> a bug (software fault) anywhere from the point where the requesting thread 
> hands off the {{DistributionMessage}} e.g. to 
> {{ClusterDistributionManager.putOutgoing(DistributionMessage)}}, to the point 
> where that request is ultimately fulfilled on a (one) receiver, can result in 
> a hang (of some task on the send side, which is waiting for a response).
> Well it's a little worse than that because any code in the return (response) 
> path can also cause disruption of the (response) flow, thereby leaving the 
> requesting task hanging.
> If the code in the request path (primarily in P2P messaging) and the code in 
> the response path (P2P messaging and TBD higher-level code) were perfect this 
> might not be a problem. But there is a fair amount of code there and we have 
> some evidence that it is currently not perfect, nor do we expect it to become 
> perfect and stay that way. That being the case it seems prudent to institute 
> response timeouts so that bugs of this sort (which disrupt request-response 
> message flow) don't result in hangs.
> It's TBD if we want to go a step further and institute retries. The latter 
> would entail introducing duplicate-suppression (conflation) in P2P messaging. 
> We might also add exponential backoff (open-loop) or back-pressure 
> (closed-loop) to prevent a flood of retries when the system is at or near the 
> point of thrashing.
> But even without retries, a configurable timeout might have good ROI as a 
> first step. This would entail:
> * adding a configuration parameter to specify the timeout value
> * changing {{ReplyProcessor21}} and others TBD to "give up" after the timeout 
> has elapsed
> * changing higher-level code dependent on request-reply messaging so it 
> properly handles the situations where we might have to "give up"
> This issue affects all versions of Geode.
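
As a rough illustration of the "give up after a configurable timeout" step 
listed in the description above, here is a minimal Java sketch. It is not 
Geode code: the class name TimedReplyWaiter, its methods, and the system 
property example.reply-timeout-ms are all hypothetical, and it does not use 
or reflect the actual ReplyProcessor21 API. It only shows the general shape 
of a bounded wait that surfaces a timeout to the caller instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a reply processor; NOT Geode's ReplyProcessor21. */
final class TimedReplyWaiter {
  // Hypothetical configuration parameter, read from a system property (default 5 s).
  static final long DEFAULT_TIMEOUT_MS = Long.getLong("example.reply-timeout-ms", 5_000L);

  private final CountDownLatch pendingReplies;

  TimedReplyWaiter(int expectedReplies) {
    this.pendingReplies = new CountDownLatch(expectedReplies);
  }

  /** Called by the receive path each time one reply arrives. */
  void replyReceived() {
    pendingReplies.countDown();
  }

  /** Blocks until all replies have arrived, or gives up after timeoutMs. */
  void waitForReplies(long timeoutMs) throws InterruptedException, TimeoutException {
    if (!pendingReplies.await(timeoutMs, TimeUnit.MILLISECONDS)) {
      throw new TimeoutException(
          pendingReplies.getCount() + " replies still outstanding after " + timeoutMs + " ms");
    }
  }
}
```

In this sketch the caller has to catch TimeoutException and decide what to do 
next, which corresponds to the third bullet: higher-level code must handle 
the "give up" case explicitly. For messages such as value replication, that 
handler is where the change would have to be either completed or reverted.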



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
