[
https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bill Burcham updated GEODE-9764:
--------------------------------
Attachment: image-2021-11-22-12-14-59-117.png
> Request-Response Messaging Should Time Out
> ------------------------------------------
>
> Key: GEODE-9764
> URL: https://issues.apache.org/jira/browse/GEODE-9764
> Project: Geode
> Issue Type: Improvement
> Components: messaging
> Reporter: Bill Burcham
> Assignee: Bill Burcham
> Priority: Major
> Attachments: image-2021-11-22-11-52-23-586.png,
> image-2021-11-22-12-14-59-117.png
>
>
> There is a weakness in the P2P/DirectChannel messaging architecture, in that
> it never gives up on a request (in a request-response scenario). As a result
> a bug (software fault) anywhere from the point where the requesting thread
> hands off the {{DistributionMessage}}, e.g. to
> {{ClusterDistributionManager.putOutgoing(DistributionMessage)}}, to the
> point where that request is ultimately fulfilled on a (one) receiver, can
> result in a hang (of some task on the send side, which is waiting for a
> response).
> Well, it's a little worse than that, because any code in the return (response)
> path can also disrupt the (response) flow, thereby leaving the
> requesting task hanging.
> If the code in the request path (primarily in P2P messaging) and the code in
> the response path (P2P messaging and TBD higher-level code) were perfect, this
> might not be a problem. But there is a fair amount of code there, and we have
> some evidence that it is currently not perfect, nor do we expect it to become
> perfect and stay that way. That being the case, it seems prudent to institute
> response timeouts so that bugs of this sort (which disrupt request-response
> message flow) don't result in hangs.
> It's TBD whether we want to go a step further and institute retries. Retries
> would entail introducing duplicate-suppression (conflation) in P2P messaging.
> We might also add exponential backoff (open-loop) or back-pressure
> (closed-loop) to prevent a flood of retries when the system is at or near the
> point of thrashing.
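
To make the retry idea above concrete, here is a minimal sketch of the two pieces it mentions: an open-loop exponential backoff schedule for the sender and duplicate suppression on the receiver. The class names (Backoff, DuplicateSuppressor), the string request IDs, and the base/cap parameters are all hypothetical and invented for this illustration; none of this is existing Geode code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: open-loop exponential backoff for retry attempts.
class Backoff {
  // Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
  static long delayMillis(int attempt, long baseMillis, long capMillis) {
    // Clamp the shift so the factor itself stays well within a long.
    long factor = 1L << Math.min(attempt, 30);
    // Compare before multiplying so the product cannot overflow.
    return baseMillis >= capMillis / factor ? capMillis : baseMillis * factor;
  }
}

// Hypothetical receiver-side filter: a retried request carries the same ID as
// the original, so the receiver can avoid processing it twice.
class DuplicateSuppressor {
  private final Set<String> seenRequestIds = ConcurrentHashMap.newKeySet();

  // Returns true the first time an ID is seen, false for any later copy.
  boolean firstDelivery(String requestId) {
    return seenRequestIds.add(requestId);
  }
}
```

With a 100 ms base and a 5000 ms cap, the schedule runs 100, 200, 400, ... and then stays at 5000. The set of seen IDs grows without bound in this sketch; a real design would need to expire old entries and decide what a duplicate should receive in reply.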
> But even without retries, a configurable timeout might have good ROI as a
> first step. This would entail:
> * adding a configuration parameter to specify the timeout value
> * changing ReplyProcessor21 and others TBD to "give up" after the timeout
> has elapsed
> * changing higher-level code dependent on request-reply messaging so it
> properly handles the situations where we might have to "give up"
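
The three steps above could look roughly like the following sketch. It is not ReplyProcessor21; TimedReplyWaiter, ReplyTimeoutException, the property name "example.reply-timeout-millis", and the 30-second default are all assumptions made up for illustration.

```java
import java.time.Duration;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical exception that higher-level code must handle when we give up.
class ReplyTimeoutException extends Exception {
  ReplyTimeoutException(String message) {
    super(message);
  }
}

// Hypothetical stand-in for a reply processor that waits for one reply.
class TimedReplyWaiter {
  // Hypothetical system property name; the default value is also an assumption.
  static final String TIMEOUT_PROPERTY = "example.reply-timeout-millis";

  private final CountDownLatch replyReceived = new CountDownLatch(1);
  private final Duration timeout;

  TimedReplyWaiter(Duration timeout) {
    this.timeout = timeout;
  }

  // Step 1: read the timeout from configuration.
  static Duration configuredTimeout() {
    return Duration.ofMillis(Long.getLong(TIMEOUT_PROPERTY, 30_000L));
  }

  // Called by the messaging layer when the reply arrives.
  void replyArrived() {
    replyReceived.countDown();
  }

  // Step 2: give up after the timeout instead of waiting forever.
  void waitForReply() throws InterruptedException, ReplyTimeoutException {
    if (!replyReceived.await(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
      throw new ReplyTimeoutException("no reply within " + timeout);
    }
  }
}
```

Step 3 is the caller's side: code that invokes waitForReply() has to catch ReplyTimeoutException and decide what giving up means for that operation, which is the part the sketch cannot show generically.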
> This issue affects all versions of Geode.
> h2. Counterpoint
> Not everybody thinks timeouts are a good idea. Here are some alternative ideas:
>
> Make the request-response primitive better: make it so that only bugs in our
> core messaging framework could cause a lack of response, rather than our
> current approach, where a bug in a class like “RemotePutMessage” could cause a
> lack of a response.
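
One way to read the counterpoint is that the framework, not each message class, should own the job of replying. A minimal sketch under that assumption follows; Reply, GuaranteedReplyDispatcher, and the handler/replySender parameters are hypothetical names, and this is not how Geode's message classes are structured today.

```java
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical reply: carries either a result or the failure that replaced it.
final class Reply {
  final String payload;     // null when the handler failed
  final Throwable failure;  // null when the handler succeeded

  private Reply(String payload, Throwable failure) {
    this.payload = payload;
    this.failure = failure;
  }

  static Reply success(String payload) {
    return new Reply(payload, null);
  }

  static Reply failure(Throwable failure) {
    return new Reply(null, failure);
  }
}

// Hypothetical framework-side dispatcher: message-specific code only computes
// a result; sending the reply happens here, even if the handler throws.
class GuaranteedReplyDispatcher {
  static <Q> void dispatch(Q request, Function<Q, String> handler,
      Consumer<Reply> replySender) {
    Reply reply;
    try {
      reply = Reply.success(handler.apply(request));
    } catch (RuntimeException e) {
      // A bug in the handler becomes an error reply rather than silence.
      // An Error (e.g. OutOfMemoryError) still propagates in this sketch.
      reply = Reply.failure(e);
    }
    replySender.accept(reply);
  }
}
```

With this shape, a handler that throws still results in exactly one reply, so the requester sees a failure instead of waiting indefinitely. A fault in replySender itself, or in the transport, would still leave the requester without a response.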