[
https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bill Burcham updated GEODE-9764:
--------------------------------
Description:
There is a weakness in the P2P/DirectChannel messaging architecture, in that it
never gives up on a request (in a request-response scenario). As a result a bug
(software fault) anywhere from the point where the requesting thread hands off
the {{DistributionMessage}} e.g. to
{{ClusterDistributionManager.putOutgoing(DistributionMessage)}}, to the
point where that request is ultimately fulfilled on a (one) receiver, can
result in a hang (of some task on the send side, which is waiting for a
response).
Well it's a little worse than that because any code in the return (response)
path can also cause disruption of the (response) flow, thereby leaving the
requesting task hanging.
If the code in the request path (primarily in P2P messaging) and the code in
the response path (P2P messaging and TBD higher-level code) were perfect this
might not be a problem. But there is a fair amount of code there and we have
some evidence that it is currently not perfect, nor do we expect it to become
perfect and stay that way. That being the case it seems prudent to institute
response timeouts so that bugs of this sort (which disrupt request-response
message flow) don't result in hangs.
It's TBD if we want to go a step further and institute retries. The latter
would entail introducing duplicate-suppression (conflation) in P2P messaging.
We might also add exponential backoff (open-loop) or back-pressure
(closed-loop) to prevent a flood of retries when the system is at or near the
point of thrashing.
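To make the retry idea concrete, here is a minimal sketch, not Geode code. It assumes each request carries a unique ID that is reused on every retry, so the receiver can suppress duplicates, and it uses a capped exponential delay for the open-loop backoff. The names {{RetrySketch}}, {{DedupReceiver}} and {{backoffMillis}} are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: these types are assumptions, not part of Geode. */
class RetrySketch {

  /** Receiver side: remembers request IDs so a retried request is applied once. */
  static class DedupReceiver {
    private final Set<Long> seenRequestIds = ConcurrentHashMap.newKeySet();
    final AtomicInteger applied = new AtomicInteger();

    /** Returns true if the request was applied, false if it was a duplicate. */
    boolean handle(long requestId) {
      if (!seenRequestIds.add(requestId)) {
        return false; // a retry of a request we already processed
      }
      applied.incrementAndGet();
      return true;
    }
  }

  /** Open-loop backoff: the delay doubles with each attempt, capped at maxMillis. */
  static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
    int shift = Math.min(Math.max(attempt, 0), 20); // bound the shift to avoid overflow
    return Math.min(baseMillis << shift, maxMillis);
  }
}
```

A real implementation would also have to bound the set of remembered IDs and probably resend the original response for a duplicate; both are left out here.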
But even without retries, a configurable timeout might have good ROI as a first
step. This would entail:
* adding a configuration parameter to specify the timeout value
* changing ReplyProcessor21 and others TBD to "give up" after the timeout has
elapsed
* changing higher-level code dependent on request-reply messaging so it
properly handles the situations where we might have to "give up"
This issue affects all versions of Geode.
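The three steps listed above could look roughly like the following sketch. It is a simplified stand-in, not the actual {{ReplyProcessor21}}: the class {{TimedReplyWaiter}} and the property name {{example.reply-timeout-millis}} are assumptions made up for illustration. The point is only that the wait is bounded and that "giving up" surfaces as an exception the caller is forced to handle.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a reply processor; this is not Geode's ReplyProcessor21. */
class TimedReplyWaiter {
  /** Assumed property name, for illustration only; 30 seconds is an arbitrary default. */
  static final long DEFAULT_TIMEOUT_MILLIS =
      Long.getLong("example.reply-timeout-millis", 30_000L);

  private final CountDownLatch replyReceived = new CountDownLatch(1);
  private final long timeoutMillis;

  TimedReplyWaiter(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Called by the messaging layer (assumed) when the response arrives. */
  void onReply() {
    replyReceived.countDown();
  }

  /** Blocks until the reply arrives, or gives up once the timeout has elapsed. */
  void waitForReply() throws InterruptedException, TimeoutException {
    if (!replyReceived.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
      throw new TimeoutException("no reply within " + timeoutMillis + " ms");
    }
  }
}
```

Using a checked exception here is one possible design choice: it makes every caller of the request-reply path decide explicitly what to do when the response never comes.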
h2. Counterpoint
Not everybody thinks timeouts are a good idea. Here are some alternative ideas:
Make the request-response primitive better, so that only bugs in our core
messaging framework could cause a lack of response, rather than our current
approach where a bug in a class like {{RemotePutMessage}} could cause a lack of a
response.
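One way to read this alternative, as a rough sketch under assumptions: the framework, not the message class, owns the sending of the reply, so a fault in message-specific code turns into an error reply instead of silence. {{GuaranteedReplyDispatcher}} and {{Reply}} are hypothetical names and do not correspond to existing Geode classes.

```java
import java.util.function.Consumer;
import java.util.function.Function;

/** Illustrative only: a dispatcher that sends exactly one reply per request. */
class GuaranteedReplyDispatcher {

  /** Hypothetical reply envelope: a result, or the failure that prevented one. */
  static final class Reply {
    final Object result;
    final RuntimeException failure;

    Reply(Object result, RuntimeException failure) {
      this.result = result;
      this.failure = failure;
    }
  }

  /** Runs the message-specific handler and always hands one Reply to replySender. */
  static void dispatch(Object request,
                       Function<Object, Object> handler,
                       Consumer<Reply> replySender) {
    Reply reply;
    try {
      reply = new Reply(handler.apply(request), null);
    } catch (RuntimeException e) {
      // A bug in the handler becomes an error reply rather than a missing reply.
      reply = new Reply(null, e);
    }
    replySender.accept(reply);
  }
}
```

With this shape, only a fault in {{dispatch}} itself or in the reply transport can leave the requester without a response. The sketch does not cover errors that are not {{RuntimeException}}s, or a handler that never returns.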
was:
There is a weakness in the P2P/DirectChannel messaging architecture, in that it
never gives up on a request (in a request-response scenario). As a result a bug
(software fault) anywhere from the point where the requesting thread hands off
the {{DistributionMessage}} e.g. to
{{ClusterDistributionManager.putOutgoing(DistributionMessage)}}, to the point
where that request is ultimately fulfilled on a (one) receiver, can result in a
hang (of some task on the send side, which is waiting for a response).
Well it's a little worse than that because any code in the return (response)
path can also cause disruption of the (response) flow, thereby leaving the
requesting task hanging.
If the code in the request path (primarily in P2P messaging) and the code in
the response path (P2P messaging and TBD higher-level code) were perfect this
might not be a problem. But there is a fair amount of code there and we have
some evidence that it is currently not perfect, nor do we expect it to become
perfect and stay that way. That being the case it seems prudent to institute
response timeouts so that bugs of this sort (which disrupt request-response
message flow) don't result in hangs.
It's TBD if we want to go a step further and institute retries. The latter
would entail introducing duplicate-suppression (conflation) in P2P messaging.
We might also add exponential backoff (open-loop) or back-pressure
(closed-loop) to prevent a flood of retries when the system is at or near the
point of thrashing.
But even without retries, a configurable timeout might have good ROI as a first
step. This would entail:
* adding a configuration parameter to specify the timeout value
* changing ReplyProcessor21 and others TBD to "give up" after the timeout has
elapsed
* changing higher-level code dependent on request-reply messaging so it
properly handles the situations where we might have to "give up"
This issue affects all versions of Geode.
> Request-Response Messaging Should Time Out
> ------------------------------------------
>
> Key: GEODE-9764
> URL: https://issues.apache.org/jira/browse/GEODE-9764
> Project: Geode
> Issue Type: Improvement
> Components: messaging
> Reporter: Bill Burcham
> Assignee: Bill Burcham
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)