[ 
https://issues.apache.org/jira/browse/GEODE-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-9764:
--------------------------------
    Description: 
There is a weakness in the P2P/DirectChannel messaging architecture: it never 
gives up on a request (in a request-response scenario). As a result, a bug 
(software fault) anywhere from the point where the requesting thread hands off 
the {{DistributionMessage}}, e.g. to 
{{ClusterDistributionManager.putOutgoing(DistributionMessage)}}, to the point 
where that request is ultimately fulfilled on a (single) receiver can result in 
a hang: some task on the send side waits indefinitely for a response.
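
To make the failure mode concrete, here is a minimal, self-contained sketch of 
the pattern (every name below is a hypothetical stand-in, not the actual Geode 
request-reply code): the sender hands the request off and then waits for the 
reply with no bound, so a fault anywhere in the path leaves it waiting forever. 
Run as written, this program hangs, which is the point.

{code:java}
import java.util.concurrent.CountDownLatch;

/**
 * Illustrative sketch only: every name here is a hypothetical stand-in for the
 * Geode request-reply machinery (putOutgoing / ReplyProcessor21), not product
 * code. It shows the hang: the sender's wait for the reply is unbounded.
 */
public class UnboundedWaitSketch {
  public static void main(String[] args) throws InterruptedException {
    CountDownLatch replyArrived = new CountDownLatch(1);

    // Receiver side: if a fault occurs anywhere before this counts down
    // (request lost, processing throws, reply lost), the sender never wakes up.
    Thread receiver = new Thread(() -> {
      boolean faultSomewhereInThePath = true; // simulate a bug / dropped message
      if (!faultSomewhereInThePath) {
        replyArrived.countDown();             // the "reply"
      }
    });
    receiver.start();

    // Sender side: analogous to registering a reply processor, handing the
    // message to the transport, then waiting for replies with no timeout.
    replyArrived.await();                     // hangs forever when the fault occurs
    System.out.println("got reply");          // not reached in the faulty case
  }
}
{code}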

It's actually a little worse than that, because any code in the return 
(response) path can also disrupt the response flow, leaving the requesting task 
hanging.

If the code in the request path (primarily in P2P messaging) and the code in 
the response path (P2P messaging and TBD higher-level code) were perfect, this 
might not be a problem. But there is a fair amount of code there, and we have 
some evidence that it is currently not perfect, nor do we expect it to become 
perfect and stay that way.

This is a sketch of the situation. The left-most column is the request path on 
the originating member. The middle column is the server side of the 
request-response path. And the right-most column is the response path back on 
the originating member.

!image-2021-11-22-12-14-59-117.png!

You can see that Geode product code, JDK code, and hardware components all lie 
in the end-to-end request-response messaging path.

That being the case, it seems prudent to institute response timeouts so that 
bugs of this sort (which disrupt request-response message flow) don't result in 
hangs.

It's TBD whether we want to go a step further and institute retries. Retries 
would entail introducing duplicate-suppression (conflation) in P2P messaging. 
We might also add exponential backoff (open-loop) or back-pressure 
(closed-loop) to prevent a flood of retries when the system is at or near the 
point of thrashing.
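
For illustration only, here is a rough sketch of what receiver-side 
duplicate-suppression could look like if retries were added; the class, method 
names, and id scheme are assumptions for this example, not the existing P2P 
code.

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of receiver-side duplicate suppression, which retries
 * would require: each request carries a unique id, the receiver applies it at
 * most once, and duplicates just get the reply re-sent. Names and structure
 * are assumptions for this example, not the existing P2P implementation.
 */
public class DuplicateSuppressionSketch {
  private final Set<Long> appliedRequestIds = ConcurrentHashMap.newKeySet();

  void onRequest(long requestId) {
    if (appliedRequestIds.add(requestId)) {
      applyOnce(requestId);  // first delivery: perform the operation
    }
    sendReply(requestId);    // always (re)send the reply, even for a duplicate
  }

  private void applyOnce(long requestId) { /* perform the operation exactly once */ }

  private void sendReply(long requestId) { /* acknowledge to the sender */ }
}
{code}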

But even without retries, a configurable timeout might have good ROI as a first 
step. This would entail:
 * adding a configuration parameter to specify the timeout value
 * changing {{ReplyProcessor21}} and others TBD to "give up" after the timeout 
has elapsed (a sketch of this follows the list)
 * changing higher-level code dependent on request-reply messaging so that it 
properly handles the situations where we might have to "give up"
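
As a sketch of the second bullet above, the unbounded wait would become a 
bounded one that surfaces a timeout to callers. The configuration property name 
used below is invented for illustration and is not an existing Geode parameter.

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Hypothetical sketch of "give up after a configurable timeout". The property
 * name "geode.p2p-reply-wait-timeout-ms" is invented for illustration; it is
 * not an existing Geode configuration parameter.
 */
public class BoundedReplyWaitSketch {
  // Proposed configuration parameter (name and default are illustrative only).
  private static final long REPLY_TIMEOUT_MS =
      Long.getLong("geode.p2p-reply-wait-timeout-ms", 60_000L);

  static void waitForReply(CountDownLatch replyArrived)
      throws InterruptedException, TimeoutException {
    // Instead of an unbounded await(), bound the wait and surface a timeout
    // so higher-level code can decide how to "give up" cleanly.
    if (!replyArrived.await(REPLY_TIMEOUT_MS, TimeUnit.MILLISECONDS)) {
      throw new TimeoutException("no reply within " + REPLY_TIMEOUT_MS + " ms; giving up");
    }
  }
}
{code}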

This issue affects all versions of Geode.
h2. Counterpoint

Not everybody thinks timeouts are a good idea. This section summarizes the main 
objections.
h3. Timeouts Will Result in Data Inconsistency

If we leave most of the surrounding code as-is and introduce timeouts, then we 
risk data inconsistency. TODO: describe in detail why data inconsistency is 
_inherent_ in using timeouts.
h3. Narrow The Vulnerability Cross-Section Without Timeouts

The proposal (above) seeks to solve the problem using end-to-end timeouts, 
since any component in the path can, in general, have faults. An alternative 
approach would be to assume that _some_ of the components can be made "good 
enough" (without adding timeouts) and that those "good enough" components can 
protect themselves (and user applications) from faults in the remaining 
components.

With this approach, the Cluster Distribution Manager and the P2P / TCP Conduit 
/ Direct Channel framework would be enhanced so that they are less susceptible 
to bugs in:
 * the 341 Distribution Message classes
 * the 68 Reply Message classes
 * the 95 Reply Processor classes

The question is what form that enhancement would take, and whether it would be 
sufficient to overcome faults in the remaining components (the JDK and the 
host+network layers).
h2. Alternatives Discussed

These alternatives have been discussed, to varying degrees.

 * Baseline: no timeouts; members waiting for replies do "the right thing" if 
the recipient departs the view
 * Give-up-after-timeout
 * Retry-after-timeout-and-eventually-give-up (sketched after this list)
 * Retry-after-forcing-receiver-out-of-view
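
For the retry-after-timeout-and-eventually-give-up alternative, a rough sketch 
follows; the per-attempt helper, backoff values, and attempt limit are all 
illustrative assumptions, not a worked-out design.

{code:java}
import java.util.concurrent.TimeoutException;

/**
 * Hypothetical sketch of "Retry-after-timeout-and-eventually-give-up": bounded
 * attempts with exponential (open-loop) backoff between them. The per-attempt
 * helper, backoff values, and attempt limit are illustrative assumptions.
 */
public class RetryWithBackoffSketch {
  static void requestWithRetries() throws InterruptedException, TimeoutException {
    long backoffMs = 100;          // initial backoff (illustrative)
    final int maxAttempts = 5;     // eventual give-up bound (illustrative)

    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (attemptRequestReply()) {
        return;                    // reply arrived; done
      }
      if (attempt < maxAttempts) {
        Thread.sleep(backoffMs);   // wait before retrying
        backoffMs *= 2;            // exponential growth
      }
    }
    throw new TimeoutException("gave up after " + maxAttempts + " attempts");
  }

  /** One timed request-reply round trip (stubbed: pretends the reply never arrives). */
  private static boolean attemptRequestReply() {
    return false;
  }
}
{code}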

> Request-Response Messaging Should Time Out
> ------------------------------------------
>
>                 Key: GEODE-9764
>                 URL: https://issues.apache.org/jira/browse/GEODE-9764
>             Project: Geode
>          Issue Type: Improvement
>          Components: messaging
>            Reporter: Bill Burcham
>            Assignee: Bill Burcham
>            Priority: Major
>         Attachments: image-2021-11-22-11-52-23-586.png, 
> image-2021-11-22-12-14-59-117.png
>


