[ https://issues.apache.org/jira/browse/KUDU-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234436#comment-15234436 ]
Todd Lipcon commented on KUDU-1409:
-----------------------------------
I'm thinking of the following strategy:
- given a 5 second timeout, we set the libev timer for a slightly shorter
value, like 4.8 seconds
- when that timer fires, we reset it for the remaining 200ms
- only when the second timer fires do we actually consider the call failed
The idea here is that, if the process got paused, then the first "pre-timeout"
timer will get arbitrarily delayed. Then, when we wake up, we'll give the call
an extra 200ms to read its response off the wire, in case the response is in
fact already waiting. If there was no process pause, we pay the "cost" of an
extra libev wakeup, but timeouts are rare, so this shouldn't really matter. We
might also be giving up a slight amount of accuracy on timeouts, but for long
timeouts that shouldn't be important (they're usually chosen rather
arbitrarily).
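To make the pre-timeout idea concrete, here is a minimal, hypothetical C++ model of the two-phase scheme, not Kudu's actual code. It does not use libev; it simulates one event loop with a made-up scenario (call timeout, the time the response reaches the socket, and a single client-side pause) and assumes the worst case in which an expired timer is handled before a readable socket within the same loop iteration. A grace of 0 models a plain single timer. All names here (Scenario, Simulate, NextRunnable, Result, Outcome) are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

enum class Outcome { kSuccess, kTimedOut };

struct Result {
  Outcome outcome;
  Millis at;  // simulated time at which the call completed or failed
};

// Hypothetical scenario model (not Kudu code). The process is frozen during
// [pause_start, pause_start + pause_len); the response reaches the socket at
// response_at whether or not the process is running.
struct Scenario {
  Millis timeout;
  Millis response_at;
  Millis pause_start;
  Millis pause_len;
};

// Earliest time >= t at which the process is actually running.
Millis NextRunnable(const Scenario& s, Millis t) {
  if (t >= s.pause_start && t < s.pause_start + s.pause_len) {
    return s.pause_start + s.pause_len;
  }
  return t;
}

// grace == 0 models a single timer armed for the full timeout.
// grace > 0 models the two-phase scheme: arm for (timeout - grace), and when
// that "pre-timeout" fires, rearm once for grace before failing the call.
Result Simulate(const Scenario& s, Millis grace) {
  assert(s.timeout > Millis(0));
  grace = std::min(grace, s.timeout);
  Millis deadline = s.timeout - grace;
  int expiries_left = grace > Millis(0) ? 2 : 1;
  while (true) {
    // The loop wakes when the earliest event is due, or later if paused.
    Millis now = NextRunnable(s, std::min(deadline, s.response_at));
    // Assumed worst case: expired timers run before socket reads.
    if (deadline <= now) {
      if (--expiries_left == 0) return {Outcome::kTimedOut, now};
      deadline = now + grace;  // pre-timeout fired: allow a short grace period
    }
    if (s.response_at <= now) return {Outcome::kSuccess, now};
  }
}
```

With a 5 second timeout, a response landing at 4.5 s, and a client pause from 4 s to 7 s, the single-timer model reports a timeout at 7000 ms, while the two-phase model with a 200ms grace reads the response at 7000 ms instead. With no pause and no response at all, the two-phase model still fails the call at 5000 ms, so the extra wakeup only changes the outcome when the pre-timeout itself was delayed.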
Any other good ideas here?
> Make krpc call timeouts more resistant to process pauses
> --------------------------------------------------------
>
> Key: KUDU-1409
> URL: https://issues.apache.org/jira/browse/KUDU-1409
> Project: Kudu
> Issue Type: Improvement
> Components: rpc
> Affects Versions: 0.8.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
>
> In stress testing Impala on Kudu I've seen various RPC timeouts that turn out
> to be due to pauses on the client side. In particular, scenarios like
> https://issues.cloudera.org/browse/IMPALA-2800 can cause the memory allocator
> inside Impala to block for several seconds, and that might cause us to think
> we missed a timeout.
> We should be more resilient to this sort of "false" timeout.