[
https://issues.apache.org/jira/browse/KUDU-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18003953#comment-18003953
]
Alexey Serbin commented on KUDU-3668:
-------------------------------------
Thanks a lot for adding the test!
I briefly looked at the test, and it looks reasonable. Does it fail as expected
in your test environment? I don't have much time to look at this more closely
this week, but I may have some spare time later on. Meanwhile, if you
have some time to find the bug and put together a fix, that would be great.
Contributions are very welcome! :)
BTW, I saw there was a modification to client-test.cc. Does it mean there was
an attempt to try a similar scenario with the C++ client as well? If so, what
was the outcome?
> Kudu Client Write Retry Behavior After Timeout
> ----------------------------------------------
>
> Key: KUDU-3668
> URL: https://issues.apache.org/jira/browse/KUDU-3668
> Project: Kudu
> Issue Type: Bug
> Components: client
> Reporter: shen yushi
> Priority: Major
>
> *Issue Description*
> The Kudu client continues internal retries of write requests _after_
> returning a timeout status to the caller. In our implementation, callers may
> initiate new write operations upon receiving this timeout status. This can
> cause overlapping writes if the client's internal retry completes
> concurrently, potentially resulting in data duplication or conflicts.
> *Technical Analysis*
> Based on our investigation (referencing
> [gerrit.cloudera.org/c/12338|https://gerrit.cloudera.org/c/12338]):
> # Each KuduRpc object contains an RpcTimeoutTask.
> # When a tablet server stalls (simulated by suspending the process with gdb),
> this task fires: it invokes the RPC's error callback and returns a timeout
> status to the caller.
> # Meanwhile, the Connection object retains the original KuduRpc.
> # Upon tablet server recovery, the connection may receive exceptions/errors,
> and AsyncKuduClient.handleRetryableError then initiates internal retries.
> *Environment*
> * Kudu Version: 1.15.0
> *Key Question*
> Is this dual-retry behavior (caller + client) an intentional design? We found
> no existing issues addressing this scenario in the community. We'd appreciate
> your insights on:
> * Whether this constitutes a bug
> * Recommended solutions or configuration adjustments
> We're committed to providing additional details and contributing fixes if
> needed. Thank you for your time and expertise!
>
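To make the sequence in the Technical Analysis above easier to discuss, here is
a minimal, self-contained Java model of the race. It does not use the Kudu
client at all: FakeRpc, FakeServer, onTimeout, onRetryableError and the "guard"
flag are hypothetical names invented for this sketch, and the guard is only one
possible way to suppress the late retry, not a description of how Kudu behaves
or how a real fix would look.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class TimeoutRetryRace {
    /** Hypothetical stand-in for a write RPC; NOT the real KuduRpc class. */
    static final class FakeRpc {
        final String name;
        // Set once the outcome (here: a timeout) has been reported to the caller.
        final AtomicBoolean completed = new AtomicBoolean(false);
        FakeRpc(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for a tablet server; records the writes it applies. */
    static final class FakeServer {
        final List<String> applied = new ArrayList<>();
        void apply(FakeRpc rpc) { applied.add(rpc.name); }
    }

    /** Models the timeout task: reports the timeout to the caller at most once. */
    static boolean onTimeout(FakeRpc rpc) {
        return rpc.completed.compareAndSet(false, true);
    }

    /**
     * Models the retry path that runs when the connection sees an error.
     * With guard == true (an assumption of this sketch), an RPC whose outcome
     * was already reported to the caller is not re-sent.
     */
    static void onRetryableError(FakeRpc rpc, FakeServer server, boolean guard) {
        if (guard && rpc.completed.get()) {
            return; // caller already saw the timeout; drop the late retry
        }
        server.apply(rpc);
    }

    static List<String> simulate(boolean guard) {
        FakeServer server = new FakeServer();
        FakeRpc original = new FakeRpc("write#1");

        // Step 2: the server stalls, so the timeout task fires first.
        boolean callerSawTimeout = onTimeout(original);

        // The caller reacts to the timeout by issuing a new write.
        if (callerSawTimeout) {
            server.apply(new FakeRpc("write#1-resubmitted"));
        }

        // Steps 3-4: the connection still holds the original RPC, and the
        // retry path runs once the server recovers.
        onRetryableError(original, server, guard);
        return server.applied;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + simulate(false));
        System.out.println("guarded:   " + simulate(true));
    }
}
```

In the unguarded run the server applies both the caller's resubmitted write and
the late internal retry, which is the overlap described in the issue. In the
guarded run only the resubmitted write is applied.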