[jira] [Commented] (KUDU-3668) Kudu Client Write Retry Behavior After Timeout

Alexey Serbin (Jira) Tue, 08 Jul 2025 18:58:08 -0700


    [ 
https://issues.apache.org/jira/browse/KUDU-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18003953#comment-18003953
 ]


Alexey Serbin commented on KUDU-3668:
-------------------------------------

Thanks a lot for adding the test!

I briefly looked at the test and it looks reasonable.  Does it fail as expected 
in your test environment?  I don't have much time to look at this more 
closely this week, but I may have some spare time later on.  Meanwhile, if you 
have some time to find the bug and put together a fix, that would be great.  
Contributions are very welcome!  :)

BTW, I saw there was a modification to client-test.cc.  Does that mean there 
was an attempt to try a similar scenario with the C++ client as well?  If so, 
what was the outcome?


> Kudu Client Write Retry Behavior After Timeout
> ----------------------------------------------
>
>                 Key: KUDU-3668
>                 URL: https://issues.apache.org/jira/browse/KUDU-3668
>             Project: Kudu
>          Issue Type: Bug
>          Components: client
>            Reporter: shen yushi
>            Priority: Major
>
> *Issue Description*
> The Kudu client continues internal retries of write requests _after_ 
> returning a timeout status to the caller. In our implementation, callers may 
> initiate new write operations upon receiving this timeout status. This can 
> cause overlapping writes if the client's internal retry completes 
> concurrently, potentially resulting in data duplication or conflicts.
> *Technical Analysis*
> Based on our investigation (referencing 
> [gerrit.cloudera.org/c/12338|https://gerrit.cloudera.org/c/12338]):
>  # Each KuduRpc object contains an RpcTimeoutTask.
>  # When a tablet server stalls (simulated via gdb process suspension), this 
> task fires: it invokes the RPC's error callback and returns a timeout status 
> to the caller.
>  # Meanwhile, the Connection object retains the original KuduRpc.
>  # Upon tablet server recovery, the connection may receive exceptions/errors, 
> and AsyncKuduClient.handleRetryableError initiates internal retries.
> *Environment*
>  * Kudu Version: 1.15.0
> *Key Question*
> Is this dual-retry behavior (caller + client) an intentional design? We found 
> no existing issues addressing this scenario in the community. We'd appreciate 
> your insights on:
>  * Whether this constitutes a bug
>  * Recommended solutions or configuration adjustments
> We're committed to providing additional details and contributing fixes if 
> needed. Thank you for your time and expertise!
>  
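
For readers following the thread, the sequence in the Technical Analysis above 
can be modelled with a small, self-contained Java sketch.  It is not Kudu code: 
SketchRpc, onTimeout, onRetryableError and the guardRetries flag are all 
hypothetical names made up for illustration, and the "guard" shown is only one 
possible way to suppress a retry once the caller has already seen a timeout. 
Whether it matches the actual root cause in the client is not confirmed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of the reported sequence; none of these names are Kudu classes. */
public class RetryAfterTimeoutSketch {

    /** Stand-in for an in-flight write RPC that a connection still references. */
    static final class SketchRpc {
        // Flipped once the caller has been handed a final status.
        private final AtomicBoolean completed = new AtomicBoolean(false);
        // Records what happened, in order, so the behavior can be inspected.
        final List<String> events = new ArrayList<>();
        // When true, retries are dropped after the caller has been answered.
        private final boolean guardRetries;

        SketchRpc(boolean guardRetries) {
            this.guardRetries = guardRetries;
        }

        /** Step 2 of the analysis: the timeout task fires and the caller sees a timeout. */
        void onTimeout() {
            if (completed.compareAndSet(false, true)) {
                events.add("timeout-returned-to-caller");
            }
        }

        /** Step 4 of the analysis: after recovery, the connection reports a retryable error. */
        void onRetryableError() {
            if (guardRetries && completed.get()) {
                // Assumed mitigation: do not resend a write the caller already gave up on.
                events.add("retry-suppressed");
                return;
            }
            events.add("write-resent");
        }
    }

    /** Runs the timeout-then-recovery sequence and returns the recorded events. */
    static List<String> simulate(boolean guardRetries) {
        SketchRpc rpc = new SketchRpc(guardRetries);
        rpc.onTimeout();          // tablet server stalled, timeout task fires
        rpc.onRetryableError();   // tablet server recovers, connection surfaces an error
        return rpc.events;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + simulate(false));
        System.out.println("guarded:   " + simulate(true));
    }
}
```

In the unguarded run the write is resent after the caller has already been 
told about the timeout, which is the overlap the reporter describes.  In the 
guarded run the late retry is dropped instead.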



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
