[ 
https://issues.apache.org/jira/browse/KUDU-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17976550#comment-17976550
 ] 

shen yushi commented on KUDU-3668:
----------------------------------

I've added a unit test in my fork of the latest master branch:
*[https://github.com/small-turtle-1/kudu/tree/client_retry_bug]*

The test introduces a new flag {{writer_inject_latency_reject_first_ms}} that changes how write requests are handled:
 # Among identical write requests, the first one is deliberately delayed and then rejected: a {{ServiceUnavailable}} status is returned to the client.
 # Subsequent requests are processed normally.
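
The flag's behaviour can be modelled roughly as follows. This is a minimal Python sketch, not the code in the fork: {{FaultInjectingWriter}}, {{write}}, and the use of a request id to recognise identical requests are hypothetical, the flag value is assumed to be the delay in milliseconds, and the delay is reported rather than actually slept.

```python
from dataclasses import dataclass, field


@dataclass
class FaultInjectingWriter:
    """Toy model of a tserver started with writer_inject_latency_reject_first_ms.

    All names are hypothetical. Identical requests are assumed to share a
    request id, and the flag value is assumed to be the delay in milliseconds.
    """
    inject_latency_ms: int
    seen: set = field(default_factory=set)
    applied: list = field(default_factory=list)

    def write(self, request_id, row):
        """Return (status, latency_ms) for one write attempt."""
        if self.inject_latency_ms > 0 and request_id not in self.seen:
            # First attempt of this request: delay, then reject without applying.
            self.seen.add(request_id)
            return "ServiceUnavailable", self.inject_latency_ms
        # Any later attempt of the same request is processed normally.
        self.applied.append(row)
        return "OK", 0


writer = FaultInjectingWriter(inject_latency_ms=1000)
first = writer.write("req-1", {"key": 1})
second = writer.write("req-1", {"key": 1})
print(first, second, writer.applied)
```

In this model the row is applied only on the second attempt, so a write that shows up on the server can only have come from a retry.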

Test scenario:
 # Launch the tablet servers with this flag set.
 # Open a client session configured with a short timeout.
 # Because of the timeout, the Kudu client returns a {{TimedOut}} status to the caller.
 # Wait long enough for the tserver to respond, allowing for potential erroneous client retries.
 # Check whether the write request was accepted (expected: it was not).
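
The timeline above can be replayed with a virtual clock to show the two possible outcomes of the final check. This is only a sketch under stated assumptions: {{run_scenario}} and its parameters are hypothetical, the tserver is reduced to "reject the first attempt after a delay, accept the rest", and nothing here calls the real Kudu client. Which branch the real client takes is what the unit test is meant to establish.

```python
def run_scenario(client_timeout_ms, server_delay_ms, retry_after_deadline):
    """Replay the test timeline with a virtual clock (no real sleeping).

    retry_after_deadline=True models the behaviour reported in KUDU-3668;
    False models a client that stops retrying once it has reported TimedOut.
    All names are hypothetical.
    """
    applied_rows = []
    attempts = 0

    def tserver_write(row):
        # Simplified tserver: the first attempt is delayed and rejected,
        # later attempts are applied immediately.
        nonlocal attempts
        attempts += 1
        if attempts == 1:
            return "ServiceUnavailable", server_delay_ms
        applied_rows.append(row)
        return "OK", 0

    row = {"key": 1}
    now_ms = 0

    # The first attempt stalls on the server; if that exceeds the session
    # timeout, the caller has already been told TimedOut.
    status, latency_ms = tserver_write(row)
    now_ms += latency_ms
    timed_out = now_ms >= client_timeout_ms
    caller_status = "TimedOut" if timed_out else status

    # The delayed ServiceUnavailable reply is retryable. The open question
    # is whether the client still retries after the deadline has passed.
    while status == "ServiceUnavailable":
        if timed_out and not retry_after_deadline:
            break
        status, latency_ms = tserver_write(row)
        now_ms += latency_ms

    if not timed_out:
        caller_status = status
    return caller_status, applied_rows


# Reported behaviour: the caller sees TimedOut, yet the row is written.
print(run_scenario(500, 1000, retry_after_deadline=True))
# Expected behaviour: the caller sees TimedOut and nothing is written.
print(run_scenario(500, 1000, retry_after_deadline=False))
```

The final check in the scenario corresponds to inspecting the second element of the returned tuple: a non-empty list after a {{TimedOut}} status means the write was applied by a retry the caller never asked for.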

> Kudu Client Write Retry Behavior After Timeout
> ----------------------------------------------
>
>                 Key: KUDU-3668
>                 URL: https://issues.apache.org/jira/browse/KUDU-3668
>             Project: Kudu
>          Issue Type: Bug
>          Components: client
>            Reporter: shen yushi
>            Priority: Major
>
> *Issue Description*
> The Kudu client continues internal retries of write requests _after_ 
> returning a timeout status to the caller. In our implementation, callers may 
> initiate new write operations upon receiving this timeout status. This can 
> cause overlapping writes if the client's internal retry completes 
> concurrently, potentially resulting in data duplication or conflicts.
> *Technical Analysis*
> Based on our investigation (referencing 
> [gerrit.cloudera.org/c/12338|https://gerrit.cloudera.org/c/12338]):
>  # Each KuduRpc object contains an RpcTimeoutTask.
>  # When a tablet server stalls (simulated by suspending the process with gdb), this 
> task fires: it invokes the RPC's error callback and returns a timeout status to 
> the caller.
>  # Meanwhile, the Connection object retains the original KuduRpc.
>  # Upon tablet server recovery, the connection may receive exceptions/errors, 
> and AsyncKuduClient.handleRetryableError then initiates internal retries.
> *Environment*
>  * Kudu Version: 1.15.0
> *Key Question*
> Is this dual-retry behavior (caller + client) an intentional design? We found 
> no existing issues addressing this scenario in the community. We'd appreciate 
> your insights on:
>  * Whether this constitutes a bug
>  * Recommended solutions or configuration adjustments
> We're committed to providing additional details and contributing fixes if 
> needed. Thank you for your time and expertise!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
