shen yushi created KUDU-3668:
--------------------------------

             Summary: Kudu Client Write Retry Behavior After Timeout
                 Key: KUDU-3668
                 URL: https://issues.apache.org/jira/browse/KUDU-3668
             Project: Kudu
          Issue Type: Bug
          Components: client
            Reporter: shen yushi


*Issue Description*
The Kudu client continues internal retries of write requests _after_ returning 
a timeout status to the caller. In our implementation, callers may initiate new 
write operations upon receiving this timeout status. This can cause overlapping 
writes if the client's internal retry completes concurrently, potentially 
resulting in data duplication or conflicts.
*Technical Analysis*
Based on our investigation (referencing 
[gerrit.cloudera.org/c/12338|https://gerrit.cloudera.org/c/12338]):
 # Each KuduRpc object contains an RpcTimeoutTask.
 # When a tablet server stalls (simulated by suspending its process with gdb), 
this task fires and:
 * Invokes the RPC's error callback
 * Returns a timeout status to the caller
 # Meanwhile, the Connection object retains the original KuduRpc.
 # Upon tablet server recovery:
 * The connection may receive exceptions/errors
 * AsyncKuduClient.handleRetryableError initiates internal retries
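
The sequence above can be illustrated with a small, self-contained model. This 
is only a sketch of the reported behavior, not Kudu client code: FakeRpc, 
FakeConnection, and simulate are hypothetical names, and the assumption that 
the connection keeps the timed-out RPC in its in-flight list is taken from the 
analysis above rather than from the client source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

final class TimeoutRetryRace {

    /** Hypothetical stand-in for a KuduRpc: a payload plus the caller-visible result. */
    static final class FakeRpc {
        final String row;
        final CompletableFuture<String> callerResult = new CompletableFuture<>();

        FakeRpc(String row) {
            this.row = row;
        }
    }

    /** Hypothetical stand-in for a Connection that retains in-flight RPCs. */
    static final class FakeConnection {
        final List<FakeRpc> inFlight = new ArrayList<>();

        void send(FakeRpc rpc) {
            inFlight.add(rpc);
        }
    }

    /** Replays the reported sequence and returns the writes the server ends up applying. */
    static List<String> simulate() {
        List<String> appliedWrites = new ArrayList<>();
        FakeConnection conn = new FakeConnection();

        // The write is sent, but the (stalled) server never answers.
        FakeRpc original = new FakeRpc("row-1");
        conn.send(original);

        // The timeout task fires: the caller sees a timeout,
        // but the RPC is NOT removed from the connection (assumption).
        original.callerResult.completeExceptionally(
                new TimeoutException("write timed out"));

        // The caller reacts to the timeout by issuing its own new write.
        if (original.callerResult.isCompletedExceptionally()) {
            appliedWrites.add("caller-retry:" + original.row);
        }

        // The server recovers: the connection still holds the original RPC,
        // so it is retried internally as well.
        for (FakeRpc rpc : conn.inFlight) {
            appliedWrites.add("internal-retry:" + rpc.row);
        }
        return appliedWrites;
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

In this model the same row is written twice, once by the caller's new write 
and once by the internal retry, which is the overlap described in the issue.
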
*Environment*
 * Kudu Version: 1.15.0
*Key Question*
Is this dual-retry behavior (caller plus client) intentional? We found no 
existing issues in the community that address this scenario. We'd appreciate 
your insights on:
 * Whether this constitutes a bug
 * Recommended solutions or configuration adjustments
We're committed to providing additional details and contributing fixes if 
needed. Thank you for your time and expertise!
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
