shen yushi created KUDU-3668:
--------------------------------
Summary: Kudu Client Write Retry Behavior After Timeout
Key: KUDU-3668
URL: https://issues.apache.org/jira/browse/KUDU-3668
Project: Kudu
Issue Type: Bug
Components: client
Reporter: shen yushi
*Issue Description*
The Kudu client keeps retrying a write request internally _after_ it has
returned a timeout status to the caller. In our implementation, the caller may
issue a new write when it receives this timeout status. If the client's
internal retry of the original write completes concurrently, the two writes
overlap, which can result in duplicated or conflicting data.
*Technical Analysis*
Based on our investigation (referencing
[gerrit.cloudera.org/c/12338|https://gerrit.cloudera.org/c/12338]):
# Each KuduRpc object contains an RpcTimeoutTask.
# When a tablet server stalls (simulated by suspending the process with gdb),
this task fires and:
* invokes the RPC's error callback
* returns a timeout status to the caller
# Meanwhile, the Connection object retains the original KuduRpc.
# Upon tablet server recovery:
* the connection may receive exceptions/errors
* AsyncKuduClient.handleRetryableError initiates internal retries of the
retained KuduRpc
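The sequence above can be modeled with a small, self-contained Java sketch. It is not Kudu code and calls no Kudu APIs: FakeRpc, FakeConnection, FakeServer and the dropOnTimeout flag are hypothetical names introduced only to illustrate the ordering of events, and the flag is a contrast case for the model, not a proposed fix for the client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Toy model of the reported sequence. None of these types are Kudu classes. */
public class TimeoutRetryRace {

    /** Hypothetical stand-in for a write RPC and the result handed to the caller. */
    static final class FakeRpc {
        final String row;
        final CompletableFuture<String> callerResult = new CompletableFuture<>();
        FakeRpc(String row) { this.row = row; }
    }

    /** Hypothetical tablet server that records every write it applies. */
    static final class FakeServer {
        final List<String> applied = new ArrayList<>();
        void apply(FakeRpc rpc) { applied.add(rpc.row); }
    }

    /** Hypothetical connection that tracks RPCs which have not been answered yet. */
    static final class FakeConnection {
        final List<FakeRpc> inFlight = new ArrayList<>();
        final boolean dropOnTimeout; // assumption: contrast case only

        FakeConnection(boolean dropOnTimeout) { this.dropOnTimeout = dropOnTimeout; }

        /** The server is stalled, so the RPC just sits in the in-flight list. */
        void send(FakeRpc rpc) { inFlight.add(rpc); }

        /** Step 2: the timeout task fires and the caller sees a timeout. */
        void onTimeout(FakeRpc rpc) {
            rpc.callerResult.complete("TIMED_OUT");
            if (dropOnTimeout) {
                inFlight.remove(rpc);
            }
            // Step 3: without the flag, the connection still holds the RPC.
        }

        /** Step 4: after recovery, every retained RPC is retried and applied. */
        void onServerRecovered(FakeServer server) {
            for (FakeRpc rpc : new ArrayList<>(inFlight)) {
                server.apply(rpc);
                rpc.callerResult.complete("OK"); // no effect if already timed out
            }
            inFlight.clear();
        }
    }

    /** Runs the scenario and returns the writes the server ended up applying. */
    static List<String> run(boolean dropOnTimeout) {
        FakeServer server = new FakeServer();
        FakeConnection conn = new FakeConnection(dropOnTimeout);

        FakeRpc first = new FakeRpc("row-1");
        conn.send(first);
        conn.onTimeout(first);

        // The caller reacts to the timeout by issuing a new write for the same row.
        if ("TIMED_OUT".equals(first.callerResult.join())) {
            conn.send(new FakeRpc("row-1"));
        }

        conn.onServerRecovered(server);
        return server.applied;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints [row-1, row-1]
        System.out.println(run(true));  // prints [row-1]
    }
}
```

In this model the overlapping write appears only when the timed-out RPC stays registered with the connection, which matches the behavior described in steps 3 and 4.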
*Environment*
* Kudu Version: 1.15.0
*Key Question*
Is this dual-retry behavior (the caller and the client both retrying the same
write) intentional? We found no existing issue in the community that addresses
this scenario. We'd appreciate your insights on:
* Whether this constitutes a bug
* Recommended solutions or configuration adjustments
We're committed to providing additional details and contributing fixes if
needed. Thank you for your time and expertise!