[
https://issues.apache.org/jira/browse/KUDU-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17976550#comment-17976550
]
shen yushi commented on KUDU-3668:
----------------------------------
I've added a unit test in my fork of the latest master branch:
*[https://github.com/small-turtle-1/kudu/tree/client_retry_bug]*
The test introduces a new flag {{writer_inject_latency_reject_first_ms}} that
behaves as follows:
# For each set of identical write requests, the first request is deliberately
delayed, after which a {{ServiceUnavailable}} status is returned to the client.
# Subsequent requests are processed normally.
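As a rough illustration of the flag's behavior described above, here is a minimal Python sketch. It is not the code in the fork: the class name, the {{request_id}} key used to recognize identical requests, and the status strings are all assumptions made for this example.

```python
import time


class FirstAttemptRejector:
    """Hypothetical stand-in for the tserver-side fault injection."""

    def __init__(self, delay_ms):
        self.delay_s = delay_ms / 1000.0
        self.seen = set()    # request ids that were already delayed and rejected
        self.applied = []    # request ids that were actually written

    def handle_write(self, request_id):
        if request_id not in self.seen:
            # First time this request is seen: stall, then reject it.
            self.seen.add(request_id)
            time.sleep(self.delay_s)
            return "ServiceUnavailable"
        # Any later request with the same id is processed normally.
        self.applied.append(request_id)
        return "OK"
```

With this model, a write only lands if something sends the same request a second time, which is what makes a stray retry observable.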
Test scenario:
# Tablet servers are launched with this flag set.
# A client session is opened with a short timeout.
# Because of the timeout, the Kudu client returns a {{TimedOut}} status to the
caller.
# The test waits long enough for the tserver to respond, accounting for
potential erroneous client retries.
# The test checks whether the write request was accepted (it is expected not
to be).
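The scenario above can be approximated with a small, self-contained Python simulation. This is a toy model under stated assumptions, not Kudu client or test code: {{RejectFirstServer}}, {{write_with_timeout}}, {{run_scenario}}, and the {{retry_after_timeout}} switch are hypothetical names, and the timings are arbitrary.

```python
import threading
import time


class RejectFirstServer:
    """Hypothetical tserver stand-in: delay and reject the first attempt only."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self.lock = threading.Lock()
        self.seen = set()
        self.applied = []

    def handle_write(self, request_id):
        with self.lock:
            first = request_id not in self.seen
            self.seen.add(request_id)
        if first:
            time.sleep(self.delay_s)
            return "ServiceUnavailable"
        with self.lock:
            self.applied.append(request_id)
        return "OK"


def write_with_timeout(server, request_id, timeout_s, retry_after_timeout):
    """Toy client. Returns (status seen by the caller, background worker)."""
    done = threading.Event()
    deadline = time.monotonic() + timeout_s

    def attempt_loop():
        while True:
            if server.handle_write(request_id) == "OK":
                done.set()
                return
            # Retryable error. A client that respects the deadline stops
            # here, because the caller has already been told TimedOut.
            if not retry_after_timeout and time.monotonic() >= deadline:
                return

    worker = threading.Thread(target=attempt_loop)
    worker.start()
    status = "OK" if done.wait(timeout_s) else "TimedOut"
    return status, worker


def run_scenario(retry_after_timeout):
    server = RejectFirstServer(delay_s=0.2)
    status, worker = write_with_timeout(server, "w1", 0.05, retry_after_timeout)
    worker.join()  # give the server time to respond and any retry time to land
    return status, list(server.applied)
```

In this model, {{run_scenario(True)}} reports {{TimedOut}} to the caller and still applies the write, which is the behavior the test is meant to catch; {{run_scenario(False)}} reports {{TimedOut}} and applies nothing.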
> Kudu Client Write Retry Behavior After Timeout
> ----------------------------------------------
>
> Key: KUDU-3668
> URL: https://issues.apache.org/jira/browse/KUDU-3668
> Project: Kudu
> Issue Type: Bug
> Components: client
> Reporter: shen yushi
> Priority: Major
>
> *Issue Description*
> The Kudu client continues internal retries of write requests _after_
> returning a timeout status to the caller. In our implementation, callers may
> initiate new write operations upon receiving this timeout status. This can
> cause overlapping writes if the client's internal retry completes
> concurrently, potentially resulting in data duplication or conflicts.
> *Technical Analysis*
> Based on our investigation (referencing
> [gerrit.cloudera.org/c/12338|https://gerrit.cloudera.org/c/12338]):
> # Each KuduRpc object contains an RpcTimeoutTask.
> # When a tablet server stalls (simulated via gdb process suspension), this
> task fires: it invokes the RPC's error callback and returns a timeout status
> to the caller.
> # Meanwhile, the Connection object retains the original KuduRpc.
> # Upon tablet server recovery, the connection may receive exceptions/errors,
> and AsyncKuduClient.handleRetryableError then initiates internal retries.
> *Environment*
> * Kudu Version: 1.15.0
> *Key Question*
> Is this dual-retry behavior (caller + client) an intentional design? We found
> no existing issues addressing this scenario in the community. We'd appreciate
> your insights on:
> * Whether this constitutes a bug
> * Recommended solutions or configuration adjustments
> We're committed to providing additional details and contributing fixes if
> needed. Thank you for your time and expertise!
>