[jira] [Updated] (KUDU-3668) Kudu Client Write Retry Behavior After Timeout

shen yushi (Jira) Sat, 14 Jun 2025 20:32:53 -0700


     [ 
https://issues.apache.org/jira/browse/KUDU-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


shen yushi updated KUDU-3668:
-----------------------------
    Description: 
*Issue Description*
The Kudu client continues internal retries of write requests _after_ returning 
a timeout status to the caller. In our implementation, callers may initiate new 
write operations upon receiving this timeout status. This can cause overlapping 
writes if the client's internal retry completes concurrently, potentially 
resulting in data duplication or conflicts.
*Technical Analysis*
Based on our investigation (referencing 
[gerrit.cloudera.org/c/12338|https://gerrit.cloudera.org/c/12338]):
 # Each KuduRpc object contains an RpcTimeoutTask.
 # When a tablet server stalls (simulated by suspending the process with gdb), 
this task fires: it invokes the RPC's error callback and returns a timeout 
status to the caller.
 # Meanwhile, the Connection object retains the original KuduRpc.
 # Upon tablet server recovery, the connection may receive exceptions/errors, 
and AsyncKuduClient.handleRetryableError then initiates internal retries of 
that RPC.
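
To make the sequence above concrete, here is a minimal, deterministic sketch of 
the race in plain Java. It is a toy model only: FakeServer, FakeRpc, and the 
guardRetries flag are hypothetical names invented for illustration, not Kudu 
classes or options, and the guard shows one possible mitigation under that 
assumption rather than a confirmed fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of the reported sequence; none of these types are Kudu classes. */
public class RetryAfterTimeoutSketch {

    /** Hypothetical stand-in for a tablet server: records every applied write. */
    static final class FakeServer {
        final List<String> applied = new ArrayList<>();
        void apply(String row) { applied.add(row); }
    }

    /** Hypothetical stand-in for an RPC that the connection still references. */
    static final class FakeRpc {
        final String row;
        // A real client would read this flag across threads, hence AtomicBoolean.
        final AtomicBoolean timedOut = new AtomicBoolean(false);
        FakeRpc(String row) { this.row = row; }
    }

    /**
     * Replays the four steps in order and returns the writes the server applied.
     * guardRetries=true models an assumed mitigation: skip internal retries of
     * RPCs whose timeout has already been reported to the caller.
     */
    public static List<String> simulate(boolean guardRetries) {
        FakeServer server = new FakeServer();

        // Steps 1-2: the server is stalled, the timeout task fires, and the
        // caller is told that the write timed out.
        FakeRpc original = new FakeRpc("row-1 (original)");
        original.timedOut.set(true);

        // Step 3: the connection still holds the original RPC.
        List<FakeRpc> retainedByConnection = new ArrayList<>();
        retainedByConnection.add(original);

        // The caller reacts to the timeout by issuing a fresh write.
        FakeRpc callerRetry = new FakeRpc("row-1 (caller retry)");

        // Step 4: the server recovers. The caller's new write goes through,
        // and the connection retries whatever it still holds.
        server.apply(callerRetry.row);
        for (FakeRpc rpc : retainedByConnection) {
            if (guardRetries && rpc.timedOut.get()) {
                continue; // timeout already surfaced, so do not retry
            }
            server.apply(rpc.row);
        }
        return server.applied;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + simulate(false));
        System.out.println("guarded:   " + simulate(true));
    }
}
```

Without the guard, the model applies two writes for the same row (the caller's 
retry and the client's internal retry); with the guard, only the caller's retry 
is applied.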

*Environment*
 * Kudu Version: 1.15.0

*Key Question*
Is this dual-retry behavior (caller + client) intentional by design? We found 
no existing issues addressing this scenario in the community. We'd appreciate 
your insights on:
 * Whether this constitutes a bug
 * Recommended solutions or configuration adjustments
We're committed to providing additional details and contributing fixes if 
needed. Thank you for your time and expertise!
 

  was:
*Issue Description*
The Kudu client continues internal retries of write requests _after_ returning 
a timeout status to the caller. In our implementation, callers may initiate new 
write operations upon receiving this timeout status. This can cause overlapping 
writes if the client's internal retry completes concurrently, potentially 
resulting in data duplication or conflicts.
*Technical Analysis*
Based on our investigation (referencing 
[gerrit.cloudera.org/c/12338|https://gerrit.cloudera.org/c/12338]):
 # Each KuduRpc object contains an RpcTimeoutTask.
 # When a tablet server stalls (simulated via gdb process suspension), this 
task triggers:
 * Invokes the RPC's error callback
 * Returns timeout status to the caller
 # Meanwhile, the Connection object retains the original KuduRpc.
 # Upon tablet server recovery:
 * The connection may receive exceptions/errors
 * AsyncKuduClient.handleRetryableError initiates internal retries
*Environment*
 * Kudu Version: 1.15.0
*Key Question*
Is this dual-retry behavior (caller + client) an intentional design? We found 
no existing issues addressing this scenario in the community. We'd appreciate 
your insights on:
 * Whether this constitutes a bug
 * Recommended solutions or configuration adjustments
We're committed to providing additional details and contributing fixes if 
needed. Thank you for your time and expertise!
 


> Kudu Client Write Retry Behavior After Timeout
> ----------------------------------------------
>
>                 Key: KUDU-3668
>                 URL: https://issues.apache.org/jira/browse/KUDU-3668
>             Project: Kudu
>          Issue Type: Bug
>          Components: client
>            Reporter: shen yushi
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
