[ 
https://issues.apache.org/jira/browse/SPARK-56538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-56538:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add per-RPC deadlines to Spark Connect client to prevent silent hangs
> ---------------------------------------------------------------------
>
>                 Key: SPARK-56538
>                 URL: https://issues.apache.org/jira/browse/SPARK-56538
>             Project: Spark
>          Issue Type: New Feature
>          Components: Connect
>    Affects Versions: 4.2.0
>            Reporter: Pranav Dev
>            Priority: Major
>              Labels: pull-request-available
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> _Problem:_
> The Spark Connect client currently has no per-RPC timeouts on its gRPC calls. 
> If a network connection silently dies (e.g., a load balancer drops an idle 
> connection, a firewall closes a stale TCP socket, or the server becomes 
> unreachable), the client hangs indefinitely with no error or feedback to the 
> user.
> This affects all RPCs: ExecutePlan, AnalyzePlan, Config, Interrupt, 
> AddArtifacts, and others. The reattachable execution path is especially 
> affected because the client expects a long-lived streaming response that may 
> go silent.
>  
> _Proposed Solution:_
> Add a configurable RpcDeadlines class (Scala case class / Python dataclass) 
> that assigns a per-RPC gRPC deadline to each client call. Each field controls 
> the timeout for one RPC type. Deadlines can be individually configured or 
> globally disabled via `RpcDeadlines.disabled`.
>  
> _Default deadlines:_
>  - Reattachable ExecutePlan / ReattachExecute: 10 minutes per stream segment
>  - AnalyzePlan, AddArtifacts: 1 hour
>  - Config, Interrupt, ReleaseSession, ArtifactStatus, CloneSession, 
> GetStatus, FetchErrorDetails: 10 minutes
>  - Non-reattachable ExecutePlan: no deadline (a timeout would kill the 
> execution with no way to recover it)
>  
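
The defaults listed above could be captured in a Python dataclass along these lines. This is a minimal sketch, not the code from the pull request: the field names, the choice of seconds as the unit, and modelling `disabled` as a classmethod are all assumptions. Only the class name RpcDeadlines and the default values come from the description.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional

TEN_MINUTES = 10 * 60.0  # seconds
ONE_HOUR = 60 * 60.0     # seconds


@dataclass(frozen=True)
class RpcDeadlines:
    """Per-RPC deadlines in seconds; None means "no deadline".

    Field names are hypothetical; defaults follow the issue description.
    """

    # Applies to each stream segment of reattachable ExecutePlan / ReattachExecute.
    reattachable_execute: Optional[float] = TEN_MINUTES
    # Non-reattachable ExecutePlan: a timeout could not be recovered from.
    execute_plan: Optional[float] = None
    analyze_plan: Optional[float] = ONE_HOUR
    add_artifacts: Optional[float] = ONE_HOUR
    config: Optional[float] = TEN_MINUTES
    interrupt: Optional[float] = TEN_MINUTES
    release_session: Optional[float] = TEN_MINUTES
    artifact_status: Optional[float] = TEN_MINUTES
    clone_session: Optional[float] = TEN_MINUTES
    get_status: Optional[float] = TEN_MINUTES
    fetch_error_details: Optional[float] = TEN_MINUTES

    @classmethod
    def disabled(cls) -> "RpcDeadlines":
        """Return an instance with every deadline switched off."""
        return cls(**{f.name: None for f in fields(cls)})


# Example: shorten only the Config deadline, keep every other default.
custom = replace(RpcDeadlines(), config=30.0)
```

A value from such a class would typically be handed to the gRPC call as its `timeout` argument, where `None` leaves the call without a deadline.
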
> _Key design decisions:_
>  - Reattachable execute path: When a deadline fires mid-stream, the client 
> transparently reattaches via ReattachExecute with a fresh deadline. The 
> server-side operation continues running. This is invisible to the user.
>  - Non-reattachable execute path: No deadline applied because a timeout would 
> kill the server-side execution with no recovery mechanism.
>  - All other RPCs: When a deadline fires, the call fails with a 
> DEADLINE_EXCEEDED error that carries a user-friendly hint about how to 
> configure or disable deadlines.
>  - Retry behavior: DEADLINE_EXCEEDED is not retried by the default retry 
> policy. The reattachable path handles it directly in the iterator via 
> RetryException.
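
The reattach-on-deadline behaviour described for the reattachable path can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not use gRPC or any Spark Connect code, `DeadlineExceeded` stands in for a DEADLINE_EXCEEDED status, and `open_segment`, `reattaching_iterator`, and `make_fake_server` are hypothetical names. Integer response ids replace the real response identifiers.

```python
from typing import Callable, Iterator, Optional


class DeadlineExceeded(Exception):
    """Stand-in for a gRPC DEADLINE_EXCEEDED status (hypothetical)."""


def reattaching_iterator(
    open_segment: Callable[[Optional[int]], Iterator[int]],
) -> Iterator[int]:
    """Yield all responses, reattaching whenever a segment's deadline fires.

    open_segment(last_seen) opens a new stream segment that resumes after
    the response id `last_seen` (None means start from the beginning).
    """
    last_seen: Optional[int] = None
    while True:
        try:
            for response_id in open_segment(last_seen):
                last_seen = response_id
                yield response_id
            return  # the stream finished normally
        except DeadlineExceeded:
            # The server-side operation keeps running; open a fresh segment
            # (with a fresh deadline) that resumes after the last response.
            continue


def make_fake_server(total: int, per_segment: int):
    """Simulate a server whose stream segments expire after a few responses."""

    def open_segment(last_seen: Optional[int]) -> Iterator[int]:
        start = 0 if last_seen is None else last_seen + 1
        end = min(start + per_segment, total)
        for i in range(start, end):
            yield i
        if end < total:
            raise DeadlineExceeded()  # deadline fires mid-stream

    return open_segment


responses = list(reattaching_iterator(make_fake_server(total=5, per_segment=2)))
```

The caller iterating over `reattaching_iterator` never sees the deadline firing, which mirrors the "invisible to the user" goal stated in the description. A real client would also need to handle a server that stays unreachable across reattach attempts, which this sketch leaves out.
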



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

