[ https://issues.apache.org/jira/browse/SPARK-56538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-56538:
-----------------------------------
Labels: pull-request-available (was: )
> Add per-RPC deadlines to Spark Connect client to prevent silent hangs
> ---------------------------------------------------------------------
>
> Key: SPARK-56538
> URL: https://issues.apache.org/jira/browse/SPARK-56538
> Project: Spark
> Issue Type: New Feature
> Components: Connect
> Affects Versions: 4.2.0
> Reporter: Pranav Dev
> Priority: Major
> Labels: pull-request-available
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> _Problem:_
> The Spark Connect client currently has no per-RPC timeouts on its gRPC calls.
> If a network connection silently dies (e.g., a load balancer drops an idle
> connection, a firewall closes a stale TCP socket, or the server becomes
> unreachable), the client hangs indefinitely with no error or feedback to the
> user.
> This affects all RPCs: ExecutePlan, AnalyzePlan, Config, Interrupt,
> AddArtifacts, and others. The reattachable execution path is especially
> affected because the client expects a long-lived streaming response that may
> go silent.
>
> _Proposed Solution:_
> Add a configurable RpcDeadlines class (Scala case class / Python dataclass)
> that assigns a per-RPC gRPC deadline to each client call. Each field controls
> the timeout for one RPC type. Deadlines can be individually configured or
> globally disabled via `RpcDeadlines.disabled`.
>
> _Default deadlines:_
> - Reattachable ExecutePlan / ReattachExecute: 10 minutes per stream segment
> - AnalyzePlan, AddArtifacts: 1 hour
> - Config, Interrupt, ReleaseSession, ArtifactStatus, CloneSession,
> GetStatus, FetchErrorDetails: 10 minutes
> - Non-reattachable ExecutePlan: no deadline (a timeout would kill the
> execution with no way to recover it)
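
To make the defaults listed above concrete, here is a minimal Python sketch of what such a dataclass might look like. Only the class name `RpcDeadlines`, the `disabled` accessor, and the durations come from the description. The field names, the use of seconds as the unit, `None` meaning "no deadline", and the `timeout_for` helper are assumptions made for illustration, not the API of the actual pull request.

```python
from dataclasses import dataclass, fields
from typing import Optional

MINUTE = 60.0
HOUR = 60 * MINUTE


@dataclass(frozen=True)
class RpcDeadlines:
    """Per-RPC deadlines in seconds; None means no deadline.

    Field names are hypothetical; the durations follow the issue text.
    """

    # Reattachable streaming RPCs: deadline applies per stream segment.
    reattachable_execute_plan: Optional[float] = 10 * MINUTE
    reattach_execute: Optional[float] = 10 * MINUTE
    # Potentially long-running unary RPCs.
    analyze_plan: Optional[float] = 1 * HOUR
    add_artifacts: Optional[float] = 1 * HOUR
    # Short control-plane RPCs.
    config: Optional[float] = 10 * MINUTE
    interrupt: Optional[float] = 10 * MINUTE
    release_session: Optional[float] = 10 * MINUTE
    artifact_status: Optional[float] = 10 * MINUTE
    clone_session: Optional[float] = 10 * MINUTE
    get_status: Optional[float] = 10 * MINUTE
    fetch_error_details: Optional[float] = 10 * MINUTE
    # Non-reattachable execution cannot recover from a timeout.
    execute_plan: Optional[float] = None

    @classmethod
    def disabled(cls) -> "RpcDeadlines":
        """Return an instance with every deadline turned off."""
        return cls(**{f.name: None for f in fields(cls)})

    def timeout_for(self, rpc: str) -> Optional[float]:
        """Look up the deadline for one RPC by its (hypothetical) field name."""
        return getattr(self, rpc)


# Override a single deadline, or disable all of them.
custom = RpcDeadlines(config=30.0)
no_deadlines = RpcDeadlines.disabled()
```

In the Python gRPC API a value like `custom.timeout_for("config")` could be passed as the `timeout=` argument of a stub call, where `None` leaves the call without a deadline. How the real client wires this in is not stated in the issue.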
>
> _Key design decisions:_
> - Reattachable execute path: When a deadline fires mid-stream, the client
> transparently reattaches via ReattachExecute with a fresh deadline. The
> server-side operation continues running. This is invisible to the user.
> - Non-reattachable execute path: No deadline applied because a timeout would
> kill the server-side execution with no recovery mechanism.
> - All other RPCs: When a deadline fires, the call fails with a
> DEADLINE_EXCEEDED error that carries a user-friendly hint about how to
> configure or disable deadlines.
> - Retry behavior: DEADLINE_EXCEEDED is not retried by the default retry
> policy. The reattachable path handles it directly in the iterator via
> RetryException.
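
The reattachable path described above can be sketched as a small simulation in plain Python. Everything here is a stand-in: `DeadlineExceeded` plays the role of a gRPC DEADLINE_EXCEEDED status, `open_stream` is a hypothetical callable that opens a stream segment with a fresh deadline, and resuming by response index is a simplification chosen for the example. The issue says the real client handles this inside its iterator via RetryException, so treat this as an illustration of the idea only.

```python
from typing import Callable, Iterator, List, Tuple


class DeadlineExceeded(Exception):
    """Stand-in for a gRPC DEADLINE_EXCEEDED status (hypothetical)."""


def reattachable_results(
    open_stream: Callable[[int, float], Iterator[str]],
    deadline_s: float,
    max_reattaches: int = 100,
) -> Iterator[str]:
    """Yield every response, reattaching when a stream segment times out.

    open_stream(resume_from, deadline_s) opens a segment that starts at
    response index `resume_from` and is bounded by a fresh deadline.
    """
    received = 0
    reattaches = 0
    while True:
        try:
            for item in open_stream(received, deadline_s):
                received += 1
                yield item
            return  # the stream finished normally
        except DeadlineExceeded:
            # The server-side operation keeps running; resume where we left off.
            reattaches += 1
            if reattaches > max_reattaches:
                raise


def make_fake_server(
    responses: List[str], segment_len: int
) -> Tuple[Callable[[int, float], Iterator[str]], List[int]]:
    """Fake server whose segments 'time out' after segment_len responses."""
    calls: List[int] = []

    def open_stream(resume_from: int, deadline_s: float) -> Iterator[str]:
        calls.append(resume_from)
        end = resume_from + segment_len
        for response in responses[resume_from:end]:
            yield response
        if end < len(responses):
            raise DeadlineExceeded()

    return open_stream, calls


open_stream, calls = make_fake_server(["r0", "r1", "r2", "r3", "r4"], 2)
results = list(reattachable_results(open_stream, deadline_s=600.0))
```

From the caller's point of view `results` is just the full response sequence, while `calls` records that the stream was reopened at offsets 0, 2, and 4. The `max_reattaches` cap is an added safeguard for the sketch and is not mentioned in the issue.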