Pranav Dev created SPARK-56538:
----------------------------------
Summary: Add per-RPC deadlines to Spark Connect client to prevent silent hangs
Key: SPARK-56538
URL: https://issues.apache.org/jira/browse/SPARK-56538
Project: Spark
Issue Type: New Feature
Components: Connect
Affects Versions: 4.2.0
Reporter: Pranav Dev
_Problem:_
The Spark Connect client currently has no per-RPC timeouts on its gRPC calls.
If a network connection silently dies (e.g., a load balancer drops an idle
connection, a firewall closes a stale TCP socket, or the server becomes
unreachable), the client hangs indefinitely with no error or feedback to the
user.
This affects all RPCs: ExecutePlan, AnalyzePlan, Config, Interrupt,
AddArtifacts, and others. The reattachable execution path is especially
affected because the client expects a long-lived streaming response that may go
silent.
_Proposed Solution:_
Add a configurable RpcDeadlines class (Scala case class / Python dataclass)
that assigns a per-RPC gRPC deadline to each client call. Each field controls
the timeout for one RPC type. Deadlines can be individually configured or
globally disabled via `RpcDeadlines.disabled`.
_Default deadlines:_
- Reattachable ExecutePlan / ReattachExecute: 10 minutes per stream segment
- AnalyzePlan, AddArtifacts: 1 hour
- Config, Interrupt, ReleaseSession, ArtifactStatus, CloneSession, GetStatus,
FetchErrorDetails: 10 minutes
- Non-reattachable ExecutePlan: no deadline (a timeout would kill an execution
that cannot be recovered)
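A minimal Python sketch of how the defaults listed above could map onto a dataclass. The field names and the `disabled()` classmethod shape are assumptions made for illustration, not the actual API in the patch; only the timeout values come from this issue.

```python
from dataclasses import dataclass, fields
from typing import Optional

_MINUTE = 60.0
_HOUR = 60 * _MINUTE


@dataclass(frozen=True)
class RpcDeadlines:
    """Per-RPC deadlines in seconds; None means "no deadline".

    Field names are hypothetical. Default values follow the list above.
    """

    # Reattachable execution: the deadline applies per stream segment.
    execute_plan_reattachable: Optional[float] = 10 * _MINUTE
    reattach_execute: Optional[float] = 10 * _MINUTE
    # Non-reattachable execution: no deadline, since a timeout is unrecoverable.
    execute_plan_non_reattachable: Optional[float] = None
    analyze_plan: Optional[float] = _HOUR
    add_artifacts: Optional[float] = _HOUR
    config: Optional[float] = 10 * _MINUTE
    interrupt: Optional[float] = 10 * _MINUTE
    release_session: Optional[float] = 10 * _MINUTE
    artifact_status: Optional[float] = 10 * _MINUTE
    clone_session: Optional[float] = 10 * _MINUTE
    get_status: Optional[float] = 10 * _MINUTE
    fetch_error_details: Optional[float] = 10 * _MINUTE

    @classmethod
    def disabled(cls) -> "RpcDeadlines":
        """Return an instance with every deadline turned off."""
        return cls(**{f.name: None for f in fields(cls)})
```

Under these assumptions, a single deadline could be overridden with `dataclasses.replace(RpcDeadlines(), analyze_plan=300.0)`, and `RpcDeadlines.disabled()` would turn all of them off.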
_Key design decisions:_
- Reattachable execute path: When a deadline fires mid-stream, the client
transparently reattaches via ReattachExecute with a fresh deadline. The
server-side operation continues running. This is invisible to the user.
- Non-reattachable execute path: No deadline applied because a timeout would
kill the server-side execution with no recovery mechanism.
- All other RPCs: When a deadline fires, the call fails with a
DEADLINE_EXCEEDED error that carries a user-friendly hint about how to
configure or disable deadlines.
- Retry behavior: DEADLINE_EXCEEDED is not retried by the default retry
policy. The reattachable path handles it directly in the iterator via
RetryException.
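The reattachable path described above can be sketched in Python roughly as follows. This is a simplified simulation, not the client's real iterator: `DeadlineExceeded`, `iterate_with_reattach`, `make_fake_server`, and the `max_reattaches` bound are all hypothetical names introduced for illustration, and integer response ids stand in for real responses.

```python
from typing import Callable, Iterator, Optional


class DeadlineExceeded(Exception):
    """Stand-in for a gRPC DEADLINE_EXCEEDED status (hypothetical)."""


def iterate_with_reattach(
    open_stream: Callable[[Optional[int], float], Iterator[int]],
    deadline_seconds: float,
    max_reattaches: int = 100,  # safety bound for this sketch only
) -> Iterator[int]:
    """Yield responses, reattaching with a fresh deadline on each timeout."""
    last_seen: Optional[int] = None
    reattaches = 0
    while True:
        try:
            # Each call opens a new stream segment that resumes after
            # last_seen and carries its own deadline.
            for response_id in open_stream(last_seen, deadline_seconds):
                last_seen = response_id
                yield response_id
            return  # the stream finished normally
        except DeadlineExceeded:
            reattaches += 1
            if reattaches > max_reattaches:
                raise


def make_fake_server(total: int, fail_after: int):
    """Fake server: each segment times out after fail_after responses."""

    def open_stream(last_seen: Optional[int], deadline_seconds: float):
        start = 0 if last_seen is None else last_seen + 1
        for sent, response_id in enumerate(range(start, total)):
            if sent == fail_after:
                raise DeadlineExceeded()
            yield response_id

    return open_stream


results = list(iterate_with_reattach(make_fake_server(5, 2), 600.0))
```

In this simulation the caller receives every response exactly once, even though the stream timed out twice along the way, which is the sense in which reattachment is invisible to the user.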