Pranav Dev created SPARK-56538:
----------------------------------

             Summary: Add per-RPC deadlines to Spark Connect client to prevent silent hangs
                 Key: SPARK-56538
                 URL: https://issues.apache.org/jira/browse/SPARK-56538
             Project: Spark
          Issue Type: New Feature
          Components: Connect
    Affects Versions: 4.2.0
            Reporter: Pranav Dev


_Problem:_

The Spark Connect client currently has no per-RPC timeouts on its gRPC calls. 
If a network connection silently dies (e.g., a load balancer drops an idle 
connection, a firewall closes a stale TCP socket, or the server becomes 
unreachable), the client hangs indefinitely with no error or feedback to the 
user.

This affects all RPCs: ExecutePlan, AnalyzePlan, Config, Interrupt, 
AddArtifacts, and others. The reattachable execution path is especially 
affected because the client expects a long-lived streaming response that may go 
silent.

_Proposed Solution:_

Add a configurable RpcDeadlines class (Scala case class / Python dataclass) 
that assigns a per-RPC gRPC deadline to each client call. Each field controls 
the timeout for one RPC type. Deadlines can be individually configured or 
globally disabled via `RpcDeadlines.disabled`.

_Default deadlines:_
 - Reattachable ExecutePlan / ReattachExecute: 10 minutes per stream segment
 - AnalyzePlan, AddArtifacts: 1 hour
 - Config, Interrupt, ReleaseSession, ArtifactStatus, CloneSession, GetStatus, 
FetchErrorDetails: 10 minutes
 - Non-reattachable ExecutePlan: no deadline (a timeout would kill the 
execution, which cannot be recovered)
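
As a rough illustration of the Python side, the proposed class and the defaults listed above might look like the sketch below. The field names, the use of seconds as the unit, and `disabled()` being a classmethod are assumptions made for this example; only the class name `RpcDeadlines` and the default durations come from the proposal.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional

MINUTE = 60.0
HOUR = 60 * MINUTE


@dataclass(frozen=True)
class RpcDeadlines:
    """Per-RPC deadlines in seconds; None means "no deadline".

    Field names are hypothetical; the default values follow the proposal.
    """
    # Applies to each stream segment of ExecutePlan / ReattachExecute.
    reattachable_execute: Optional[float] = 10 * MINUTE
    analyze_plan: Optional[float] = HOUR
    add_artifacts: Optional[float] = HOUR
    config: Optional[float] = 10 * MINUTE
    interrupt: Optional[float] = 10 * MINUTE
    release_session: Optional[float] = 10 * MINUTE
    artifact_status: Optional[float] = 10 * MINUTE
    clone_session: Optional[float] = 10 * MINUTE
    get_status: Optional[float] = 10 * MINUTE
    fetch_error_details: Optional[float] = 10 * MINUTE
    # No deadline: a timeout here would kill an unrecoverable execution.
    non_reattachable_execute: Optional[float] = None

    @classmethod
    def disabled(cls) -> "RpcDeadlines":
        """Return a configuration with every deadline turned off."""
        return cls(**{f.name: None for f in fields(cls)})


# Override a single deadline while keeping the other defaults.
custom = replace(RpcDeadlines(), config=30.0)
```

With gRPC Python, a value like `custom.config` would typically be passed as the `timeout` argument of the corresponding stub call, and `None` leaves the call without a deadline.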

_Key design decisions:_
 - Reattachable execute path: When a deadline fires mid-stream, the client 
transparently reattaches via ReattachExecute with a fresh deadline. The 
server-side operation continues running. This is invisible to the user.
 - Non-reattachable execute path: No deadline applied because a timeout would 
kill the server-side execution with no recovery mechanism.
 - All other RPCs: When a deadline fires, the call fails with a 
DEADLINE_EXCEEDED error that includes a user-friendly hint on how to configure 
or disable deadlines.
 - Retry behavior: DEADLINE_EXCEEDED is not retried by the default retry 
policy. The reattachable path handles it directly in the iterator via 
RetryException.
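
The reattach-on-deadline behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the actual client code: `FakeServer`, `DeadlineExceeded`, and `execute_with_reattach` are hypothetical names, no gRPC is involved, and a plain integer index stands in for whatever the real client uses to resume a stream.

```python
from typing import Iterator, List, Tuple


class DeadlineExceeded(Exception):
    """Stand-in for a gRPC DEADLINE_EXCEEDED status (hypothetical)."""


class FakeServer:
    """Simulated server whose operation outlives any single client stream."""

    def __init__(self, results: List[str], responses_per_segment: int):
        self.results = results
        # How many responses arrive before the segment deadline fires.
        # Must be at least 1, otherwise no segment ever makes progress.
        self.responses_per_segment = responses_per_segment
        self.streams_opened = 0

    def stream_from(self, last_seen: int) -> Iterator[Tuple[int, str]]:
        """Yield (index, result) pairs after last_seen for one stream segment."""
        self.streams_opened += 1
        sent = 0
        for i in range(last_seen + 1, len(self.results)):
            if sent == self.responses_per_segment:
                # The per-segment deadline fires mid-stream.
                raise DeadlineExceeded()
            yield i, self.results[i]
            sent += 1


def execute_with_reattach(server: FakeServer) -> Iterator[str]:
    """Yield all results, reattaching whenever a segment deadline fires."""
    last_seen = -1
    while True:
        try:
            for index, result in server.stream_from(last_seen):
                last_seen = index
                yield result
            return  # the stream completed normally
        except DeadlineExceeded:
            # Reattach after the last response seen, with a fresh deadline.
            # The caller of this iterator never observes the error.
            continue


server = FakeServer(["a", "b", "c", "d", "e"], responses_per_segment=2)
rows = list(execute_with_reattach(server))
```

In this run the caller receives all five results in order, while the simulated server sees three separate stream segments.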


