[
https://issues.apache.org/jira/browse/SPARK-53900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Herman van Hövell reassigned SPARK-53900:
-----------------------------------------
Assignee: Venkata Sai Akhil Gudesa
> Thread.wait(0) unintentionally called under rare conditions in
> ExecuteGrpcResponseSender
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-53900
> URL: https://issues.apache.org/jira/browse/SPARK-53900
> Project: Spark
> Issue Type: Bug
> Components: Connect
> Affects Versions: 4.1.0, 4.0.0, 4.0.1, 4.0.2, 4.0, 4.2
> Reporter: Venkata Sai Akhil Gudesa
> Assignee: Venkata Sai Akhil Gudesa
> Priority: Major
> Labels: pull-request-available
>
>
> A bug in ExecuteGrpcResponseSender causes RPC streams to hang indefinitely
> when the configured deadline passes. The bug was introduced in
> [PR|https://github.com/apache/spark/pull/49003/files#diff-d4629281431427e41afd6d3db6630bcfdbfdbf77ba74cf7e48a988c1b66c13f1L244-L253]
> during migration from System.currentTimeMillis() to System.nanoTime(), where
> an integer division error converts sub-millisecond timeout values to 0,
> triggering Java's wait(0) behavior (infinite wait).
> h2. Root Cause
> executionObserver.responseLock.wait(timeoutNs / NANOS_PER_MILLIS) // ← BUG
> {*}The Problem{*}: When deadlineTimeNs < System.nanoTime() (deadline has
> passed):
> # Math.max(1, negative_value) clamps to 1 nanosecond
> # Math.min(progressInterval_ns, 1) remains 1 nanosecond
> # Integer division: 1 / 1,000,000 = 0 milliseconds
> # wait(0) in Java means *wait indefinitely until notified*
> # No notification arrives (execution already completed), so the thread hangs forever
> While one of the loop conditions guards against deadlineTimeNs <
> System.nanoTime(), it isn't sufficient, as the deadline can elapse while
> inside the loop (the time is freshly fetched in the subsequent timeout
> calculation). The probability of occurrence can be exacerbated by GC pauses.
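
The arithmetic in the steps above can be reproduced in isolation. The following is a minimal sketch, not the actual ExecuteGrpcResponseSender code: the class name, the helper buggyWaitMillis, and the 2-second progress interval are assumptions made for illustration, and only the clamp, min, and integer-division steps are taken from the description.

```java
final class WaitTimeoutTruncation {
    static final long NANOS_PER_MILLIS = 1_000_000L;

    // Hypothetical helper mirroring the steps described in the issue.
    static long buggyWaitMillis(long deadlineTimeNs, long nowNs, long progressIntervalNs) {
        // Step 1: a passed deadline gives a negative difference, clamped to 1 ns.
        long timeoutNs = Math.max(1, deadlineTimeNs - nowNs);
        // Step 2: taking the minimum with the progress interval keeps it at 1 ns.
        timeoutNs = Math.min(progressIntervalNs, timeoutNs);
        // Step 3: integer division truncates any sub-millisecond value to 0.
        return timeoutNs / NANOS_PER_MILLIS;
    }

    public static void main(String[] args) {
        long now = System.nanoTime();
        long deadline = now - 5_000L;                        // deadline passed 5 microseconds ago
        long progressIntervalNs = 2_000L * NANOS_PER_MILLIS; // assumed 2 s interval
        // Prints 0; passing this value to Object.wait means "wait until notified".
        System.out.println(buggyWaitMillis(deadline, now, progressIntervalNs));
    }
}
```

The same truncation applies to any remaining time below one millisecond (for example 999,999 ns), so the deadline does not need to have passed for the millisecond value to become 0.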
> h2. Conditions Required for Bug to Trigger
> The bug manifests when *all* of the following conditions are met:
> # *Reattachable execution enabled* (CONNECT_EXECUTE_REATTACHABLE_ENABLED =
> true)
> # *Execution completes prior* to the deadline within the inner loop
> (all responses sent before deadline)
> # *Deadline passes* within the inner loop
> h2. Proposed fix
> Ensure the millisecond value passed to wait() is always positive, so that a
> sub-millisecond timeoutNs never truncates to 0.
> executionObserver.responseLock.wait(Math.max(1, timeoutNs / NANOS_PER_MILLIS))
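
As a rough illustration of the proposed clamp, the sketch below applies Math.max(1, timeoutNs / NANOS_PER_MILLIS) before calling wait. The class name, the helper safeWaitMillis, and the plain Object used as a lock are assumptions standing in for the real observer and its responseLock; this is not the patched Spark code.

```java
final class ClampedWaitTimeout {
    static final long NANOS_PER_MILLIS = 1_000_000L;

    // Hypothetical helper: convert to milliseconds, but never return 0.
    static long safeWaitMillis(long timeoutNs) {
        return Math.max(1, timeoutNs / NANOS_PER_MILLIS);
    }

    public static void main(String[] args) throws InterruptedException {
        Object responseLock = new Object(); // stand-in for the real lock object
        synchronized (responseLock) {
            // A 1 ns timeout becomes a bounded 1 ms wait instead of wait(0),
            // so this call returns even though nothing ever notifies the lock.
            responseLock.wait(safeWaitMillis(1));
        }
        System.out.println("wait returned without a notification");
    }
}
```

With the clamp in place, the shortest possible wait is 1 ms, after which the enclosing loop can re-check its deadline condition and exit.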