[ 
https://issues.apache.org/jira/browse/SPARK-53900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell reassigned SPARK-53900:
-----------------------------------------

    Assignee: Venkata Sai Akhil Gudesa

> Thread.wait(0) unintentionally called under rare conditions in 
> ExecuteGrpcResponseSender
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-53900
>                 URL: https://issues.apache.org/jira/browse/SPARK-53900
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 4.1.0, 4.0.0, 4.0.1, 4.0.2, 4.0, 4.2
>            Reporter: Venkata Sai Akhil Gudesa
>            Assignee: Venkata Sai Akhil Gudesa
>            Priority: Major
>              Labels: pull-request-available
>
>  
> A bug in ExecuteGrpcResponseSender causes RPC streams to hang indefinitely 
> when the configured deadline passes. The bug was introduced in 
> [PR|https://github.com/apache/spark/pull/49003/files#diff-d4629281431427e41afd6d3db6630bcfdbfdbf77ba74cf7e48a988c1b66c13f1L244-L253]
>  during migration from System.currentTimeMillis() to System.nanoTime(), where 
> an integer division error converts sub-millisecond timeout values to 0, 
> triggering Java's wait(0) behavior (infinite wait).
> h2. Root Cause
> executionObserver.responseLock.wait(timeoutNs / NANOS_PER_MILLIS)  // ← BUG
> {*}The Problem{*}: When deadlineTimeNs < System.nanoTime() (deadline has 
> passed):
>  # Math.max(1, negative_value) clamps to 1 nanosecond
>  # Math.min(progressInterval_ns, 1) remains 1 nanosecond
>  # Integer division: 1 / 1,000,000 = 0 milliseconds
>  # wait(0) in Java means *wait indefinitely until notified*
>  # No notification arrives (execution already completed), thread hangs forever
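
The steps above can be sketched as a small standalone Java program. This is only an illustration of the arithmetic, not the actual ExecuteGrpcResponseSender code: the class name, method name, parameter names, and the 2-second progress interval are assumptions made for the example.

```java
final class TimeoutTruncation {
    static final long NANOS_PER_MILLIS = 1_000_000L;

    // Hypothetical helper mirroring the listed steps; returns the value that
    // would be handed to wait() as its millisecond timeout.
    static long buggyWaitMillis(long deadlineTimeNs, long nowNs, long progressIntervalNs) {
        long timeoutNs = Math.max(1, deadlineTimeNs - nowNs); // negative difference clamps to 1 ns
        timeoutNs = Math.min(progressIntervalNs, timeoutNs);  // still 1 ns
        return timeoutNs / NANOS_PER_MILLIS;                  // integer division: 1 / 1_000_000 == 0
    }

    public static void main(String[] args) {
        long now = System.nanoTime();
        long deadline = now - 5_000L;                      // deadline passed 5 microseconds ago
        long progressInterval = 2_000L * NANOS_PER_MILLIS; // assumed 2 s progress interval
        System.out.println(buggyWaitMillis(deadline, now, progressInterval)); // prints 0
    }
}
```

With this formula, any remaining time below one millisecond (not only an already-passed deadline) also truncates to 0.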
> While one of the loop conditions guards against deadlineTimeNs < 
> System.nanoTime(), it isn’t sufficient, because the deadline can elapse while 
> inside the loop (the time is fetched afresh in the later timeout 
> calculation). GC pauses can make this more likely.
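
To make the race concrete, here is a minimal Java sketch that replaces System.nanoTime() with a scripted clock, so the guard and the timeout calculation see different readings. Everything here (DeadlineRace, scriptedClock, waitMillisForOneIteration, the -1 sentinel) is hypothetical and only models the shape of the loop described above, not the real implementation.

```java
import java.util.function.LongSupplier;

final class DeadlineRace {
    static final long NANOS_PER_MILLIS = 1_000_000L;

    // Hypothetical clock that returns the given readings one after another.
    static LongSupplier scriptedClock(long... readings) {
        int[] next = {0};
        return () -> readings[next[0]++];
    }

    // Models one loop iteration: returns -1 if the guard stops the loop,
    // otherwise the millisecond value that would be passed to wait().
    static long waitMillisForOneIteration(LongSupplier clock, long deadlineTimeNs) {
        if (deadlineTimeNs < clock.getAsLong()) {
            return -1L; // guard: deadline already passed, no wait is issued
        }
        // Second, fresh clock read: the deadline may have elapsed in between.
        long timeoutNs = Math.max(1, deadlineTimeNs - clock.getAsLong());
        return timeoutNs / NANOS_PER_MILLIS;
    }

    public static void main(String[] args) {
        // Guard reads 999 ns (before the deadline), the calculation reads 1001 ns (after it).
        LongSupplier clock = scriptedClock(999L, 1_001L);
        System.out.println(waitMillisForOneIteration(clock, 1_000L)); // prints 0
    }
}
```

The guard passes on the first reading, but the second reading is past the deadline, so the computed timeout is 1 ns and the millisecond value becomes 0.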
> h2. Conditions Required for Bug to Trigger
> The bug manifests when *all* of the following conditions are met:
>  # *Reattachable execution enabled* (CONNECT_EXECUTE_REATTACHABLE_ENABLED = 
> true)
>  # *Execution completes prior* to the deadline within the inner loop (all 
> responses sent before deadline)
>  # *Deadline passes* within the inner loop
> h2. Proposed fix
> Ensure the millisecond value passed to wait() is always positive, so that a 
> sub-millisecond timeoutNs is never truncated to 0:
> executionObserver.responseLock.wait(Math.max(1, timeoutNs / NANOS_PER_MILLIS))
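
A minimal Java sketch of the clamped wait, assuming a plain Object as the lock and with nobody calling notify. ClampedWait, fixedWaitMillis, and waitOnce are hypothetical names used for illustration; only the Math.max(1, timeoutNs / NANOS_PER_MILLIS) expression comes from the proposed fix.

```java
final class ClampedWait {
    static final long NANOS_PER_MILLIS = 1_000_000L;

    // The clamp from the proposed fix: never returns 0, so wait() always times out.
    static long fixedWaitMillis(long timeoutNs) {
        return Math.max(1, timeoutNs / NANOS_PER_MILLIS);
    }

    // Waits once on the lock with the clamped timeout and returns the elapsed nanoseconds.
    static long waitOnce(Object lock, long timeoutNs) {
        long start = System.nanoTime();
        synchronized (lock) {
            try {
                lock.wait(fixedWaitMillis(timeoutNs)); // at least 1 ms, never wait(0)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // timeoutNs of 1 ns would be wait(0) without the clamp; here the call returns.
        long elapsedNs = waitOnce(new Object(), 1L);
        System.out.println("returned after about " + elapsedNs / NANOS_PER_MILLIS + " ms");
    }
}
```

The clamp rounds sub-millisecond timeouts up to 1 ms, so the thread may wait slightly longer than the remaining time, but it returns without needing a notification.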



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

