[ 
https://issues.apache.org/jira/browse/SPARK-55620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062358#comment-18062358
 ] 

Tae Hwan Eom commented on SPARK-55620:
--------------------------------------

Hi, I'm new to contributing to Spark, but I'd like to try fixing this issue.

> test_connect_session flaky timeout due to shutdown deadlock
> -----------------------------------------------------------
>
>                 Key: SPARK-55620
>                 URL: https://issues.apache.org/jira/browse/SPARK-55620
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Minor
>
> h2. Description
> {{test_connect_session}} occasionally times out (450 seconds) in CI. The test 
> normally completes in 20 seconds but sometimes hangs indefinitely during 
> shutdown, causing flaky test failures.
> h2. Reproduce
> This is a flaky bug with ~33% failure rate:
> 1. Run {{python/run-tests.py --testnames pyspark.sql.tests.connect.test_connect_session}}
> 2. The test may hang until it hits the 450-second timeout
> *Evidence from CI runs:*
> - [Run 22196465437|https://github.com/Yicong-Huang/spark/actions/runs/22196465437]: Cancelled after 4m10s
> - [Run 22196593939|https://github.com/Yicong-Huang/spark/actions/runs/22196593939]: Timeout after 1h22m (hung at 450s)
> - [Run 22237720726|https://github.com/Yicong-Huang/spark/actions/runs/22237720726]: Success in 20s ✓
> h2. Root Cause
> Deadlock during Python shutdown when {{ReleaseExecute}} cleanup tasks are 
> still executing:
> {code}
> Session.__del__()
>   → client.close() waits: concurrent.futures.wait(self._release_futures)
>     → Worker thread executes: ReleaseExecute() gRPC call
>       → gRPC attempts: threading.Thread().start()
>         → Python 3.12 blocks thread creation during shutdown
>           → DEADLOCK (main waits for worker, worker waits for thread)
> {code}
> Thread stacks show:
> - Main thread: blocked in {{concurrent.futures.wait()}}
> - Worker thread: blocked in {{threading.Thread.start() -> self._started.wait()}}
> The {{ReleaseExecute}} tasks are asynchronous cleanup submitted during test 
> execution. If they haven't completed when Python shuts down, gRPC's attempt 
> to spawn I/O threads gets blocked.
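
To illustrate the blocking pattern described in the Root Cause section, here is a minimal standalone sketch. It does not use Spark or gRPC, and it is not a proposed patch: the helper name `wait_for_release` and the `timeout_s` parameter are hypothetical. It only shows that passing a timeout to `concurrent.futures.wait()` lets the caller return even when one cleanup task never finishes, which is one possible way a close path could avoid waiting forever.

```python
import concurrent.futures
import threading


def wait_for_release(futures, timeout_s=1.0):
    """Wait for cleanup futures, but give up after timeout_s seconds.

    Hypothetical helper for illustration only; returns the set of
    futures that were still unfinished when the timeout expired.
    """
    done, not_done = concurrent.futures.wait(futures, timeout=timeout_s)
    for f in not_done:
        # cancel() only succeeds for tasks that have not started running.
        f.cancel()
    return not_done


# Simulate one cleanup task that completes and one that is stuck.
stuck = threading.Event()
pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
fast = pool.submit(lambda: "released")
slow = pool.submit(stuck.wait)  # blocks until the event is set

pending = wait_for_release([fast, slow], timeout_s=0.5)

# Unblock the stuck task so the demo itself exits cleanly.
stuck.set()
pool.shutdown(wait=True)
```

Without the timeout, the call to `concurrent.futures.wait()` in this sketch would block for as long as the stuck task does, which mirrors the main-thread stack reported above. Whether a bounded wait is the right fix for the Connect client is an open question; a task that is already running cannot be cancelled, so it would simply be abandoned.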



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

