[ https://issues.apache.org/jira/browse/SPARK-26175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-26175.
----------------------------------
    Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 25138
[https://github.com/apache/spark/pull/25138]

> PySpark cannot terminate worker process if user program reads from stdin
> ------------------------------------------------------------------------
>
>                 Key: SPARK-26175
>                 URL: https://issues.apache.org/jira/browse/SPARK-26175
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Ala Luszczak
>            Assignee: Weichen Xu
>            Priority: Major
>              Labels: Hydrogen
>             Fix For: 3.0.0
>
> The PySpark worker daemon reads from stdin the PIDs of the workers it should kill:
> https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127
> However, each worker process is forked from the daemon process, and stdin is not closed in the child after the fork. This means that the child, and therefore the user program, can read from stdin as well, which blocks the daemon from receiving the PIDs to kill. This can cause further problems, because the task reaper may detect that the task was not terminated and eventually kill the JVM.
> Possible fixes (sketched after this message):
> * Close stdin of the worker process right after the fork.
> * Create a new socket for receiving the PIDs to kill instead of using stdin.
> h4. Steps to reproduce
> # Paste the following code in pyspark:
> {code}
> import subprocess
>
> def task(_):
>     # "cat" blocks forever reading the inherited stdin
>     subprocess.check_output(["cat"])
>
> sc.parallelize(range(1), 1).mapPartitions(task).count()
> {code}
> # Press CTRL+C to cancel the job.
> # The following message is displayed:
> {code}
> 18/11/26 17:52:51 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) interrupted: Attempting to kill Python Worker
> 18/11/26 17:52:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): TaskKilled (Stage cancelled)
> {code}
> # Run {{ps -xf}} to see that the {{cat}} process was in fact not killed:
> {code}
> 19773 pts/2    Sl+  0:00  |   |   \_ python
> 19803 pts/2    Sl+  0:11  |   |       \_ /usr/lib/jvm/java-8-oracle/bin/java -cp /home/ala/Repos/apache-spark-GOOD-2/conf/:/home/ala/Repos/apache-spark-GOOD-2/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --name PySparkShell pyspark-shell
> 19879 pts/2    S    0:00  |   |           \_ python -m pyspark.daemon
> 19895 pts/2    S    0:00  |   |               \_ python -m pyspark.daemon
> 19898 pts/2    S    0:00  |   |                   \_ cat
> {code}
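A minimal sketch of the stdin-based kill loop described above. This is not the actual pyspark/daemon.py code (the real daemon multiplexes stdin with its listening socket and uses its own int-reading helpers, so details differ); the framing and names here are illustrative assumptions:

{code}
import os
import signal
import struct

def handle_kill_requests(stdin_fd=0):
    """Illustrative daemon-side loop: read worker PIDs from stdin and kill them."""
    while True:
        data = os.read(stdin_fd, 4)               # blocks until bytes arrive on fd 0
        if len(data) < 4:
            break                                 # EOF: the JVM side closed the pipe
        (worker_pid,) = struct.unpack("!i", data) # big-endian int, as written by Java
        try:
            os.kill(worker_pid, signal.SIGKILL)
        except OSError:
            pass                                  # the worker already exited
{code}

Because a forked worker inherits fd 0, a blocking user program such as {{cat}} competes for the same descriptor and can consume the bytes carrying the PID, so the daemon's os.read() never sees the kill request.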
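A minimal sketch of the first proposed fix, assuming a daemon that forks workers directly: re-point the child's fd 0 at /dev/null immediately after fork(), leaving the daemon as the only reader of the real stdin. run_worker() is a hypothetical placeholder for the worker body, not Spark's API:

{code}
import os
import sys

def run_worker():
    # Hypothetical placeholder for the code that runs the user's task.
    pass

def fork_worker():
    pid = os.fork()
    if pid == 0:
        # Child: detach from the shared stdin before any user code runs.
        null_fd = os.open(os.devnull, os.O_RDONLY)
        os.dup2(null_fd, 0)          # fd 0 now yields EOF from /dev/null
        os.close(null_fd)
        sys.stdin = os.fdopen(0)     # rebind the Python-level stdin object too
        run_worker()                 # subprocesses like "cat" inherit the harmless fd 0
        os._exit(0)
    return pid                       # parent (daemon) keeps exclusive use of the real stdin
{code}

With fd 0 redirected, a child like {{cat}} reads EOF immediately instead of stealing the daemon's input, so the PID reaches the daemon and the kill proceeds. The alternative fix listed above (a dedicated socket for kill requests) avoids sharing stdin altogether.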