[jira] [Created] (SPARK-44346) Python worker exited unexpectedly - java.io.EOFException on DataInputStream.readInt - cluster doesn't terminate

Dmitry Goldenberg (Jira) Sun, 09 Jul 2023 07:55:44 -0700

Dmitry Goldenberg created SPARK-44346:
-----------------------------------------


             Summary: Python worker exited unexpectedly - java.io.EOFException on DataInputStream.readInt - cluster doesn't terminate
                 Key: SPARK-44346
                 URL: https://issues.apache.org/jira/browse/SPARK-44346
             Project: Spark
          Issue Type: Question
          Components: Spark Core
    Affects Versions: 3.3.2
         Environment: AWS EMR emr-6.11.0
Spark 3.3.2
pandas 1.3.5
pyarrow 12.0.0

"spark.sql.shuffle.partitions": "210",
"spark.default.parallelism": "210",
"spark.yarn.stagingDir": "hdfs:///tmp",
"spark.sql.adaptive.enabled": "true",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.sql.execution.arrow.pyspark.enabled": "true",
"spark.dynamicAllocation.enabled": "false",
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
            Reporter: Dmitry Goldenberg


I am getting the exception below, logged at WARN level. Apparently, a Python worker crashes.

Multiple issues here:
- What is the cause of the crash? Is it related to pyarrow, perhaps some kind 
of version mismatch?
- Error handling in Spark. The error is too low-level to make sense of. Can it 
be caught in Spark and dealt with properly?
- The cluster doesn't recover or terminate cleanly. It essentially just hangs. 
EMR doesn't terminate it either.
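
To help narrow down the version-mismatch question, here is a minimal sketch that compares the pyarrow and pandas versions seen by the driver with those seen inside the Python workers. The helper names (package_versions, find_mismatches, collect_executor_versions) are hypothetical and not part of Spark. The probe is not guaranteed to land on every executor, so it can only show a mismatch, not rule one out.

```python
import importlib


def package_versions(names=("pyarrow", "pandas")):
    """Return {package: version or None} for the current Python process."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            found[name] = None  # package is not importable in this process
    return found


def find_mismatches(driver, executors):
    """List (host, package, driver_version, executor_version) for each difference."""
    mismatches = []
    for host in sorted(executors):
        for package, driver_version in sorted(driver.items()):
            executor_version = executors[host].get(package)
            if executor_version != driver_version:
                mismatches.append((host, package, driver_version, executor_version))
    return mismatches


def collect_executor_versions(spark, num_tasks=210):
    """Run package_versions() inside Python workers (needs a live SparkSession).

    num_tasks defaults to the reporter's spark.default.parallelism; that choice
    is an assumption made here to spread the probe over the executors.
    """
    import socket

    def probe(_partition):
        yield (socket.gethostname(), tuple(sorted(package_versions().items())))

    rdd = spark.sparkContext.parallelize(range(num_tasks), num_tasks)
    return {host: dict(items) for host, items in rdd.mapPartitions(probe).distinct().collect()}
```

On a running cluster this could be used as find_mismatches(package_versions(), collect_executor_versions(spark)). An empty list means that the workers which answered the probe see the same versions as the driver.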

Stack traces:

```
23/07/05 22:43:47 WARN TaskSetManager: Lost task 1.0 in stage 81.0 (TID 2761) (ip-10-2-250-114.awsinternal.audiomack.com executor 2): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:592)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:574)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:763)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:505)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:955)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at org.apache.spark.sql.execution.AbstractUnsafeExternalRowSorter.sort(AbstractUnsafeExternalRowSorter.java:50)
    at org.apache.spark.sql.execution.SortExecBase.$anonfun$doExecute$1(SortExec.scala:346)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:138)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:748)
    ... 26 more

23/07/05 22:43:47 INFO TaskSetManager: Starting task 1.1 in stage 81.0 (TID 2763) (ip-10-2-250-114.awsinternal.audiomack.com, executor 2, partition 1, NODE_LOCAL, 5020 bytes) taskResourceAssignments Map()
23/07/05 23:30:17 INFO TaskSetManager: Finished task 2.0 in stage 81.0 (TID 2762) in 8603522 ms on ip-10-2-250-114.awsinternal.audiomack.com (executor 2) (1/3)
23/07/05 23:39:09 INFO TaskSetManager: Finished task 0.0 in stage 81.0 (TID 2760) in 9135125 ms on ip-10-2-250-114.awsinternal.audiomack.com (executor 2) (2/3)
23/07/06 00:04:41 WARN TaskSetManager: Lost task 1.1 in stage 81.0 (TID 2763) (ip-10-2-250-114.awsinternal.audiomack.com executor 2): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:592)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:574)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:763)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:505)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:955)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at org.apache.spark.sql.execution.AbstractUnsafeExternalRowSorter.sort(AbstractUnsafeExternalRowSorter.java:50)
    at org.apache.spark.sql.execution.SortExecBase.$anonfun$doExecute$1(SortExec.scala:346)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:138)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:748)
    ... 26 more
```
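
The log above shows attempts 0 and 1 of task 1 in stage 81 being lost. As a rough aid for reading longer driver logs, the sketch below pulls out the "Lost task" lines and counts the failed attempts per task. The function names are hypothetical. The retry budget assumes Spark's default spark.task.maxFailures of 4, which the configuration listed in this report does not appear to override.

```python
import re
from collections import defaultdict

# Matches the TaskSetManager wording seen in the log, e.g.
# "Lost task 1.0 in stage 81.0 (TID 2761)"
LOST_TASK = re.compile(r"Lost task (\d+)\.(\d+) in stage (\d+)\.(\d+) \(TID (\d+)\)")


def lost_attempts(log_text):
    """Map (stage, task index) to the list of attempt numbers that were lost."""
    failures = defaultdict(list)
    for match in LOST_TASK.finditer(log_text):
        task, attempt, stage, _stage_attempt, _tid = map(int, match.groups())
        failures[(stage, task)].append(attempt)
    return dict(failures)


def remaining_attempts(attempts, max_failures=4):
    """Attempts left before Spark would abort the job (assumed default of 4)."""
    return max(0, max_failures - len(attempts))


# Abbreviated lines taken from the log in this report.
sample = """
23/07/05 22:43:47 WARN TaskSetManager: Lost task 1.0 in stage 81.0 (TID 2761)
23/07/05 22:43:47 INFO TaskSetManager: Starting task 1.1 in stage 81.0 (TID 2763)
23/07/06 00:04:41 WARN TaskSetManager: Lost task 1.1 in stage 81.0 (TID 2763)
"""

failures = lost_attempts(sample)
for (stage, task), attempts in failures.items():
    print(f"stage {stage} task {task}: lost attempts {attempts}, "
          f"{remaining_attempts(attempts)} attempt(s) left")
```

If that assumption holds, two lost attempts would not yet abort the job, so the apparent hang could simply be further long-running retries of the same task. That is a guess from the log excerpt, not a confirmed diagnosis.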



