P K created SPARK-20415: --------------------------- Summary: SPARK job hangs while writing table to HDFS Key: SPARK-20415 URL: https://issues.apache.org/jira/browse/SPARK-20415 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 2.1.0 Environment: Latest EMR Reporter: P K Priority: Blocker
We are in POC phase with Spark. One of the Steps is reading compressed json files that come from sources, "explode" them into tabular format and then write them to HDFS. This worked for about three weeks until a few days ago, for a particular dataset, the writer just hangs. I logged in to the worker machines and see this stack trace: "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 tid=0x00007f6210352800 nid=0x4542 runnable [0x00007f61f52b3000] java.lang.Thread.State: RUNNABLE at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210) at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111) at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) The last messages ever printed in stderr before the hang are: 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at NativeMethodAccessorImpl.java:0) 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List() 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List() 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has no missing parents 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 170.5 KB, free 2.2 GB) 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 65.2 KB, free 2.2 GB) 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xxx.xx:37907 (size: 65.2 KB, free: 2.2 GB) 17/04/18 01:41:14 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:996 17/04/18 01:41:14 INFO DAGScheduler: Submitting 9 missing tasks from ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0) 17/04/18 01:41:14 INFO YarnScheduler: Adding task set 4.0 with 9 tasks 17/04/18 01:41:14 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 72, xxx.xxx.xx.xx.xx, executor 12, partition 1, NODE_LOCAL, 8184 bytes) 17/04/18 01:41:14 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 73, xxx.xxx.xx.xx.xx, executor 13, partition 0, NODE_LOCAL, 7967 bytes) 17/04/18 01:41:14 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 74, xxx.xxx.xx.xx.xx, executor 14, partition 2, NODE_LOCAL, 8181 bytes) 17/04/18 01:41:14 INFO TaskSetManager: Starting task 6.0 in stage 4.0 (TID 75, xxx.xxx.xx.xx.xx, executor 16, partition 6, NODE_LOCAL, 8400 bytes) 17/04/18 01:41:14 INFO TaskSetManager: Starting task 7.0 in stage 4.0 (TID 76, xxx.xxx.xx.xx.xx, executor 10, partition 7, NODE_LOCAL, 8398 bytes) 17/04/18 01:41:14 INFO TaskSetManager: Starting task 3.0 in stage 4.0 (TID 77, xxx.xxx.xx.xx.xx, executor 11, partition 3, NODE_LOCAL, 8182 bytes) 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 65.2 KB, free: 4.0 GB) 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 65.2 KB, free: 4.0 GB) 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 65.2 KB, free: 4.0 GB) 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 65.2 KB, free: 4.0 GB) 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 65.2 KB, free: 4.0 GB) 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 65.2 KB, free: 4.0 GB) 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 28.0 KB, free: 4.0 GB) 17/04/18 01:41:15 INFO ExecutorAllocationManager: Requesting 9 new executors because tasks are backlogged (new desired total will be 9) 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 28.0 KB, free: 4.0 GB) 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 28.0 KB, free: 4.0 GB) 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 28.0 KB, free: 4.0 GB) 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 28.0 KB, free: 4.0 GB) 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 28.0 KB, free: 4.0 GB) 17/04/18 01:41:17 INFO TaskSetManager: Starting task 4.0 in stage 4.0 (TID 78, xxx.xxx.xx.xx.xx, executor 15, partition 4, RACK_LOCAL, 8400 bytes) 17/04/18 01:41:17 INFO TaskSetManager: Starting task 5.0 in stage 4.0 (TID 79, xxx.xxx.xx.xx.xx, executor 9, partition 5, RACK_LOCAL, 8400 bytes) 17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 65.2 KB, free: 4.0 GB) 17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 65.2 KB, free: 4.0 GB) 17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 28.0 KB, free: 4.0 GB) 17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 28.0 KB, free: 4.0 GB) -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org