[ https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641054#comment-16641054 ]
Yan Zhitao commented on SPARK-20415:
------------------------------------

I have a similar issue, but the thread dump differs slightly. My thread dump is as follows:

"Executor task launch worker for task 3806" #54 daemon prio=5 os_prio=0 tid=0x0000000001230800 nid=0x1fc runnable [0x00007fba0e600000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.StringCoding.encode(StringCoding.java:364)
	at java.lang.String.getBytes(String.java:941)
	at org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:109)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

The thread keeps running and the Java process uses one core at 100% without stopping.

> SPARK job hangs while writing DataFrame to HDFS
> -----------------------------------------------
>
>                 Key: SPARK-20415
>                 URL: https://issues.apache.org/jira/browse/SPARK-20415
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, YARN
>    Affects Versions: 2.1.0
>        Environment: EMR 5.4.0
>            Reporter: P K
>            Priority: Major
>
> We are in the POC phase with Spark. One of the steps reads compressed JSON files that come from sources, "explodes" them into tabular format, and then writes them to HDFS. This worked for about three weeks until, a few days ago, for a particular dataset, the writer just hung. I logged in to the worker machines and saw this stack trace:
> "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 tid=0x00007f6210352800 nid=0x4542 runnable [0x00007f61f52b3000]
>    java.lang.Thread.State: RUNNABLE
> 	at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
> 	at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown Source)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
> 	at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
> 	at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
> 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> 	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> 	at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:99)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> The last messages ever printed in stderr before the hang are:
> 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has no missing parents
> 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 170.5 KB, free 2.2 GB)
> 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 65.2 KB, free 2.2 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xxx.xx:37907 (size: 65.2 KB, free: 2.2 GB)
> 17/04/18 01:41:14 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:996
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting 9 missing tasks from ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO YarnScheduler: Adding task set 4.0 with 9 tasks
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 72, xxx.xxx.xx.xx.xx, executor 12, partition 1, NODE_LOCAL, 8184 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 73, xxx.xxx.xx.xx.xx, executor 13, partition 0, NODE_LOCAL, 7967 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 74, xxx.xxx.xx.xx.xx, executor 14, partition 2, NODE_LOCAL, 8181 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 6.0 in stage 4.0 (TID 75, xxx.xxx.xx.xx.xx, executor 16, partition 6, NODE_LOCAL, 8400 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 7.0 in stage 4.0 (TID 76, xxx.xxx.xx.xx.xx, executor 10, partition 7, NODE_LOCAL, 8398 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 3.0 in stage 4.0 (TID 77, xxx.xxx.xx.xx.xx, executor 11, partition 3, NODE_LOCAL, 8182 bytes)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO ExecutorAllocationManager: Requesting 9 new executors because tasks are backlogged (new desired total will be 9)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:17 INFO TaskSetManager: Starting task 4.0 in stage 4.0 (TID 78, xxx.xxx.xx.xx.xx, executor 15, partition 4, RACK_LOCAL, 8400 bytes)
> 17/04/18 01:41:17 INFO TaskSetManager: Starting task 5.0 in stage 4.0 (TID 79, xxx.xxx.xx.xx.xx, executor 9, partition 5, RACK_LOCAL, 8400 bytes)
> 17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 28.0 KB, free: 4.0 GB)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
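[Editor's note, not part of the original thread] Both reports diagnose the hang from jstack-style thread dumps: a RUNNABLE executor task-launch thread spinning at 100% CPU inside generated code under FileFormatWriter. As a minimal sketch (using only the Python standard library, with a hypothetical helper name), a dump like the ones above can be scanned mechanically for such threads instead of reading it by eye:

```python
import re

def busy_executor_threads(dump: str):
    """Return (thread_name, top_frame) pairs for RUNNABLE
    'Executor task launch worker' threads in a jstack-style dump."""
    results = []
    # jstack separates per-thread sections with blank lines.
    for block in re.split(r"\n\s*\n", dump):
        lines = [l.strip() for l in block.splitlines() if l.strip()]
        if not lines:
            continue
        header = lines[0]
        if not header.startswith('"Executor task launch worker'):
            continue
        if not any("java.lang.Thread.State: RUNNABLE" in l for l in lines):
            continue
        # Frame lines look like: "at pkg.Class.method(File.java:123)"
        frames = [l[len("at "):] for l in lines if l.startswith("at ")]
        if frames:
            results.append((header.split('"')[1], frames[0]))
    return results

# Excerpt of the dump from the comment above.
dump = '''"Executor task launch worker for task 3806" #54 daemon prio=5 os_prio=0 tid=0x0000000001230800 nid=0x1fc runnable [0x00007fba0e600000]
   java.lang.Thread.State: RUNNABLE
\tat java.lang.StringCoding.encode(StringCoding.java:364)
\tat java.lang.String.getBytes(String.java:941)
\tat org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:109)'''

for name, frame in busy_executor_threads(dump):
    print(name, "->", frame)
# Prints the thread name and its topmost frame, the spot where it is spinning.
```

Reporting only the topmost frame is a simplification; in practice one would diff several dumps taken seconds apart to confirm the thread is stuck rather than merely busy.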