P K created SPARK-20415:
---------------------------

             Summary: SPARK job hangs while writing table to HDFS
                 Key: SPARK-20415
                 URL: https://issues.apache.org/jira/browse/SPARK-20415
             Project: Spark
          Issue Type: Bug
          Components: PySpark, YARN
    Affects Versions: 2.1.0
         Environment: Latest EMR
            Reporter: P K
            Priority: Blocker


We are in POC phase with Spark. One of the Steps is reading compressed json 
files that come from sources, "explode" them into tabular format and then write 
them to HDFS. This worked for about three weeks until a few days ago, for a 
particular dataset, the writer just hangs. I logged in to the worker machines 
and see this stack trace:

"Executor task launch worker-0" #39 daemon prio=5 os_prio=0 
tid=0x00007f6210352800 nid=0x4542 runnable [0x00007f61f52b3000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
    at 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown
 Source)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
    at 
org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
    at 
org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
    at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
    at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
    at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
    at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
    at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

The last messages ever printed in stderr before the hang are:
17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at 
NativeMethodAccessorImpl.java:0)
17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 
(MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has no 
missing parents
17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in 
memory (estimated size 170.5 KB, free 2.2 GB)
17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in 
memory (estimated size 65.2 KB, free 2.2 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on 
xxx.xxx.xxx.xx:37907 (size: 65.2 KB, free: 2.2 GB)
17/04/18 01:41:14 INFO SparkContext: Created broadcast 9 from broadcast at 
DAGScheduler.scala:996
17/04/18 01:41:14 INFO DAGScheduler: Submitting 9 missing tasks from 
ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0)
17/04/18 01:41:14 INFO YarnScheduler: Adding task set 4.0 with 9 tasks
17/04/18 01:41:14 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 72, 
xxx.xxx.xx.xx.xx, executor 12, partition 1, NODE_LOCAL, 8184 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 73, 
xxx.xxx.xx.xx.xx, executor 13, partition 0, NODE_LOCAL, 7967 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 74, 
xxx.xxx.xx.xx.xx, executor 14, partition 2, NODE_LOCAL, 8181 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 6.0 in stage 4.0 (TID 75, 
xxx.xxx.xx.xx.xx, executor 16, partition 6, NODE_LOCAL, 8400 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 7.0 in stage 4.0 (TID 76, 
xxx.xxx.xx.xx.xx, executor 10, partition 7, NODE_LOCAL, 8398 bytes)
17/04/18 01:41:14 INFO TaskSetManager: Starting task 3.0 in stage 4.0 (TID 77, 
xxx.xxx.xx.xx.xx, executor 11, partition 3, NODE_LOCAL, 8182 bytes)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on 
xxx.xxx.xx.xx.xx:46030 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on 
xxx.xxx.xx.xx.xx:40494 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on 
xxx.xxx.xx.xx.xx:35861 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on 
xxx.xxx.xx.xx.xx:34157 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on 
xxx.xxx.xx.xx.xx:43202 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on 
xxx.xxx.xx.xx.xx:46053 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 
xxx.xxx.xx.xx.xx:46030 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO ExecutorAllocationManager: Requesting 9 new executors 
because tasks are backlogged (new desired total will be 9)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 
xxx.xxx.xx.xx.xx:34157 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 
xxx.xxx.xx.xx.xx:40494 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 
xxx.xxx.xx.xx.xx:35861 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 
xxx.xxx.xx.xx.xx:46053 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 
xxx.xxx.xx.xx.xx:43202 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:17 INFO TaskSetManager: Starting task 4.0 in stage 4.0 (TID 78, 
xxx.xxx.xx.xx.xx, executor 15, partition 4, RACK_LOCAL, 8400 bytes)
17/04/18 01:41:17 INFO TaskSetManager: Starting task 5.0 in stage 4.0 (TID 79, 
xxx.xxx.xx.xx.xx, executor 9, partition 5, RACK_LOCAL, 8400 bytes)
17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on 
xxx.xxx.xx.xx.xx:34045 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on 
xxx.xxx.xx.xx.xx:43887 (size: 65.2 KB, free: 4.0 GB)
17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 
xxx.xxx.xx.xx.xx:34045 (size: 28.0 KB, free: 4.0 GB)
17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 
xxx.xxx.xx.xx.xx:43887 (size: 28.0 KB, free: 4.0 GB)




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to