advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

Michael Albert Tue, 03 Feb 2015 16:20:07 -0800

Greetings!
First, my sincere thanks to all who have given me advice. Following previous 
discussion, I've rearranged my code to try to keep the partitions to more 
manageable sizes. Thanks to all who commented.
At the moment, the input set I'm trying to work with is about 90GB (avro 
parquet format).
When I run on a reasonable chunk of the data (say half) things work reasonably.
On the full data, the Spark process stalls. That is, for about 1.5 hours out of 
a 3.5 hour run, I see no activity: no CPU usage, no error message, no network 
activity. It just seems to sit there. The messages bracketing the stall are 
shown below.
Any advice on how to diagnose this? I don't get any error messages. The Spark 
UI says that it is running a stage, but it makes no discernible progress. 
Ganglia shows no CPU usage or network activity. When I shell into the worker 
nodes there are no filled disks or other obvious problems.
How can I discern what Spark is waiting for?
The only weird thing seen, other than the stall, is that the YARN logs on the 
workers have lines with messages like this:
2015-02-03 22:59:58,890 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 13158 for container-id container_1422834185427_0083_01_000021: 7.1 GB of 8.5 GB physical memory used; 7.6 GB of 42.5 GB virtual memory used
It's rather strange that it mentions 42.5 GB of virtual memory.  The machines 
are EMR machines with 32 GB of physical memory and, as far as I can determine, 
no swap space.
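One possible reading, offered as an assumption rather than a confirmed 
diagnosis: in this YARN message the figure after "of" is the container's limit, 
not the machine's memory, and YARN derives the virtual limit by multiplying the 
physical limit by yarn.nodemanager.vmem-pmem-ratio. The short Python sketch 
below only checks the arithmetic on the quoted line. The helper name 
implied_vmem_ratio is made up for illustration, and that this cluster is 
configured with a ratio of 5 is inferred from the numbers, not verified.

```python
import re

# The memory message quoted above (wrapped here for readability).
LINE = (
    "Memory usage of ProcessTree 13158 for container-id "
    "container_1422834185427_0083_01_000021: 7.1 GB of 8.5 GB physical "
    "memory used; 7.6 GB of 42.5 GB virtual memory used"
)

PATTERN = re.compile(
    r"([\d.]+) GB of ([\d.]+) GB physical memory used; "
    r"([\d.]+) GB of ([\d.]+) GB virtual memory used"
)


def implied_vmem_ratio(line):
    """Return (physical limit, virtual limit, virtual/physical ratio) in GB.

    Hypothetical helper: it only parses the message text and divides.
    """
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a ContainersMonitorImpl memory line")
    _, pmem_limit, _, vmem_limit = (float(g) for g in match.groups())
    return pmem_limit, vmem_limit, vmem_limit / pmem_limit


pmem, vmem, ratio = implied_vmem_ratio(LINE)
print(f"physical limit {pmem} GB, virtual limit {vmem} GB, ratio {ratio:.1f}")
```

If that reading is right, 42.5 GB is a ceiling of 5 x 8.5 GB rather than memory 
actually in use, so it would not by itself point to swapping.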

Any advice is welcome.
Thanks!
Sincerely, Mike Albert
Before the stall:
15/02/03 21:45:28 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 5.0, whose tasks have all completed, from pool
15/02/03 21:45:28 INFO scheduler.DAGScheduler: Stage 5 (mapPartitionsWithIndex at Transposer.scala:147) finished in 4880.317 s
15/02/03 21:45:28 INFO scheduler.DAGScheduler: looking for newly runnable stages
15/02/03 21:45:28 INFO scheduler.DAGScheduler: running: Set(Stage 3)
15/02/03 21:45:28 INFO scheduler.DAGScheduler: waiting: Set(Stage 6, Stage 7, Stage 8)
15/02/03 21:45:28 INFO scheduler.DAGScheduler: failed: Set()
15/02/03 21:45:28 INFO scheduler.DAGScheduler: Missing parents for Stage 6: List(Stage 3)
15/02/03 21:45:28 INFO scheduler.DAGScheduler: Missing parents for Stage 7: List(Stage 6)
15/02/03 21:45:28 INFO scheduler.DAGScheduler: Missing parents for Stage 8: List(Stage 7)

At this point, I see no activity for 1.5 hours except for this (XXX for the IP address):
15/02/03 22:13:24 INFO util.AkkaUtils: Connecting to ExecutorActor: akka.tcp://sparkExecutor@ip-XXX.ec2.internal:36301/user/ExecutorActor

Then finally it started again:
15/02/03 23:31:34 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 3.0 (TID 7301) in 7208259 ms on ip-10-171-0-124.ec2.internal (3/4)
15/02/03 23:31:34 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 7300) in 7208503 ms on ip-10-171-0-128.ec2.internal (4/4)
15/02/03 23:31:34 INFO scheduler.DAGScheduler: Stage 3 (mapPartitions at Transposer.scala:211) finished in 7209.534 s
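
One way to pin down exactly when a log goes quiet is to scan it for the largest 
gaps between consecutive timestamps. The Python sketch below is a minimal 
illustration, assuming every line of interest starts with the "yy/MM/dd 
HH:mm:ss" stamp seen in the driver messages above. The function names parse_ts 
and largest_gaps are made up for this example, and the sample is a few lines 
copied from this post rather than a full log.

```python
from datetime import datetime

# Driver log lines copied from the messages above; a real run would read
# the whole driver (or executor) log file instead.
SAMPLE = [
    "15/02/03 21:45:28 INFO scheduler.DAGScheduler: running: Set(Stage 3)",
    "15/02/03 21:45:28 INFO scheduler.DAGScheduler: failed: Set()",
    "15/02/03 22:13:24 INFO util.AkkaUtils: Connecting to ExecutorActor: "
    "akka.tcp://sparkExecutor@ip-XXX.ec2.internal:36301/user/ExecutorActor",
    "15/02/03 23:31:34 INFO scheduler.DAGScheduler: Stage 3 (mapPartitions at "
    "Transposer.scala:211) finished in 7209.534 s",
]


def parse_ts(line):
    """Parse the leading 'yy/MM/dd HH:mm:ss' stamp; return None if absent."""
    try:
        return datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")
    except ValueError:
        return None


def largest_gaps(lines, top=3):
    """Return the `top` longest silences as (gap, line before, line after)."""
    stamped = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:  # skip continuation lines without a timestamp
            stamped.append((ts, line.rstrip()))
    gaps = [
        (cur_ts - prev_ts, prev_line, cur_line)
        for (prev_ts, prev_line), (cur_ts, cur_line) in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


gaps = largest_gaps(SAMPLE, top=2)
for gap, before, after in gaps:
    print(f"{gap} of silence")
    print(f"  last line before: {before[:60]}")
    print(f"  first line after: {after[:60]}")
```

Running the same scan over each executor's log might show whether the two 
long-running tasks of stage 3 were themselves silent during the stall or still 
logging, which would narrow down where to look next.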
