Greetings!
Again, thanks to all who have given suggestions. I am still trying to diagnose a 
problem where I have processes that run for one or several hours but 
intermittently stall or hang. By "stall" I mean that there is no CPU usage on 
the workers or the driver, no network activity, and no disk activity. It 
just hangs.
Using the Application Master to find which workers still had active tasks, I 
then went to that machine and looked in the user logs. In one of the "stderr" 
files under the user logs, the output ends with "Started 50 remote fetches...." 
Should there be a message saying that the fetch was completed? Any suggestions 
as to how I might diagnose why the fetch was not completed?
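For context, the hang happens in the middle of a shuffle read, so the kind of 
code involved is a wide operation like the sketch below. This is only a 
placeholder to show where the "Started 50 remote fetches" messages come from; 
the names, path, and partition count are made up and it is not my actual job:

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch"))

    // Placeholder input path -- my real job reads far more data than this.
    val pairs = sc.textFile("hdfs:///some/input")
      .map(line => (line.split("\t")(0), 1L))

    // A wide transformation like reduceByKey forces a shuffle: each reduce
    // task asks the MapOutputTracker for the map output locations and then
    // ShuffleBlockFetcherIterator pulls the blocks from the other executors,
    // which is where the "Started 50 remote fetches" log lines come from.
    // 300 partitions here only mirrors the "300 non-empty blocks" in my log.
    val counts = pairs.reduceByKey(_ + _, 300)

    counts.count()  // the action that runs the stage that stalls
    sc.stop()
  }
}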
Thanks! -Mike
Here is the last part of the log:

15/02/06 01:33:46 INFO storage.MemoryStore: ensureFreeSpace(5368) called with curMem=875861, maxMem=2315649024
15/02/06 01:33:46 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 5.2 KB, free 2.2 GB)
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 5, fetching them
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-171-0-208.ec2.internal:44124/user/MapOutputTracker#-878402310]
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 5, fetching them
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Got the output locations
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Getting 300 non-empty blocks out of 300 blocks
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Getting 300 non-empty blocks out of 300 blocks
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Started 50 remote fetches in 47 ms
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Started 50 remote fetches in 48 ms

It's been like that for half an hour.
Thanks! -Mike
