My apologies for following up my own post, but I thought this might be of 
interest.
I terminated the Java process corresponding to the executor which had opened 
the stderr file mentioned below (kill <pid>). Then my Spark job completed 
without error (it was actually almost finished).
Now I am completely confused :-).
Thanks!
-Mike

      From: Michael Albert <m_albert...@yahoo.com.INVALID>
 To: "user@spark.apache.org" <user@spark.apache.org> 
 Sent: Thursday, February 5, 2015 9:04 PM
 Subject: Spark stalls or hangs: is this a clue? remote fetches seem to never 
return?
   
Greetings!
Again, thanks to all who have given suggestions. I am still trying to diagnose 
a problem where I have processes that run for one or several hours but 
intermittently stall or hang. By "stall" I mean that there is no CPU usage on 
the workers or the driver, no network activity, and no disk activity that I 
can see. It just hangs.
Using the Application Master to find which workers still had active tasks, I 
then went to that machine and looked in the user logs. One of the user logs' 
"stderr" files ends with "Started 50 remote fetches....". Should there be a 
message saying that the fetch was completed? Any suggestions as to how I might 
diagnose why the fetch was not completed?
Thanks!
-Mike
Here is the last part of the log:
15/02/06 01:33:46 INFO storage.MemoryStore: ensureFreeSpace(5368) called with curMem=875861, maxMem=2315649024
15/02/06 01:33:46 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 5.2 KB, free 2.2 GB)
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 5, fetching them
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-171-0-208.ec2.internal:44124/user/MapOutputTracker#-878402310]
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 5, fetching them
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Got the output locations
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Getting 300 non-empty blocks out of 300 blocks
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Getting 300 non-empty blocks out of 300 blocks
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Started 50 remote fetches in 47 ms
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Started 50 remote fetches in 48 ms
It's been like that for half an hour.
Thanks!
-Mike
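
The manual check described above (seeing where an executor's stderr ends) could be scripted roughly as follows. This is only a sketch: the function names and the 10-minute threshold are hypothetical, and it assumes the log line layout shown in the excerpt above. It flags the symptom (a log that ends on a "Started N remote fetches" line and then goes quiet) but does not explain why the fetch never returned.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, taken from the excerpt above:
# "15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: <message>"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+) (\S+): (.*)$")
FETCH_RE = re.compile(r"Started \d+ remote fetches")


def last_log_entry(lines):
    """Return (timestamp, logger, message) for the last parseable line, or None."""
    last = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%y/%m/%d %H:%M:%S")
            last = (ts, m.group(3), m.group(4))
    return last


def looks_stalled_on_fetch(lines, now, threshold=timedelta(minutes=10)):
    """True if the log ends on a 'Started N remote fetches' line older than threshold.

    The threshold is an arbitrary illustrative value, not a Spark setting.
    """
    entry = last_log_entry(lines)
    if entry is None:
        return False
    ts, _logger, message = entry
    return bool(FETCH_RE.search(message)) and (now - ts) >= threshold
```

To use it, read the lines of an executor's stderr file and pass them in together with the current time, which must be in the same time zone as the log timestamps.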



  
