My apologies for following up my own post, but I thought this might be of interest. I terminated the java process corresponding to the executor that had opened the stderr file mentioned below (kill <pid>). Then my Spark job completed without error (it was actually almost finished). Now I am completely confused :-).

Thanks!
-Mike
From: Michael Albert <m_albert...@yahoo.com.INVALID>
To: "user@spark.apache.org" <user@spark.apache.org>
Sent: Thursday, February 5, 2015 9:04 PM
Subject: Spark stalls or hangs: is this a clue? remote fetches seem to never return?

Greetings!

Again, thanks to all who have given suggestions. I am still trying to diagnose a problem where I have processes that run for one or several hours but intermittently stall or hang. By "stall" I mean that there is no CPU usage on the workers or the driver, no network activity, and no disk activity that I can see. It just hangs.

Using the Application Master to find which workers still had active tasks, I then went to one such machine and looked in the user logs. In one of the user logs, the "stderr" file ends with "Started 50 remote fetches....". Should there be a message saying that the fetch was completed? Any suggestions as to how I might diagnose why the fetch was not completed?

Thanks!
-Mike

Here is the last part of the log:

15/02/06 01:33:46 INFO storage.MemoryStore: ensureFreeSpace(5368) called with curMem=875861, maxMem=2315649024
15/02/06 01:33:46 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 5.2 KB, free 2.2 GB)
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 5, fetching them
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-171-0-208.ec2.internal:44124/user/MapOutputTracker#-878402310]
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 5, fetching them
15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Got the output locations
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Getting 300 non-empty blocks out of 300 blocks
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Getting 300 non-empty blocks out of 300 blocks
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Started 50 remote fetches in 47 ms
15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Started 50 remote fetches in 48 ms

It's been like that for half an hour.

Thanks!
-Mike