What does the jstack thread dump of the stalled executor show?

Yours respectfully,
Xuefeng Wu 吴雪峰
> On Feb 6, 2015, at 10:20 AM, Michael Albert <m_albert...@yahoo.com.INVALID> wrote:
>
> My apologies for following up my own post, but I thought this might be of interest.
>
> I terminated the Java process corresponding to the executor which had opened the stderr file mentioned below (kill <pid>).
> Then my Spark job completed without error (it was actually almost finished).
>
> Now I am completely confused :-).
>
> Thanks!
> -Mike
>
>
> From: Michael Albert <m_albert...@yahoo.com.INVALID>
> To: "user@spark.apache.org" <user@spark.apache.org>
> Sent: Thursday, February 5, 2015 9:04 PM
> Subject: Spark stalls or hangs: is this a clue? remote fetches seem to never return?
>
> Greetings!
>
> Again, thanks to all who have given suggestions.
> I am still trying to diagnose a problem where I have processes that run for one or several hours but intermittently stall or hang.
> By "stall" I mean that there is no CPU usage on the workers or the driver, nor network activity, nor do I see disk activity.
> It just hangs.
>
> Using the Application Master to find which workers still had active tasks, I then went to that machine and looked in the user logs.
> One of the user logs' "stderr" files ends with "Started 50 remote fetches...."
> Should there be a message saying that the fetch was completed?
> Any suggestions as to how I might diagnose why the fetch was not completed?
>
> Thanks!
> -Mike
>
> Here is the last part of the log:
> 15/02/06 01:33:46 INFO storage.MemoryStore: ensureFreeSpace(5368) called with curMem=875861, maxMem=2315649024
> 15/02/06 01:33:46 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 5.2 KB, free 2.2 GB)
> 15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 5, fetching them
> 15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-171-0-208.ec2.internal:44124/user/MapOutputTracker#-878402310]
> 15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 5, fetching them
> 15/02/06 01:33:46 INFO spark.MapOutputTrackerWorker: Got the output locations
> 15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Getting 300 non-empty blocks out of 300 blocks
> 15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Getting 300 non-empty blocks out of 300 blocks
> 15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Started 50 remote fetches in 47 ms
> 15/02/06 01:33:46 INFO storage.ShuffleBlockFetcherIterator: Started 50 remote fetches in 48 ms
> It's been like that for half an hour.
>
> Thanks!
> -Mike
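
A possible way to spot which executor has gone quiet before taking a thread dump of it: the quoted log stops at "Started 50 remote fetches" and then stays silent, so one could measure how long each executor's stderr has been silent. The sketch below is only an illustration, not part of Spark. The function names are hypothetical, the timestamp layout is inferred from the log lines quoted above, and it assumes the clock used for `now` is in the same time zone as the executor log.

```python
import re
from datetime import datetime

# Timestamp layout inferred from the quoted log, e.g. "15/02/06 01:33:46".
TS_FORMAT = "%y/%m/%d %H:%M:%S"
TS_PATTERN = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def last_log_entry(lines):
    """Return (timestamp, text) for the last line that starts with a timestamp, or None."""
    last = None
    for line in lines:
        match = TS_PATTERN.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), TS_FORMAT)
            last = (stamp, line.rstrip("\n"))
    return last


def seconds_silent(lines, now):
    """Seconds between the last timestamped line and `now`; None if no line has a timestamp."""
    entry = last_log_entry(lines)
    if entry is None:
        return None
    return (now - entry[0]).total_seconds()
```

To use it, one could read an executor's stderr file with `open(path).readlines()` and pass `datetime.now()` as `now`. An executor whose last line is a "Started ... remote fetches" message and whose silence is much longer than a normal task would be a reasonable candidate for `jstack <pid>`.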