Hi Baoxu, thanks for sharing.
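For anyone following along, the failure mode Baoxu describes below can be reproduced in miniature. This is a hypothetical Python sketch, not Spark's actual Scala code: BoundedBus, ProgressListener, and the event tuples are made-up stand-ins for LiveListenerBus's bounded event queue and JobProgressListener's stage map.

```python
# Sketch (hypothetical, not Spark code): a bounded event queue that silently
# drops events when full, feeding a listener that pairs submitted/completed
# events in a dict -- the same shape of bug as SPARK-2228.
from collections import deque

class BoundedBus:
    """Stand-in for a bus with a hard-coded queue capacity."""
    def __init__(self, capacity):
        self.queue = deque()
        self.capacity = capacity
        self.dropped = 0

    def post(self, event):
        if len(self.queue) >= self.capacity:
            self.dropped += 1   # event silently lost
            return False
        self.queue.append(event)
        return True

class ProgressListener:
    """Stand-in for a listener tracking active stages by id."""
    def __init__(self):
        self.active_stages = {}

    def on_event(self, event):
        kind, stage_id = event
        if kind == "submitted":
            self.active_stages[stage_id] = "running"
        elif kind == "completed":
            # Raises KeyError ("key not found") if the matching
            # "submitted" event was dropped by the full queue.
            del self.active_stages[stage_id]

bus = BoundedBus(capacity=2)
bus.post(("submitted", 1))
bus.post(("submitted", 2))
bus.post(("submitted", 3))          # queue full: dropped

listener = ProgressListener()
while bus.queue:
    listener.on_event(bus.queue.popleft())

# The completion for stage 3 arrives later; its submission was never seen.
try:
    listener.on_event(("completed", 3))
except KeyError as e:
    print("key not found:", e)      # prints: key not found: 3
```

Once the listener's bookkeeping is missing entries like this, the UI counters it derives (active tasks, stage state, cached fractions) go inconsistent, which would also explain the screenshots.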
2014-06-26 22:51 GMT+08:00 Baoxu Shi(Dash) <b...@nd.edu>:
> Hi Pei-Lun,
>
> I have the same problem here. The issue is SPARK-2228; someone also posted
> a pull request for it, but it only eliminates the exception, not the side
> effects.
>
> I think the problem may be due to the hard-coded
>
>     private val EVENT_QUEUE_CAPACITY = 10000
>
> in core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala.
> When the event queue is full, the system starts dropping events, which
> causes "key not found" because those events are never delivered to the
> listeners.
>
> Don't know if that helps.
>
> On Jun 26, 2014, at 6:41 AM, Pei-Lun Lee <pl...@appier.com> wrote:
>
> > Hi,
> >
> > We have a long-running Spark application on a Spark 1.0 standalone
> > server, and after it has run for several hours the following exception
> > shows up:
> >
> > 14/06/25 23:13:08 ERROR LiveListenerBus: Listener JobProgressListener threw an exception
> > java.util.NoSuchElementException: key not found: 6375
> >         at scala.collection.MapLike$class.default(MapLike.scala:228)
> >         at scala.collection.AbstractMap.default(Map.scala:58)
> >         at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> >         at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:78)
> >         at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
> >         at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
> >         at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
> >         at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
> >         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> >         at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
> >         at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
> >         at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
> >         at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
> >         at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
> >         at scala.Option.foreach(Option.scala:236)
> >         at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
> >         at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
> >         at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
> >         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
> >         at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
> >
> > And then the web UI (driver:4040) starts showing weird results
> > (see attached screenshots):
> > 1. negative active task counts
> > 2. completed stages still in the active section, or showing incomplete tasks
> > 3. unpersisted RDDs still on the storage page, with fraction cached < 100%
> >
> > Eventually the application crashes, but this is usually the first
> > exception that shows up.
> > Any idea how to fix it?
> >
> > --
> > Pei-Lun Lee
> >
> > <Screen Shot 2014-06-26 at 12.52.38 PM.png><Screen Shot 2014-06-26 at 12.52.21 PM.png><Screen Shot 2014-06-26 at 12.52.07 PM.png><Screen Shot 2014-06-26 at 12.51.15 PM.png>