Hi Pei-Lun, I have the same problem here. The issue is SPARK-2228; someone has also posted a pull request for it, but it only eliminates the exception, not the side effects.
I think the problem may be due to the hard-coded private val EVENT_QUEUE_CAPACITY = 10000 in core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala. When the event queue is full, the bus starts dropping events, and the "key not found" error can then occur because the dropped events are never delivered to the listeners. A sketch of that failure mode follows the quoted message below. Don't know if that helps.

On Jun 26, 2014, at 6:41 AM, Pei-Lun Lee <pl...@appier.com> wrote:

> Hi,
>
> We have a long-running Spark application on a Spark 1.0 standalone server,
> and after it runs for several hours the following exception shows up:
>
> 14/06/25 23:13:08 ERROR LiveListenerBus: Listener JobProgressListener threw an exception
> java.util.NoSuchElementException: key not found: 6375
>     at scala.collection.MapLike$class.default(MapLike.scala:228)
>     at scala.collection.AbstractMap.default(Map.scala:58)
>     at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>     at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:78)
>     at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>     at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
>     at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
>     at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
>     at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
>     at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
>     at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>     at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
>     at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>     at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
>     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
>     at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
>
> And then the web UI (driver:4040) starts showing weird results (see attached screenshots):
> 1. negative active task counts
> 2. completed stages still listed in the active section, or shown with incomplete tasks
> 3. unpersisted RDDs still on the storage page, with fraction cached < 100%
>
> Eventually the application crashes, but this is usually the first exception that shows up.
> Any idea how to fix it?
>
> --
> Pei-Lun Lee
>
> <Screen Shot 2014-06-26 at 12.52.38 PM.png> <Screen Shot 2014-06-26 at 12.52.21 PM.png> <Screen Shot 2014-06-26 at 12.52.07 PM.png> <Screen Shot 2014-06-26 at 12.51.15 PM.png>
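To make the failure mode concrete, here is a minimal, self-contained Scala sketch. It is not the actual Spark code (DroppedEventDemo, StageSubmitted, handle, etc. are illustrative names); it just shows how a bounded queue that drops events on overflow can make a downstream listener throw exactly this kind of NoSuchElementException when a completion event survives but its matching submission was dropped:

// Illustrative sketch, not the real LiveListenerBus code: a bounded queue
// that drops events when full, plus a listener that assumes every
// StageCompleted was preceded by a StageSubmitted.
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable

sealed trait Event
case class StageSubmitted(stageId: Int) extends Event
case class StageCompleted(stageId: Int) extends Event

object DroppedEventDemo {
  private val EVENT_QUEUE_CAPACITY = 1 // tiny on purpose, to force a drop
  private val queue = new LinkedBlockingQueue[Event](EVENT_QUEUE_CAPACITY)
  private val stageIdToInfo = mutable.HashMap.empty[Int, String]

  // offer() returns false when the queue is full, so the event is
  // silently lost instead of blocking the producer.
  def post(e: Event): Unit =
    if (!queue.offer(e)) println(s"Queue full, dropping $e")

  def handle(e: Event): Unit = e match {
    case StageSubmitted(id) => stageIdToInfo(id) = s"stage $id"
    case StageCompleted(id) =>
      // HashMap.apply throws NoSuchElementException if the submit was dropped.
      println(s"completed: ${stageIdToInfo(id)}")
  }

  // Drain the queue on the consumer side, like the listener bus thread.
  def drain(): Unit = {
    var e = queue.poll()
    while (e != null) { handle(e); e = queue.poll() }
  }

  def main(args: Array[String]): Unit = {
    post(StageSubmitted(1))
    post(StageSubmitted(2)) // queue already holds one event -> dropped
    drain()                 // only stage 1 gets registered
    post(StageCompleted(2)) // accepted, but its submission never arrived
    drain()                 // java.util.NoSuchElementException: key not found: 2
  }
}

If that is indeed the cause, just raising EVENT_QUEUE_CAPACITY would only make drops rarer; to remove the side effects, listeners would also have to tolerate missing keys (e.g. getOrElse instead of apply on the map).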