[ https://issues.apache.org/jira/browse/SPARK-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liupengcheng updated SPARK-26074: --------------------------------- Comment: was deleted (was: I will submit a PR soon) > AsyncEventQueue.stop hangs when eventQueue is full > -------------------------------------------------- > > Key: SPARK-26074 > URL: https://issues.apache.org/jira/browse/SPARK-26074 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.1.0 > Reporter: liupengcheng > Priority: Major > > In our production environment, we found a case that the Driver hangs when the > app finished and about to exit. > Detail information: > The spark-listener-group-shared Thread may exited due to some plugin > TaskFailedListener(here in our case is XGBoost),thus the eventQueue might > easily enter Full state and stay unchanged. After that, if app finished and > SparkContext.stop is called, the > Driver will hangs at the following stack: > {code:java} > "Driver" #36 prio=5 os_prio=0 tid=0x00007fb08a948800 nid=0x6edc waiting on > condition [0x00007faff5017000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00000006c718d7d8> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350) > at org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:117) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:202) > at > org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:202) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:202) > - locked <0x00000006c052f0a8> (a org.apache.spark.scheduler.LiveListenerBus) > at > org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1842) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1294) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1841){code} > Find out it's because spark put a POSION_PILL when AsyncEventQueue.stop, > however, the queue is in full state, so the put action will be blocked > forever. > spark-listener-group-shared exit message: > {code:java} > 2018-11-15,14:44:04,782 INFO org.apache.spark.scheduler.AsyncEventQueue: > Stopping listener queue shared. > java.lang.InterruptedException: ExecutorLost during XGBoost Training: > TaskKilled (killed intentionally) > at > org.apache.spark.TaskFailedListener.onTaskEnd(SparkParallelismTracker.scala:116) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:35) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:35) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:83) > at > org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:83) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:79) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:75) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1256) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:74) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org