[ https://issues.apache.org/jira/browse/SPARK-18881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030025#comment-16030025 ]
Mathieu D edited comment on SPARK-18881 at 5/30/17 7:52 PM:
------------------------------------------------------------

Just to mention a workaround for those experiencing the problem: try increasing {{spark.scheduler.listenerbus.eventqueue.size}} (default 10000).
This may only postpone the problem if the queue keeps filling faster than the listeners can drain it for a long time. In our case, we have bursts of activity, and raising this limit helps.

> Spark never finishes jobs and stages, JobProgressListener fails
> ---------------------------------------------------------------
>
>                 Key: SPARK-18881
>                 URL: https://issues.apache.org/jira/browse/SPARK-18881
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.2
>         Environment: yarn, deploy-mode = client
>            Reporter: Mathieu D
>
> We have a Spark application that continuously processes a lot of incoming jobs.
> Several jobs are processed in parallel, on multiple threads.
> During intensive workloads, at some point, we start to see hundreds of warnings like this:
> {code}
> 16/12/14 21:04:03 WARN JobProgressListener: Task end for unknown stage 147379
> 16/12/14 21:04:03 WARN JobProgressListener: Job completed for unknown job 64610
> 16/12/14 21:04:04 WARN JobProgressListener: Task start for unknown stage 147405
> 16/12/14 21:04:04 WARN JobProgressListener: Task end for unknown stage 147406
> 16/12/14 21:04:04 WARN JobProgressListener: Job completed for unknown job 64622
> {code}
> From that point on, the performance of the app plummets and most stages and jobs never finish. On the Spark UI, I can see figures like 13000 pending jobs.
> I can't clearly see another related exception happening before that. Maybe this one, but it concerns another listener:
> {code}
> 16/12/14 21:03:54 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
> 16/12/14 21:03:54 WARN LiveListenerBus: Dropped 1 SparkListenerEvents since Thu Jan 01 01:00:00 CET 1970
> {code}
> This is very problematic for us, since it's hard to detect and requires an app restart.
> *EDIT:*
> I confirm the sequence:
> 1- ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue
> then
> 2- JobProgressListener losing track of jobs and stages.
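
To illustrate why raising the event queue size helps with bursts but only postpones the problem under sustained overload, here is a minimal sketch of a bounded queue whose producer drops events instead of blocking. It is a toy model, not Spark's actual implementation: the function name {{count_dropped}}, the tick-based timing and all the rates are hypothetical values chosen for illustration.

```python
def count_dropped(arrivals, drain_per_tick, capacity):
    """Toy model of a bounded event queue with a non-blocking producer.

    arrivals: number of events posted at each tick (hypothetical workload).
    drain_per_tick: events the listener can process per tick (assumed constant).
    capacity: maximum queue length, standing in for the eventqueue.size setting.
    Returns the number of events dropped because the queue had no room.
    """
    queued = 0
    dropped = 0
    for posted in arrivals:
        # Producer side: accept what fits, drop the rest (no blocking).
        accepted = min(posted, capacity - queued)
        dropped += posted - accepted
        queued += accepted
        # Consumer side: the listener drains at a fixed rate.
        queued -= min(drain_per_tick, queued)
    return dropped


# A burst: 5 busy ticks followed by 20 quiet ticks.
burst = [5000] * 5 + [0] * 20
# Sustained overload: the producer outpaces the listener for 100 ticks.
sustained = [5000] * 100

burst_small = count_dropped(burst, drain_per_tick=2000, capacity=10000)
burst_large = count_dropped(burst, drain_per_tick=2000, capacity=20000)
sustained_small = count_dropped(sustained, drain_per_tick=2000, capacity=10000)
sustained_large = count_dropped(sustained, drain_per_tick=2000, capacity=20000)

print("burst, capacity 10000:", burst_small, "dropped")
print("burst, capacity 20000:", burst_large, "dropped")
print("sustained, capacity 10000:", sustained_small, "dropped")
print("sustained, capacity 20000:", sustained_large, "dropped")
```

With these made-up numbers, the larger queue absorbs the burst without dropping anything, while under sustained overload it still drops events, only starting a few ticks later.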