[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126022#comment-16126022
 ] 

Jason Dunkelberger commented on SPARK-18838:
--------------------------------------------

It's been about a week now, and we've had 100% completed runs since the 
blocking change. We've forked Spark for the moment with the hack mashed on top 
of 2.2.0. Here's the diff: 
https://github.com/apache/spark/compare/v2.2.0...allenai:v2.2.0-ai2-SNAPSHOT#diff-ca0fe05a42fd5edcab8a1bdaa8e58db9

To be clear I don't think I've actually fixed anything specific. I've just 
changed the possible failures away from whatever leaves it hanging. [~irashid] 
I'll look into what you suggest. For now the goal was stability, which we've 
achieved with the naive change. I'm pretty confident that total performance has 
gone down, but again, that's still better than hanging altogether.

One other thought: I looked at the subclasses of SparkListener, which all go 
through LiveListenerBus (?). These seem pretty critical: 
ExecutorAllocationListener, BlockStatusListener/StorageListener (?). I didn't 
say so before, but we run on EMR via YARN, which may be relevant.

> High latency of event processing for large jobs
> -----------------------------------------------
>
>                 Key: SPARK-18838
>                 URL: https://issues.apache.org/jira/browse/SPARK-18838
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>         Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, such as `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on `ListenerBus` events, and this delay might 
> hurt job performance significantly or even fail the job. For example, a 
> significant delay in receiving `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` to mistakenly remove an executor which is 
> not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single-threaded processor often becomes the bottleneck for large jobs.
> Also, if one of the listeners is very slow, all the listeners pay the 
> price of the delay incurred by the slow listener. In addition, a slow 
> listener can cause events to be dropped from the event queue, which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of events 
> per listener. The queue used for each executor service will be bounded to 
> guarantee that memory does not grow indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint.
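
For illustration only, the proposal quoted above could be sketched roughly as 
follows. This is not Spark's code and not the patch attached to this issue: 
the class `PerListenerBus` and its methods `addListener`, `post`, 
`droppedCount` and `shutdown` are hypothetical names, and it uses plain 
`java.util.concurrent` rather than any Spark API. Dropping an event when a 
listener's queue is full is also an assumption; the description only says the 
queue is bounded.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical sketch: one bounded, single-threaded executor per listener. */
public final class PerListenerBus<E> {

    /** Pairs a listener with the executor that runs its callbacks. */
    private static final class Slot<E> {
        final Consumer<E> listener;
        final ThreadPoolExecutor executor;

        Slot(Consumer<E> listener, ThreadPoolExecutor executor) {
            this.listener = listener;
            this.executor = executor;
        }
    }

    private final List<Slot<E>> slots = new CopyOnWriteArrayList<>();
    private final AtomicLong dropped = new AtomicLong();
    private final int queueCapacity;

    public PerListenerBus(int queueCapacity) {
        this.queueCapacity = queueCapacity;
    }

    public void addListener(Consumer<E> listener) {
        // Exactly one worker thread, so this listener sees events in the
        // order they were submitted. The ArrayBlockingQueue bounds memory,
        // and AbortPolicy rejects new work once the queue is full.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
        slots.add(new Slot<>(listener, executor));
    }

    /** Submits the event to every listener's own executor. */
    public void post(E event) {
        for (Slot<E> slot : slots) {
            try {
                slot.executor.execute(() -> slot.listener.accept(event));
            } catch (RejectedExecutionException e) {
                // Assumption: a full queue drops the event for this listener
                // only; other listeners are unaffected.
                dropped.incrementAndGet();
            }
        }
    }

    public long droppedCount() {
        return dropped.get();
    }

    /** Stops accepting events and waits briefly for queued ones to finish. */
    public void shutdown() {
        for (Slot<E> slot : slots) {
            slot.executor.shutdown();
        }
        try {
            for (Slot<E> slot : slots) {
                slot.executor.awaitTermination(10, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In this sketch a slow listener only fills its own queue, so it no longer 
delays the others. The per-listener ordering holds as long as `post` is called 
from one thread at a time. The memory cost mentioned in the description shows 
up as one `ArrayBlockingQueue` of `queueCapacity` slots per listener.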



