[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-18838:
--------------------------------
    Description: 
Currently we are observing very high event processing delay in the driver's 
`ListenerBus` for large jobs with many tasks. Many critical components of the 
scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
on the `ListenerBus` events, and this delay is causing job failures. For 
example, a significant delay in receiving the `SparkListenerTaskStart` event 
might cause `ExecutorAllocationManager` to remove an executor which is not 
idle. The event processor in `ListenerBus` is a single thread which loops 
through all the listeners for each event and processes each event 
synchronously; see 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94
 
The single-threaded processor often becomes the bottleneck for large jobs. In 
addition, if one of the listeners is very slow, all the listeners pay the 
price of the delay incurred by the slow listener. 

To solve the above problems, we plan to have a single-threaded executor 
service and a separate event queue per listener. That way we are not 
bottlenecked by the single-threaded processor, and critical listeners will not 
be penalized by slow listeners. The downside of this approach is that a 
separate event queue per listener will increase the driver memory footprint. 
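
The proposal above can be illustrated with a minimal sketch. It is not Spark's 
implementation: `PerListenerBus`, `addListener`, `post` and `stop` are 
hypothetical names chosen for illustration, and the sketch is written in Java 
against `java.util.concurrent` only. It assumes each listener can be modelled 
as a `Consumer` of events, and it uses the executor's internal work queue as 
that listener's private event queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch; none of these names exist in Spark.
final class PerListenerBus<E> {
    private final List<Consumer<E>> listeners = new ArrayList<>();
    // One single-threaded executor per listener. Its internal work queue
    // acts as that listener's own event queue. The queue is unbounded,
    // which is where the extra driver memory footprint would come from.
    private final List<ExecutorService> workers = new ArrayList<>();

    synchronized void addListener(Consumer<E> listener) {
        listeners.add(listener);
        workers.add(Executors.newSingleThreadExecutor());
    }

    // Posting only enqueues the event for every listener; it never waits
    // for any listener to finish processing.
    synchronized void post(E event) {
        for (int i = 0; i < listeners.size(); i++) {
            Consumer<E> listener = listeners.get(i);
            workers.get(i).execute(() -> listener.accept(event));
        }
    }

    // Lets every queue drain, then stops the worker threads.
    // Returns false if some listener did not finish within the timeout.
    synchronized boolean stop(long timeout, TimeUnit unit)
            throws InterruptedException {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
        boolean allDone = true;
        for (ExecutorService w : workers) {
            allDone &= w.awaitTermination(timeout, unit);
        }
        return allDone;
    }
}
```

Because each listener has its own thread and queue, events still reach a given 
listener in posting order, but a slow listener only delays itself. A real 
design would likely also need a bound on each queue and a policy for dropped 
events, which this sketch leaves out.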




  was:
Currently we are observing very high event processing delay in the driver's 
`ListenerBus` for large jobs with many tasks. Many critical components of the 
scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
on the `ListenerBus` events, and this delay is causing job failures. For 
example, a significant delay in receiving the `SparkListenerTaskStart` event 
might cause `ExecutorAllocationManager` to remove an executor which is not 
idle. The event processor in `ListenerBus` is a single thread which loops 
through all the listeners for each event and processes each event 
synchronously. The single-threaded processor often becomes the bottleneck for 
large jobs. In addition, if one of the listeners is very slow, all the 
listeners pay the price of the delay incurred by the slow listener. 

To solve the above problems, we plan to have a single-threaded executor 
service and a separate event queue per listener. That way we are not 
bottlenecked by the single-threaded processor, and critical listeners will not 
be penalized by slow listeners.





> High latency of event processing for large jobs
> -----------------------------------------------
>
>                 Key: SPARK-18838
>                 URL: https://issues.apache.org/jira/browse/SPARK-18838
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of 
> the scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, 
> depend on the `ListenerBus` events, and this delay is causing job failures. 
> For example, a significant delay in receiving the `SparkListenerTaskStart` 
> event might cause `ExecutorAllocationManager` to remove an executor which is 
> not idle. The event processor in `ListenerBus` is a single thread which 
> loops through all the listeners for each event and processes each event 
> synchronously; see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94
>  
> The single-threaded processor often becomes the bottleneck for large jobs. 
> In addition, if one of the listeners is very slow, all the listeners pay the 
> price of the delay incurred by the slow listener. 
>
> To solve the above problems, we plan to have a single-threaded executor 
> service and a separate event queue per listener. That way we are not 
> bottlenecked by the single-threaded processor, and critical listeners will 
> not be penalized by slow listeners. The downside of this approach is that a 
> separate event queue per listener will increase the driver memory footprint. 


