[ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384331#comment-15384331
 ] 

Dhruve Ashar commented on SPARK-15703:
--------------------------------------

Here are some of the findings: 

LiveListenerBus replaces the AsynchronousListenerBus. With dynamic allocation 
enabled and the maximum number of executors set to ~2000, I consistently see 
a large number of events being dropped for an input data size of 300GB. These 
events are dropped (which is what breaks the UI counts) because the event 
queue is not being drained fast enough. 
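The drop behavior described above can be sketched as follows. This is a minimal, hypothetical illustration (not Spark's actual code): a bounded queue whose producer uses a non-blocking offer, so that when the consumer stalls, events are counted as dropped rather than blocking the poster.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch, assuming a bounded listener event queue: when the
// dispatcher cannot drain fast enough and the queue fills up, new
// events are dropped instead of blocking the producer.
public class BoundedEventBus {
    private final ArrayBlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    public BoundedEventBus(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // offer() returns false instead of blocking when the queue is full,
    // so the thread posting listener events never stalls.
    public boolean post(String event) {
        if (!queue.offer(event)) {
            dropped.incrementAndGet();
            return false;
        }
        return true;
    }

    public long droppedCount() { return dropped.get(); }

    public String take() throws InterruptedException { return queue.take(); }

    public static void main(String[] args) {
        BoundedEventBus bus = new BoundedEventBus(2);
        bus.post("e1");
        bus.post("e2");
        bus.post("e3"); // queue full -> dropped
        System.out.println(bus.droppedCount()); // prints 1
    }
}
```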

From the thread dumps, the event queue dispatcher freezes up momentarily; 
during that window the queue fills up in a short span and messages are 
dropped, and once the dispatcher is active again, the queue clears up fast. 
The contention arises in ExecutorAllocationManager because of its 
synchronization, and the dispatcher thread waits for the lock to be 
released. See attached dumps.

The remedy for this is twofold:
1 - Decouple event dispatch from the event handling done for dynamic 
executor allocation. 
2 - Make the listener event queue size configurable. For users who run with 
smaller heartbeat intervals, the number of events in flight would be large, 
and it would be helpful to have the flexibility to tune this.
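Remedy (1) above could be sketched as follows. This is a hypothetical illustration, not Spark's actual implementation: instead of the dispatcher thread calling into the allocation logic directly (and blocking on its lock), the listener hands the event to its own single-threaded worker, so the dispatcher can keep draining the queue.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of decoupling dispatch from handling: the slow,
// lock-protected allocation bookkeeping runs on a dedicated worker
// thread instead of the shared dispatcher thread.
public class DecoupledAllocationListener {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Stand-in for ExecutorAllocationManager's internal monitor.
    private final Object allocationLock = new Object();
    private final AtomicLong processed = new AtomicLong();

    // Called from the dispatcher thread; returns immediately instead of
    // waiting on allocationLock.
    public void onEvent(String event) {
        worker.submit(() -> {
            synchronized (allocationLock) {
                // Slow allocation bookkeeping happens here, off the
                // dispatcher thread.
                processed.incrementAndGet();
            }
        });
    }

    public long processedCount() { return processed.get(); }

    public void shutdown() throws InterruptedException {
        worker.shutdown();
        worker.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

For remedy (2), the eventual fix exposed the queue size as a Spark configuration property; the exact property name should be checked against the Spark version in use.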






> Spark UI doesn't show all tasks as completed when it should
> -----------------------------------------------------------
>
>                 Key: SPARK-15703
>                 URL: https://issues.apache.org/jira/browse/SPARK-15703
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.0.0
>            Reporter: Thomas Graves
>            Priority: Critical
>         Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, 
> spark-dynamic-executor-allocation.png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 100000 tasks but Detail stage page says it completed 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages page for all jobs lists only 89519/100000 tasks finished, but 
> the stage is completed.  The metrics for shuffle write and input are also 
> incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (100000/100000)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
