Oleksandr Konopko created SPARK-22213:
-----------------------------------------

             Summary: Spark to detect slow executors on nodes with problematic 
hardware
                 Key: SPARK-22213
                 URL: https://issues.apache.org/jira/browse/SPARK-22213
             Project: Spark
          Issue Type: Improvement
          Components: Scheduler
    Affects Versions: 2.0.0
         Environment: - AWS EMR clusters 
- window time is 60s
- several millions of events processed per minute
            Reporter: Oleksandr Konopko


Sometimes a newly created cluster contains 1-2 slow nodes. While the average task finishes in 5 seconds, the same task can take up to 50 seconds on a slow node. As a result, batch processing time increases by roughly 45 seconds.
To avoid this we could use the `speculation` feature, but it seems it can be improved.
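For reference, speculation today is governed by the settings in the sketch below; the values shown are purely illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

// Existing speculation knobs (illustrative values only)
val conf = new SparkConf()
  .set("spark.speculation", "true")            // re-launch slow-running tasks speculatively
  .set("spark.speculation.interval", "100ms")  // how often the scheduler checks for stragglers
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as a straggler
```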
 
- The 1st issue with `speculation` is that we do not want to enable it for all tasks, since one batch contains tens of thousands of them and spawning several thousand extra speculative copies would not be resource-efficient. I suggest creating a new parameter, `spark.speculation.mintime`, which would specify the minimal run time a task must reach before speculation is enabled for it.

- The 2nd issue is that even if Spark spawns speculative tasks only for long-running ones (longer than 10s, for example), the task on the slow node will still run for a significant time before it is killed, which still makes batch processing time longer than it should be. The solution is to enable `blacklisting` for slow nodes. With speculation and blacklisting combined, only the first 1-2 batches would take longer than expected; after the faulty node is blacklisted, batch processing time returns to normal (see the configuration sketch below).
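A minimal sketch of the combined configuration, assuming the proposed `spark.speculation.mintime` parameter from this ticket (it does not exist yet) together with the node blacklisting available in recent Spark releases:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")
  // Proposed in this ticket, not yet implemented: speculate only on tasks
  // that have already been running longer than this threshold.
  .set("spark.speculation.mintime", "10s")
  // Existing blacklisting support (Spark 2.1+), so a consistently faulty
  // node stops receiving new tasks after the first batches.
  .set("spark.blacklist.enabled", "true")
```

The idea is that speculation absorbs the slowdown in the first 1-2 batches, and blacklisting then removes the faulty node from scheduling so later batches run at the expected speed.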


