Harry Brundage created SPARK-4732:
-------------------------------------

             Summary: All application progress on the standalone scheduler can 
be halted by one systematically faulty node
                 Key: SPARK-4732
                 URL: https://issues.apache.org/jira/browse/SPARK-4732
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.1.0, 1.2.0
         Environment:  - Spark Standalone scheduler
            Reporter: Harry Brundage


We've experienced several cluster-wide outages caused by a single Spark worker 
failing systematically. By systematically, I mean that every executor launched 
by that worker will definitely fail for some reason outside Spark's control, 
like the log directory's disk being completely out of space, or a permissions 
error on a file that's always read during executor launch. We screw up all the 
time on our team and cause stuff like this to happen, but because of the way 
the standalone scheduler allocates resources, our cluster doesn't recover 
gracefully from these failures.

Correct me if I am wrong, but when there are more tasks to do than executors, 
I'm pretty sure the scheduler just waits for more resource offers and then 
allocates tasks from the queue to those resources. If an executor dies 
immediately after starting, the worker monitor process notices that it's dead. 
The master then allocates that worker's now-free cores/memory to a currently 
running application that is below its spark.cores.max, which in our case is 
usually the very app that just had the executor die. A new executor gets 
spawned on the same worker the last one just died on, gets allocated the one 
task that failed, and the whole process fails again for the same systematic 
reason; lather, rinse, repeat. This happens 10 times, or whatever the max task 
failure count is, and then the whole app is deemed a failure by the driver and 
shut down completely.
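
To make the loop concrete, here is a rough sketch in Scala of the behaviour 
described above. This is not Spark's actual scheduling code; the names, the 
worker model, and the failure limit of 10 are all illustrative.

case class Worker(id: String, systematicallyBroken: Boolean)

object RetryLoopSketch {
  // Stand-in for "launch an executor on this worker and run the task".
  // On a systematically broken worker this fails every single time.
  def launchSucceeds(w: Worker): Boolean = !w.systematicallyBroken

  // The master keeps re-offering the cores freed by the dead executor, which
  // in practice is usually the same worker the executor just died on.
  def scheduleWithRetries(worker: Worker, maxTaskFailures: Int = 10): Boolean = {
    var failures = 0
    while (failures < maxTaskFailures) {
      if (launchSucceeds(worker)) return true
      failures += 1
    }
    false // retry budget exhausted: the driver deems the whole app a failure
  }

  def main(args: Array[String]): Unit =
    println(scheduleWithRetries(Worker("worker-1", systematicallyBroken = true))) // false
}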

For us, we usually run roughly as many cores as we have Hadoop nodes. We also 
usually have many more input splits (and therefore tasks) than we have cores, 
which means the locality of the first few tasks, which I believe determines 
where our executors run, is well spread out over the cluster, and often covers 
90-100% of nodes. This means the likelihood of any application getting an 
executor scheduled on the broken node is quite high. So, in my experience, 
after an application goes through the above-mentioned process and dies, the 
next application to start, or to not yet be at its requested max capacity, 
gets an executor scheduled on the broken node and is promptly taken down as 
well. This happens over and over, to the point where none of our Spark jobs 
are making any progress because of one tiny permissions mistake on one node.

Now, I totally understand this is usually an "error between keyboard and 
screen" kind of situation, where it is the responsibility of the people 
deploying Spark to ensure it is deployed correctly. The systematic issues 
we've encountered are almost always of this nature: permissions errors, 
disk-full errors, one node not getting a new Spark jar because of a 
configuration error, configurations being out of sync, etc. That said, disks 
are going to fail or half-fail, fill up, node rot is going to ruin 
configurations, and so on, and as Hadoop clusters scale in size this becomes 
more and more likely, so I think it's reasonable to ask that Spark be 
resilient to this kind of failure and keep on truckin'.

I think a good, simple fix would be to have applications, or the master, 
blacklist workers (not executors) at a failure count lower than the task 
failure count. This would also serve as a belt-and-suspenders fix for 
SPARK-4498. If the scheduler stopped trying to schedule on nodes that fail a 
lot, we could still make progress. These blacklist events are really 
important, and I think they would need to be well logged and surfaced in the 
UI, but I'd rather log and carry on than fail hard. The tradeoff is that you 
also risk blacklisting every worker if there is something systematically 
wrong with communication or whatever else I can't imagine.
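
As a sketch of what I have in mind (again, not real Spark code; the class, 
threshold, and method names below are all made up), the master could keep a 
per-worker executor-failure count and filter its resource offers through it:

import scala.collection.mutable

class WorkerBlacklist(maxExecutorFailuresPerWorker: Int = 3) {
  private val failures = mutable.Map.empty[String, Int].withDefaultValue(0)

  // Called whenever an executor exits abnormally on a worker.
  def recordExecutorFailure(workerId: String): Unit = {
    failures(workerId) += 1
    if (isBlacklisted(workerId)) {
      // These events matter: they'd need to be logged loudly and shown in the UI.
      println(s"Blacklisting $workerId after ${failures(workerId)} executor failures")
    }
  }

  def isBlacklisted(workerId: String): Boolean =
    failures(workerId) >= maxExecutorFailuresPerWorker

  // The master would filter its resource offers through this before allocating
  // cores to an application that is below its spark.cores.max.
  def usableWorkers(workerIds: Seq[String]): Seq[String] =
    workerIds.filterNot(isBlacklisted)
}

A threshold like 3 keeps the blacklist well below the task failure limit, so a 
systematically broken worker is taken out of rotation before it can burn 
through an application's entire retry budget.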

Please let me know if I've misunderstood how the scheduler works, or if you 
need more information or anything like that, and I'll be happy to provide it.


