[jira] [Commented] (SPARK-24413) Executor Blacklisting shouldn't immediately fail the application if dynamic allocation is enabled and no active executors
[ https://issues.apache.org/jira/browse/SPARK-24413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494279#comment-16494279 ]

Thomas Graves commented on SPARK-24413:
---------------------------------------

Thanks for linking those. We can just dup this to SPARK-22148.

> Executor Blacklisting shouldn't immediately fail the application if dynamic
> allocation is enabled and no active executors
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-24413
>                 URL: https://issues.apache.org/jira/browse/SPARK-24413
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 2.3.0
>            Reporter: Thomas Graves
>            Priority: Major
>
> Currently, with executor blacklisting enabled and dynamic allocation on, if
> you have only 1 active executor (the spark.blacklist.killBlacklistedExecutors
> setting doesn't matter in this case; it can be on or off) and a task failure
> results in that executor getting blacklisted, then your entire application
> will fail. The error you get is something like:
>
>     Aborting TaskSet 0.0 because task 9 (partition 9)
>     cannot run anywhere due to node and executor blacklist.
>
> This is very undesirable behavior because you may have a huge job where one
> task is the long tail, and if that task happens to hit a bad node that gets
> blacklisted, the entire job fails.
>
> Ideally, since dynamic allocation is on, the scheduler should not fail
> immediately; it should let dynamic allocation try to get more executors.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
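
The behavior requested in the issue description (do not abort immediately when dynamic allocation could still supply a fresh executor) can be illustrated with a small decision function. This is only a sketch under stated assumptions, not Spark's scheduler code: `SchedulerState`, `should_abort_task_set`, and the `wait_timeout_s` grace period are hypothetical names and values introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class SchedulerState:
    """Hypothetical snapshot of scheduler knowledge; not a Spark class."""
    usable_executors: int             # executors not blacklisted for the task
    dynamic_allocation_enabled: bool  # whether new executors can be requested
    seconds_waiting: float            # time since the task became unschedulable


def should_abort_task_set(state: SchedulerState,
                          wait_timeout_s: float = 120.0) -> bool:
    """Decide whether to abort a task set whose task cannot run anywhere.

    wait_timeout_s is an assumed grace period, not a value from the issue.
    """
    if state.usable_executors > 0:
        return False  # the task can still be scheduled somewhere
    if not state.dynamic_allocation_enabled:
        return True   # no new executors will arrive, so fail right away
    # Dynamic allocation may still bring up a fresh executor:
    # keep waiting until the grace period runs out, then give up.
    return state.seconds_waiting >= wait_timeout_s
```

With this shape, the single-executor case from the report would no longer fail as soon as the executor is blacklisted; it would fail only if no usable executor appears within the assumed grace period.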
[jira] [Commented] (SPARK-24413) Executor Blacklisting shouldn't immediately fail the application if dynamic allocation is enabled and no active executors
[ https://issues.apache.org/jira/browse/SPARK-24413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494183#comment-16494183 ]

Imran Rashid commented on SPARK-24413:
--------------------------------------

Yeah, I agree about this. I linked two related JIRAs that are very close. I put down some thoughts earlier on those JIRAs about good ways to do this, but haven't had time to work on it.
[jira] [Commented] (SPARK-24413) Executor Blacklisting shouldn't immediately fail the application if dynamic allocation is enabled and no active executors
[ https://issues.apache.org/jira/browse/SPARK-24413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493978#comment-16493978 ]

Thomas Graves commented on SPARK-24413:
---------------------------------------

[~imranr] thoughts on this?