GitHub user squito opened a pull request: https://github.com/apache/spark/pull/13234

    [WIP] [SPARK-8426] Enhance Blacklist mechanism for fault-tolerance

    ## What changes were proposed in this pull request?

    Update of https://github.com/apache/spark/pull/8760 by @mwws. The current blacklist mechanism only considers one task at a time -- this expands that by considering:

    1. When we determine an executor is bad, we blacklist *all* tasks from that executor, both within the task set and in subsequent task sets.
    2. When many executors on a node appear to be bad, we blacklist the entire node.

    ## How was this patch tested?

    Unit tests via Jenkins. I also ran the additional tests proposed [here](https://github.com/apache/spark/pull/8559), which include blacklist tests.

    TODO:
    [ ] performance tests
    [ ] more internal comments (in particular on concurrency)
    [ ] manual testing on a cluster

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark blacklist-SPARK-8426

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13234

----
commit 975a2a3c2b810f6b462eb46813075aac4928c0ae
Author: mwws <wei....@intel.com>
Date:   2015-12-29T06:01:17Z

    enhance blacklist mechanism

    1. create new BlacklistTracker and BlacklistStrategy interface to support complex use case for blacklist mechanism.
    2. make Yarn allocator aware of node blacklist information
    3. three strategies implemented for convenience, also user can define his own strategy

    SingleTaskStrategy: remain default behavior before this change.
    AdvanceSingleTaskStrategy: enhance SingleTaskStrategy by supporting stage level node blacklist
    ExecutorAndNodeStrategy: different taskSet can share blacklist information.

commit 51d3c88720faffd6a1fb6910b999cdce0d446bcf
Author: mwws <wei....@intel.com>
Date:   2016-01-13T05:43:46Z

    change import order to meet new scala style check rule

commit 7e52311bcf4b5528d127d1d0a16bade7c039517e
Author: mwws <wei....@intel.com>
Date:   2016-02-23T05:28:56Z

    simplify code and fix typo

    1. fix compile error after rebase to latest codebas.
    2. simplify configuration.
    3. fix typo.
    4. enhance comment and unit text.
    5. remove unused import.
    6. remove ExecutorAndNode strategy.

commit b600604a0920054cf3b33bff047d84cbd302fb3c
Author: Imran Rashid <iras...@cloudera.com>
Date:   2016-05-10T17:49:05Z

    style

commit 45525a118db078f80b3e0e74abe7d7f2e04a7883
Author: Imran Rashid <iras...@cloudera.com>
Date:   2016-05-10T19:27:39Z

    small refactoring

commit f6bb6de673cae7058c26d2f124d3de0d2eb5b06b
Author: Imran Rashid <iras...@cloudera.com>
Date:   2016-05-20T21:09:13Z

    Merge branch 'master' into blacklist-SPARK-8426

----
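
As a reading aid for the two escalation rules in the pull request description (a bad executor is blacklisted for all tasks; a node with many bad executors is blacklisted entirely), here is a minimal Python sketch. It is not the code in this pull request and does not use Spark's scheduler API. The class name, method names, and both threshold values are hypothetical, chosen only for illustration; the description does not say how "bad" or "many" are defined.

```python
from collections import defaultdict


class BlacklistSketch:
    """Toy model of executor- and node-level blacklisting (not Spark's API)."""

    def __init__(self, max_failures_per_executor=2, max_bad_executors_per_node=2):
        # Both thresholds are assumptions; the PR text gives no concrete values.
        self.max_failures_per_executor = max_failures_per_executor
        self.max_bad_executors_per_node = max_bad_executors_per_node
        self.failures = defaultdict(int)       # executor id -> failed task count
        self.bad_executors = defaultdict(set)  # node -> blacklisted executor ids
        self.bad_nodes = set()                 # nodes blacklisted as a whole

    def task_failed(self, executor, node):
        """Record one task failure and escalate if a threshold is reached."""
        self.failures[executor] += 1
        # Rule 1: enough failures mark the executor bad for all tasks.
        if self.failures[executor] >= self.max_failures_per_executor:
            self.bad_executors[node].add(executor)
        # Rule 2: enough bad executors mark the entire node bad.
        if len(self.bad_executors[node]) >= self.max_bad_executors_per_node:
            self.bad_nodes.add(node)

    def can_schedule(self, executor, node):
        """Return True if a task may still be placed on this executor."""
        if node in self.bad_nodes:
            return False
        return executor not in self.bad_executors.get(node, set())
```

With the assumed defaults, two failures on one executor stop further tasks from being scheduled there, and once two executors on the same node are marked bad, every executor on that node is refused, including ones that never failed.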