Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/14162

I'd be curious if you find out what was wrong with that node. If it's the leveldb file not being created, that should be fixed by https://github.com/apache/spark/commit/aab99d31a927adfa9216dd14e76493a187b6d6e7, which is supposed to use the approved recovery path; and if that path is bad, I believe the NodeManager and all its services won't come up.

But ignoring the actual cause, I think if we put this in we should make it configurable, with the default being not to throw. From a YARN point of view I don't necessarily want one bad service to take the entire cluster down. For instance, let's say we have a bug in the Spark shuffle service and we try to deploy a 5000-node cluster: this change would now cause none of the NodeManagers to come up. But if my workload on that cluster is such that Spark is only like 1%, I don't necessarily want that to block the other 99% of jobs on the cluster while I try to fix the Spark shuffle handler or roll it back. This should also get better once we have the node blacklisting stuff in.
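A minimal sketch of the configurable behavior being asked for here. The flag name `stopOnFailure` and the `safeInit` helper are hypothetical, not the actual Spark or YARN API: the point is just that an aux-service init failure is swallowed (and logged) by default, and only rethrown to stop the NodeManager when an operator explicitly opts in.

```java
// Hypothetical sketch: gate "fail the NodeManager on aux-service init
// failure" behind an opt-in flag, defaulting to keeping the NM up.
public class AuxServiceInitSketch {

    /**
     * Runs the service's init logic. Returns true on success. On failure:
     * rethrows only when stopOnFailure is set (taking the NodeManager down),
     * otherwise logs and returns false so other services stay up.
     */
    public static boolean safeInit(Runnable init, boolean stopOnFailure) {
        try {
            init.run();
            return true;
        } catch (RuntimeException e) {
            if (stopOnFailure) {
                // Operator opted in: propagate and fail the NodeManager.
                throw e;
            }
            // Default: one bad service should not block the rest of the cluster.
            System.err.println("Aux service failed to initialize, continuing: "
                + e.getMessage());
            return false;
        }
    }
}
```

With the default (`stopOnFailure = false`), a buggy shuffle service logs its failure but the NodeManager still serves the other 99% of jobs; flipping the flag restores fail-fast behavior for operators who prefer it.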