On Tue, Apr 2, 2019 at 9:39 PM Ankur Gupta <ankur.gu...@cloudera.com> wrote:
> Hi Steve,
>
> Thanks for your feedback. From your email, I could gather the following
> two important points:
>
> 1. Report failures to something (cluster manager) which can opt to
>    destroy the node and request a new one
> 2. Pluggable failure detection algorithms
>
> Regarding #1, current blacklisting implementation does report blacklist
> status to Yarn here
> <https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L126>,
> which can choose to take appropriate action based on failures across
> different applications (though it seems it doesn't currently). This doesn't
> work in static allocation though and for other cluster managers. Those
> issues are still open:
>
> - https://issues.apache.org/jira/browse/SPARK-24016
> - https://issues.apache.org/jira/browse/SPARK-19755
> - https://issues.apache.org/jira/browse/SPARK-23485
>
> Regarding #2, that is a good point but I think that is optional and may
> not be tied to enabling the blacklisting feature in the current form.

I'd expect the algorithms to be done in the controllers, as failures were
reported.

One other thing to consider is how to react when you are down to ~0 nodes.
At that point you may as well give up on the blacklisting, because you've
just implicitly shut down the cluster. I seem to remember something (HDFS?)
trying to deal with that.

> Coming back to the concerns raised by Reynold, Chris and Steve, it seems
> that there are at least two tasks that we need to complete before we decide
> to enable blacklisting by default in its current form:
>
> 1. Avoid resource starvation because of blacklisting
> 2. Use exponential backoff for blacklisting instead of a configurable
>    threshold
> 3. Report blacklisting status to all cluster managers (I am not sure
>    if this is necessary to move forward though)
>
> Thanks for all the feedback. Please let me know if there are other
> concerns that we would like to resolve before enabling blacklisting.
>
> Thanks,
> Ankur
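
For illustration only, tasks 1 and 2 in the list above (avoiding starvation and using exponential backoff) could be sketched roughly as below. This is not Spark's blacklisting code: the class `BackoffBlacklist`, its fields, and the default values (60 s base, 1 h cap, 50% maximum excluded fraction) are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BackoffBlacklist:
    """Hypothetical tracker for illustration; not Spark's implementation."""
    total_nodes: int
    base_timeout_s: float = 60.0         # assumed first exclusion period
    max_timeout_s: float = 3600.0        # assumed cap on the exclusion period
    max_excluded_fraction: float = 0.5   # assumed starvation guard
    failures: dict = field(default_factory=dict)        # node -> consecutive failures
    excluded_until: dict = field(default_factory=dict)  # node -> expiry timestamp

    def timeout_for(self, failure_count: int) -> float:
        # Double the exclusion period per consecutive failure, up to the cap.
        return min(self.base_timeout_s * 2 ** (failure_count - 1), self.max_timeout_s)

    def active(self, now: float) -> set:
        # Nodes whose exclusion has not yet expired.
        return {n for n, t in self.excluded_until.items() if t > now}

    def record_failure(self, node: str, now: float) -> bool:
        """Return True if the node is excluded, False if the guard refused."""
        self.failures[node] = self.failures.get(node, 0) + 1
        excluded = self.active(now)
        limit = self.max_excluded_fraction * self.total_nodes
        if node not in excluded and len(excluded) + 1 > limit:
            return False  # excluding one more node would risk starving the app
        self.excluded_until[node] = now + self.timeout_for(self.failures[node])
        return True

    def record_success(self, node: str) -> None:
        # A healthy task resets the backoff for that node.
        self.failures.pop(node, None)
```

The guard in `record_failure` is one possible answer to the "down to ~0 nodes" case: once too large a share of the cluster is excluded, further exclusions are simply refused, so the application keeps some capacity even if every node looks unhealthy.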