Hi Han,

You may be seeing the same issue I described here:
https://issues.apache.org/jira/browse/SPARK-22342?focusedCommentId=16411780&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16411780

Do you see "TASK_LOST" in your driver logs? I got past that issue by updating my version of libmesos (see my second comment in the ticket).
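For anyone following the thread: the scheduler-level blacklist is normally controlled by the `spark.blacklist.*` settings, but as SPARK-19755 describes, the Mesos coarse-grained backend keeps its own internal per-node failure count that these properties do not govern. A minimal sketch of the usual configuration (`spark-defaults.conf` style), for reference only:

```properties
# Scheduler-level blacklisting (BlacklistTracker). This defaults to false
# in Spark 2.3.0, so setting it explicitly usually changes nothing:
spark.blacklist.enabled                      false

# Per-task-set limits consulted by TaskSetBlacklist when blacklisting is on:
spark.blacklist.task.maxTaskAttemptsPerNode  2

# Note: per SPARK-19755, MesosCoarseGrainedSchedulerBackend blacklists a
# node after repeated failures regardless of the settings above, which is
# why these flags alone do not resolve the behavior described below.
```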
There's also this PR that is in progress:
https://github.com/apache/spark/pull/20640

Susan

On Sun, Apr 8, 2018 at 4:06 PM, hantuzun <m...@hantuzun.com> wrote:
> Hi all,
>
> Spark currently has blacklisting enabled on Mesos, no matter what:
> [SPARK-19755][Mesos] Blacklist is always active for
> MesosCoarseGrainedSchedulerBackend
>
> Blacklisting also prevents new drivers from running on our nodes where
> previous drivers had failed tasks.
>
> We've tried restarting the Spark dispatcher before sending new tasks. Even
> creating new machines (with the same hostname) does not help.
>
> Looking at TaskSetBlacklist
> <https://github.com/apache/spark/blob/e18d6f5326e0d9ea03d31de5ce04cb84d3b8ab37/core/src/main/scala/org/apache/spark/scheduler/TaskSetBlacklist.scala#L66>,
> I don't understand how a fresh Spark job submitted from a fresh Spark
> Dispatcher starts saying all the nodes are blacklisted right away. How does
> Spark know about previous task failures?
>
> This issue severely interrupts us. How could we disable blacklisting on
> Spark 2.3.0? Creative ideas are welcome :)
>
> Best,
> Han
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Susan X. Huynh
Software engineer, Data Agility
xhu...@mesosphere.com