Yuchen Feng created SPARK-31373:
-----------------------------------

             Summary: Cluster tried to fetch blocks from blacklisted node of previous stage
                 Key: SPARK-31373
                 URL: https://issues.apache.org/jira/browse/SPARK-31373
             Project: Spark
          Issue Type: Bug
          Components: Block Manager
    Affects Versions: 2.4.2
            Reporter: Yuchen Feng

We enabled blacklisting in our Spark application, but recently we have seen a strange issue. Our code looks like this:

rdd.repartition(...).mapPartitions(...).groupByKey(...).map().collect()

In the mapPartitions stage, some executors hit the exception "Can't connect to host xxxxxx: Connection reset by peer" and the tasks on them failed, so every executor on that node was blacklisted, along with the node itself. These executors had completed some tasks before they were blacklisted.

Then, in the next stage (groupByKey(...).map()), the application failed with a fetch failure: an IndexOutOfBoundsException was thrown when a healthy executor tried to fetch a block from one of the blacklisted executors above. This has happened multiple times.
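
For reference, the shape of the job described above can be sketched as a small Scala application. This is only an illustration under assumptions, not a confirmed reproduction: the input data, partition counts, key function, object name, and application name are hypothetical and do not come from the report. Only `spark.blacklist.enabled` and the order of the RDD operations reflect what the report states.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of the pipeline in the report; all concrete values are assumptions.
object BlacklistShuffleSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SPARK-31373-sketch")          // assumed name
      .set("spark.blacklist.enabled", "true")    // blacklisting enabled, as in the report

    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100000)      // placeholder input, not from the report

      val result = rdd
        .repartition(200)                        // first shuffle (partition count assumed)
        // Stage in which the report saw "Connection reset by peer" on some executors.
        // Tasks that finished here have already written shuffle output on their executor.
        .mapPartitions(iter => iter.map(x => (x % 100, x)))
        // Second shuffle: tasks of the next stage fetch the blocks written above,
        // including blocks that live on executors blacklisted in the meantime.
        .groupByKey(50)
        .map { case (key, values) => (key, values.size) }
        .collect()

      println(s"collected ${result.length} keys")
    } finally {
      sc.stop()
    }
  }
}
```

The master URL is deliberately left unset so that the job can be launched with spark-submit against a real multi-node cluster. The behaviour in the report involves executors on a blacklisted node, so a local-mode run is not expected to show it.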