[ 
https://issues.apache.org/jira/browse/SPARK-31373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Feng updated SPARK-31373:
--------------------------------
    Environment: EMR cluster with r5.4xlarge and r5.8xlarge instances

> Cluster tried to fetch blocks from blacklisted node of previous stage
> ---------------------------------------------------------------------
>
>                 Key: SPARK-31373
>                 URL: https://issues.apache.org/jira/browse/SPARK-31373
>             Project: Spark
>          Issue Type: Question
>          Components: Block Manager
>    Affects Versions: 2.4.2
>         Environment: EMR cluster with r5.4xlarge and r5.8xlarge instances
>            Reporter: Yuchen Feng
>            Priority: Major
>
> We enabled blacklisting in our Spark application, but recently we have seen a
> strange issue.
> Our code looks like this:
>  rdd.repartition(...).mapPartitions(...).groupByKey(...).map().collect()
> In the mapPartitions stage, some executors hit the exception "Can't connect to
> host xxxxxx: Connection reset by peer" and the tasks on them failed, so all
> executors on that node were blacklisted, as was the node itself. These
> executors had completed some tasks before being blacklisted.
> Then in the next stage (groupByKey(...).map()), the application failed with a
> fetch failure (an IndexOutOfBoundsException) when a healthy executor tried to
> fetch a block from one of the blacklisted executors above.
> This has happened multiple times.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
