[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082191#comment-15082191 ]

Mridul Muralidharan commented on SPARK-6166:
--------------------------------------------


The per-node resource usage incurred while handling a large number of inbound 
(shuffle) connections causes the node to go down - either directly due to 
resource exhaustion (direct buffer exhaustion, OOM, etc.), or because YARN 
kills the executor when it consumes memory beyond its resource limits.
Before the fix, even dedicating 75% of executor memory to overhead in a 
700-node application with 56GB per executor was insufficient to alleviate the 
issue in our production cluster.
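
For concreteness, a minimal sketch of how that overhead headroom is expressed 
with the Spark 1.x-era spark.yarn.executor.memoryOverhead setting (value in MB). 
The app name and the 56GB / 75% split below are only illustrative of the 
figures quoted above, not a recommendation:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative numbers only: a 56GB container split roughly 75/25 between
    // off-heap overhead (~42GB) and JVM heap (14GB), matching the figures above.
    val conf = new SparkConf()
      .setAppName("shuffle-heavy-job")                    // hypothetical app name
      .set("spark.executor.memory", "14g")
      .set("spark.yarn.executor.memoryOverhead", "43008") // ~42GB of overhead, in MB
    val sc = new SparkContext(conf)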

Note that while this fix does not directly remove the root cause [1], it 
drastically reduces its occurrence in our production clusters. Removing the 
root cause would probably require some central coordination and is not worth 
the effort.

[1] It is still possible for a small number of mappers combined with a large 
number of reducers to swamp the mappers.
This can be due either to actually having a small number of mappers, or to 
skew.

> Add config to limit number of concurrent outbound connections for shuffle 
> fetch
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-6166
>                 URL: https://issues.apache.org/jira/browse/SPARK-6166
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Mridul Muralidharan
>            Assignee: Shixiong Zhu
>            Priority: Minor
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of 
> size.
> But this is not always sufficient: when the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes - causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight - which puts a bound on 
> the number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests it has 
> significantly reduced the occurrence of worker failures.
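
For illustration, a minimal sketch of how the two bounds from the description 
above might be set side by side, assuming the proposed spark.reducer.maxReqsInFlight 
key is adopted as named; the values are placeholders, not tuning advice:

    import org.apache.spark.SparkConf

    // Existing size-based bound on in-flight shuffle-fetch data, plus the
    // proposed bound on the number of outstanding fetch requests.
    val conf = new SparkConf()
      .set("spark.reducer.maxMbInFlight", "48")    // in-flight data bound, in MB
      .set("spark.reducer.maxReqsInFlight", "64")  // proposed request-count bound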


