[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082164#comment-15082164 ]
Shixiong Zhu commented on SPARK-6166:
-------------------------------------

[~mridulm80] AFAIK, Netty is able to support thousands of connections. Could you post the exception you encountered in your case? Is it a connection reset exception or a timeout exception? I just want to know whether Netty cannot handle thousands of connections, or whether Spark cannot reply to thousands of requests in time.

> Add config to limit number of concurrent outbound connections for shuffle fetch
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-6166
>                 URL: https://issues.apache.org/jira/browse/SPARK-6166
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Mridul Muralidharan
>            Assignee: Shixiong Zhu
>            Priority: Minor
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of size.
> But this is not always sufficient: when the number of hosts in the cluster increases, it can lead to a very large number of in-bound connections to one or more nodes, causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on the number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests it has significantly reduced the occurrence of worker failures.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
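As a sketch of how the proposed setting could sit alongside the existing size-based bound, a configuration fragment might look like the following. Note that spark.reducer.maxReqsInFlight is the name proposed in this issue, not yet a released setting at the affected version, and the values shown are purely illustrative:

```
# spark-defaults.conf (illustrative values only)
spark.reducer.maxMbInFlight    48     # existing bound on in-flight shuffle data, in MB
spark.reducer.maxReqsInFlight  512    # proposed bound on outstanding outbound fetch requests
```

With both bounds in place, a reducer would stop issuing new fetch requests once either the total bytes in flight or the number of outstanding requests reaches its cap, which is what limits the fan-in of connections on any single serving node.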