[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144146#comment-15144146 ]

Reynold Xin commented on SPARK-6166:
------------------------------------

[~zsxwing] can you update the title of this ticket to something that reflects what's been resolved? Thanks.

> Add config to limit number of concurrent outbound connections for shuffle fetch
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-6166
>                 URL: https://issues.apache.org/jira/browse/SPARK-6166
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Mridul Muralidharan
>            Priority: Minor
>             Fix For: 2.0.0
>
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of size.
> But this is not always sufficient: when the number of hosts in the cluster
> increases, this can lead to a very large number of inbound connections to one
> or more nodes, causing workers to fail under the load.
> I propose we also add spark.reducer.maxReqsInFlight, which puts a bound on the
> number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests this has
> significantly reduced the occurrence of worker failures.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107526#comment-15107526 ]

Apache Spark commented on SPARK-6166:
-------------------------------------

User 'redsanket' has created a pull request for this issue:
https://github.com/apache/spark/pull/10838

> Add config to limit number of concurrent outbound connections for shuffle fetch
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-6166
>                 URL: https://issues.apache.org/jira/browse/SPARK-6166
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Mridul Muralidharan
>            Assignee: Shixiong Zhu
>            Priority: Minor
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103065#comment-15103065 ]

Mridul Muralidharan commented on SPARK-6166:
--------------------------------------------

We actually don't care about the number of sockets or connections, but about the number of block requests that are going out. That is what we want to limit: think of it as a limit on the number of active requests, corresponding to the one we already have for the number of outstanding bytes in flight.
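The idea in the comment above, a cap on outstanding requests alongside the existing cap on outstanding bytes, can be illustrated with a small standalone sketch. This is not Spark's implementation: `FetchRequest`, `FetchThrottle`, and all parameter names below are hypothetical, and the choice to always admit one request when nothing is in flight is an assumption made here so that a single oversized request cannot stall the queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FetchRequest:
    host: str        # remote node that serves the blocks (hypothetical field)
    size_bytes: int  # total size of the blocks in this request


class FetchThrottle:
    """Admits queued requests while both in-flight limits hold."""

    def __init__(self, max_bytes_in_flight: int, max_reqs_in_flight: int):
        self.max_bytes_in_flight = max_bytes_in_flight
        self.max_reqs_in_flight = max_reqs_in_flight
        self.bytes_in_flight = 0
        self.reqs_in_flight = 0
        self.pending = deque()

    def submit(self, request: FetchRequest) -> None:
        self.pending.append(request)

    def _can_send(self, request: FetchRequest) -> bool:
        # Assumption: always allow one request when nothing is in flight.
        if self.reqs_in_flight == 0:
            return True
        within_reqs = self.reqs_in_flight + 1 <= self.max_reqs_in_flight
        within_bytes = (self.bytes_in_flight + request.size_bytes
                        <= self.max_bytes_in_flight)
        return within_reqs and within_bytes

    def drain(self) -> list:
        """Return the requests that may be sent now, in FIFO order."""
        sent = []
        while self.pending and self._can_send(self.pending[0]):
            request = self.pending.popleft()
            self.bytes_in_flight += request.size_bytes
            self.reqs_in_flight += 1
            sent.append(request)
        return sent

    def complete(self, request: FetchRequest) -> None:
        """Release the budget held by a finished request."""
        self.bytes_in_flight -= request.size_bytes
        self.reqs_in_flight -= 1


# Usage: five small requests, at most two outstanding at a time.
throttle = FetchThrottle(max_bytes_in_flight=48 * 1024 * 1024,
                         max_reqs_in_flight=2)
for i in range(5):
    throttle.submit(FetchRequest(host=f"node-{i}", size_bytes=1024))
first = throttle.drain()     # admits two requests, then hits the request cap
throttle.complete(first[0])  # one finishes and frees a slot
second = throttle.drain()    # admits one more
```

The byte limit alone would have admitted all five requests here; the request cap is what holds the remaining ones back until earlier requests complete.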
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101995#comment-15101995 ]

Sanket Reddy commented on SPARK-6166:
-------------------------------------

Hi, I modified the code to fit the latest Spark build; I will have the patch up soon.
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082562#comment-15082562 ]

Shixiong Zhu commented on SPARK-6166:
-------------------------------------

[~mridulm80] Thanks for your clarification. I saw your PR, but it actually adds a limit on the number of blocks rather than on the number of connections. So I want to know whether we should limit the max connections or just the max number of blocks. I will investigate which one is better.
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082354#comment-15082354 ]

Mridul Muralidharan commented on SPARK-6166:
--------------------------------------------

/CC [~tgraves] who might be interested in this.
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082191#comment-15082191 ]

Mridul Muralidharan commented on SPARK-6166:
--------------------------------------------

The resource usage incurred per node while handling a large number of inbound (shuffle) connections causes the node to go down, either directly due to resource issues (direct buffer exhaustion, OOM, etc.) or due to YARN killing the executor when it consumes memory outside of its resource limits.
Before the fix, setting 75% of memory to overhead in a 700 node application at 56GB each was insufficient to alleviate the issue in our production cluster.

Note that this fix does not directly remove the root cause [1]; it drastically reduces its occurrence in our production clusters.
Removing the root cause will probably require some central coordination and is not worth the effort.

[1] It is still possible for a small number of mappers and a large number of reducers to swamp the mappers. This can be due either to actually having a small number of mappers or to skew.
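Footnote [1] in the comment above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes every reducer keeps its full request budget outstanding and spreads requests evenly over the hosts that hold map output; the function name is hypothetical, and the figures of 2000 reducers and a cap of 5 are invented for illustration (only the 700 node count comes from the thread). With skew the load on a single host can be higher than this even-spread estimate.

```python
import math


def inbound_requests_per_mapper_host(num_reducers: int,
                                     max_reqs_in_flight: int,
                                     num_mapper_hosts: int) -> int:
    """Estimate concurrent inbound requests per host under an even spread."""
    total_outstanding = num_reducers * max_reqs_in_flight
    return math.ceil(total_outstanding / num_mapper_hosts)


# Map output spread over 700 hosts: the per-host load stays small.
wide = inbound_requests_per_mapper_host(2000, 5, 700)

# Map output concentrated on 10 hosts: the same cap still allows
# far more concurrent requests per host.
narrow = inbound_requests_per_mapper_host(2000, 5, 10)
```

Under these assumptions the per-reducer cap bounds the total number of outstanding requests, but not how they are distributed, which is consistent with the remark that removing the root cause would need central coordination.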
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082164#comment-15082164 ]

Shixiong Zhu commented on SPARK-6166:
-------------------------------------

[~mridulm80] AFAIK, Netty is able to support thousands of connections. Could you post the exception you encountered in your case? Is it a connection reset exception or a timeout exception? I just want to know whether it's because Netty cannot handle thousands of connections or because Spark cannot reply to thousands of requests in time.
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075757#comment-15075757 ]

Shixiong Zhu commented on SPARK-6166:
-------------------------------------

Sure
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075678#comment-15075678 ]

Reynold Xin commented on SPARK-6166:
------------------------------------

[~zsxwing] can you pick this one up?
[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch
[ https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524205#comment-14524205 ]

Apache Spark commented on SPARK-6166:
-------------------------------------

User 'mridulm' has created a pull request for this issue:
https://github.com/apache/spark/pull/5852