[jira] [Commented] (YARN-10738) When multi thread scheduling with multi node, we should shuffle with a gap to prevent hot accessing nodes.
[ https://issues.apache.org/jira/browse/YARN-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802645#comment-17802645 ]

Shilun Fan commented on YARN-10738:
-----------------------------------

Bulk update: moved all 3.4.0 non-blocker issues, please move back if it is a blocker. Retarget 3.5.0.

> When multi thread scheduling with multi node, we should shuffle with a gap to
> prevent hot accessing nodes.
> ----------------------------------------------------------------------------
>
>                 Key: YARN-10738
>                 URL: https://issues.apache.org/jira/browse/YARN-10738
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Now the multi threading scheduling with multi node is not reasonable.
> In large clusters, it will cause the hot accessing nodes, which will lead the
> abnormal boom node.
> Solution:
> I think we should shuffle the sorted node (such the available resource sort
> policy) with an interval.
> I will solve the above problem, and avoid the hot accessing node.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10738) When multi thread scheduling with multi node, we should shuffle with a gap to prevent hot accessing nodes.
[ https://issues.apache.org/jira/browse/YARN-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340599#comment-17340599 ]

Qi Zhu commented on YARN-10738:
-------------------------------

Thanks a lot [~bibinchundatt] for the reply and the valuable information.

For the hot spot cases above, we could allocate based on dominant resource utilization: if the vcores on a node are full, vcores are its dominant resource, and we would allocate on other nodes whose dominant resource utilization is not full.

Even when sorting by dominant resource utilization, I think we still need to shuffle, but the shuffle gap may need to be consistent with the cluster size.
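The dominant resource utilization idea in the comment above can be sketched as follows. This is a minimal illustration in Python, not YARN code: the `Node` fields, `dominant_utilization`, and `candidates` are hypothetical names, and treating a ratio of 1.0 as "full" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    """Hypothetical node record; field names are illustrative only."""
    node_id: str
    mem_total: int      # MB
    mem_used: int       # MB
    vcores_total: int
    vcores_used: int


def dominant_utilization(node: Node) -> float:
    """Return the larger of the memory and vcore utilization ratios."""
    mem_ratio = node.mem_used / node.mem_total
    cpu_ratio = node.vcores_used / node.vcores_total
    return max(mem_ratio, cpu_ratio)


def candidates(nodes: List[Node]) -> List[Node]:
    """Drop nodes whose dominant resource is full, least utilized first."""
    usable = [n for n in nodes if dominant_utilization(n) < 1.0]
    return sorted(usable, key=lambda n: (dominant_utilization(n), n.node_id))


nodes = [
    # Plenty of free memory, but every vcore is taken.
    Node("node-1", mem_total=65536, mem_used=8192, vcores_total=16, vcores_used=16),
    Node("node-2", mem_total=65536, mem_used=32768, vcores_total=16, vcores_used=4),
    Node("node-3", mem_total=65536, mem_used=16384, vcores_total=16, vcores_used=12),
]
ordered = [n.node_id for n in candidates(nodes)]
print(ordered)
```

Here node-1 is skipped even though most of its memory is free, because its vcores are exhausted. The remaining nodes would still all be visited in the same order by every scheduling thread, which is why the comment argues a shuffle is still needed.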
[jira] [Commented] (YARN-10738) When multi thread scheduling with multi node, we should shuffle with a gap to prevent hot accessing nodes.
[ https://issues.apache.org/jira/browse/YARN-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338910#comment-17338910 ]

Bibin Chundatt commented on YARN-10738:
---------------------------------------

[~zhuqi]

The following are the probable issues I see with using ResourceUsageMultiNodeLookupPolicy on a large cluster, which could cause hot spots. The sorting is based on available resources, considering memory, CPU, then node ID.
# If memory is available on a node but its vcores are full, we still use the full node for an allocation attempt.
# If the cluster has nodes of different resource sizes, the hot spot cases become more serious. The larger machines are always preferred, creating under-utilization in lower-profile machines.
# If all the nodes are the same size and unused, the ordering is based on node ID, which could cause allocation attempts on machines in canonical order.
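Points 1 and 3 above can be reproduced with a small model of the ordering as the comment describes it (available memory, then CPU, then node ID). This is a sketch in Python under that assumption, not the actual comparator in ResourceUsageMultiNodeLookupPolicy; `Node` and `lookup_order` are hypothetical names.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    """Hypothetical view of a node's free resources."""
    node_id: str
    avail_mem: int      # MB
    avail_vcores: int


def lookup_order(nodes: List[Node]) -> List[str]:
    """Most available memory first, then most available vcores, then node ID."""
    ranked = sorted(nodes, key=lambda n: (-n.avail_mem, -n.avail_vcores, n.node_id))
    return [n.node_id for n in ranked]


# Point 1: node-a has the most free memory but no free vcores,
# yet it is still the first node tried.
mixed = [
    Node("node-b", avail_mem=16384, avail_vcores=8),
    Node("node-a", avail_mem=49152, avail_vcores=0),
    Node("node-c", avail_mem=16384, avail_vcores=4),
]
mixed_order = lookup_order(mixed)

# Point 3: identical idle nodes fall back to node ID order, so every
# scheduling thread starts from the same node.
idle = [Node(f"node-{i:02d}", avail_mem=65536, avail_vcores=16) for i in (3, 1, 2)]
idle_order = lookup_order(idle)

print(mixed_order)
print(idle_order)
```

Because memory is compared first, a node with no free vcores can still sort to the front, and ties resolve deterministically by node ID.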
[jira] [Commented] (YARN-10738) When multi thread scheduling with multi node, we should shuffle with a gap to prevent hot accessing nodes.
[ https://issues.apache.org/jira/browse/YARN-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335104#comment-17335104 ]

Qi Zhu commented on YARN-10738:
-------------------------------

Thanks [~Jim_Brennan] for the review and the very patient investigation.

The original ResourceUsageMultiNodeLookupPolicy sometimes causes hot nodes in our test cluster, and the gap shuffle reduced the hot node cases by more than 50%. The gap of 10 is something we should discuss: it is related to the size of the cluster, and we will get a better result if we choose a good gap.

I agree with you that another option to consider would be to have a policy that uses node utilization, which should more accurately reflect how busy the node is. We should also shuffle when sorting by node utilization, because with multi thread scheduling the threads will commit to the same first node, which causes the hot node, and the hot node is the big bottleneck of a real-time cluster.

Actually, the hot node mainly affects real-time clusters, because they are stricter about job delay.

Thanks.
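One reading of the gap shuffle discussed above is: keep the sorted order between windows of `gap` nodes, but randomize the order inside each window. The Python sketch below illustrates that reading only; it is not the code from the pull request, and `gap_shuffle` and its parameters are hypothetical names.

```python
import random
from typing import List, Optional, Sequence


def gap_shuffle(sorted_nodes: Sequence[str], gap: int = 10,
                rng: Optional[random.Random] = None) -> List[str]:
    """Shuffle each consecutive window of `gap` nodes, keeping window order."""
    if gap < 1:
        raise ValueError("gap must be at least 1")
    rng = rng or random.Random()
    result: List[str] = []
    for start in range(0, len(sorted_nodes), gap):
        window = list(sorted_nodes[start:start + gap])
        rng.shuffle(window)   # randomize only within this window
        result.extend(window)
    return result


# 25 nodes, already sorted by whatever policy is in use.
sorted_nodes = [f"node-{i:03d}" for i in range(25)]
shuffled = gap_shuffle(sorted_nodes, gap=10, rng=random.Random(42))
print(shuffled[:10])
```

With this scheme the best 10 nodes are still tried before the next 10, but different threads are less likely to start on the same node. A larger gap spreads the load further at the cost of a looser sort, which is the trade-off behind tying the gap to the cluster size.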
[jira] [Commented] (YARN-10738) When multi thread scheduling with multi node, we should shuffle with a gap to prevent hot accessing nodes.
[ https://issues.apache.org/jira/browse/YARN-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335017#comment-17335017 ]

Jim Brennan commented on YARN-10738:
------------------------------------

[~zhuqi], I am not very familiar with the multi-threaded scheduling code - we have not started using it yet. So it would be very helpful if you could provide more details about what you are observing in your cluster, and how you think this will fix it. Is your cluster made up of many nodes that are the same size, or do you have a mix of different sizes? If you have any data that shows some nodes being more heavily utilized than others, that would be helpful.

Looking at {{ResourceUsageMultiNodeLookupPolicy}}, it seems to sort by the resources allocated to a node, so this seems to be trying to ensure we allocate more evenly across nodes. It doesn't consider the relative sizes of the nodes though, so in a heterogeneous cluster, I could see it leading to smaller nodes being busier than larger nodes. I wonder if a reverse sort by unallocated resources might be more fair, because it would favor nodes that have more room for new resource requests, rather than those that currently have fewer resources allocated.

Another option to consider would be to have a policy that uses node utilization, which should more accurately reflect how busy the node is.

With respect to the policy proposed in this ticket, I am not convinced it will help very much. It's doing the same sort by allocated resources, but just adding a shuffle of every 10 nodes. I'm not sure how much that will help in practice on a large cluster. A rack is usually more than 10 nodes, so it's possible the same set of racks will be over-utilized. Again, it would be helpful if you had some before/after data to show how it helps in a real cluster.
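The difference between sorting by allocated resources and reverse sorting by unallocated resources shows up in a small heterogeneous cluster. The Python sketch below uses memory only and made-up node sizes; `Node`, `by_allocated`, and `by_unallocated` are hypothetical names, not YARN APIs.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    """Hypothetical node record with memory figures in MB."""
    node_id: str
    total_mem: int
    allocated_mem: int

    @property
    def unallocated_mem(self) -> int:
        return self.total_mem - self.allocated_mem


def by_allocated(nodes: List[Node]) -> List[str]:
    """Least allocated first, ignoring node size."""
    ranked = sorted(nodes, key=lambda n: (n.allocated_mem, n.node_id))
    return [n.node_id for n in ranked]


def by_unallocated(nodes: List[Node]) -> List[str]:
    """Most unallocated (most room for new requests) first."""
    ranked = sorted(nodes, key=lambda n: (-n.unallocated_mem, n.node_id))
    return [n.node_id for n in ranked]


cluster = [
    Node("small-1", total_mem=16384, allocated_mem=8192),     # 50% used, 8 GB free
    Node("large-1", total_mem=131072, allocated_mem=32768),   # 25% used, 96 GB free
    Node("large-2", total_mem=131072, allocated_mem=16384),   # 12.5% used, 112 GB free
]

print(by_allocated(cluster))
print(by_unallocated(cluster))
```

Sorting by allocated memory puts the half-full small node first because its absolute allocation is the lowest, while the reverse sort by unallocated memory prefers the large nodes that have the most headroom.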
[jira] [Commented] (YARN-10738) When multi thread scheduling with multi node, we should shuffle with a gap to prevent hot accessing nodes.
[ https://issues.apache.org/jira/browse/YARN-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322560#comment-17322560 ]

Qi Zhu commented on YARN-10738:
-------------------------------

[~Jim_Brennan] Could you help review this when you are free?

Thanks.
[jira] [Commented] (YARN-10738) When multi thread scheduling with multi node, we should shuffle with a gap to prevent hot accessing nodes.
[ https://issues.apache.org/jira/browse/YARN-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322004#comment-17322004 ]

Qi Zhu commented on YARN-10738:
-------------------------------

I have changed to a *MultiNodeLookupPolicy* implementation in the latest PR, as suggested by [~bibinchundatt].

Thanks.