[GitHub] [hadoop-ozone] timmylicheng commented on issue #668: HDDS-3139. Pipeline placement should max out pipeline usage
timmylicheng commented on issue #668: URL: https://github.com/apache/hadoop-ozone/pull/668#issuecomment-616920062

@sodonnel Thanks for the benchmarking effort. That is a valid concern for large clusters in the future. I created https://issues.apache.org/jira/browse/HDDS-3466 to track this potential bottleneck. Say we allow 5 pipelines per DN; 10K pipelines require 10K*3/5 = 6K DNs. My team is testing how many DNs each OM and SCM can hold up. My takeaway is that we will need to resolve quite a few GRPC and Ratis issues before we move Ozone up to 6K DNs per SCM, so we may not see this bottleneck in production clusters in the short term. However, we definitely want to track it, and do some testing, to prevent it from becoming a blocker. The major impact, once it becomes a bottleneck, is that the createPipeline process would take a long time, which could cause a slow cold start for SCM and DNs. On the other hand, 10K pipelines lead to many pipeline reports and container reports for a single SCM. We may ultimately want to limit the overall pipeline count per SCM.

@fapifta Yes, I agree with you. Let's use HDDS-3466 to track this; we will learn more about SCM's capabilities and the best setup as we gradually move up to large clusters. My team is testing 200 DNs with a single OM and SCM at this point (albeit in a k8s environment). Hopefully we can move further and get a clearer picture.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org
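The sizing arithmetic above (10K pipelines at 5 pipelines per DN needing 6K DNs) can be sketched as a back-of-the-envelope helper. This is illustrative only, not actual Ozone code; the class and method names are made up for the example.

```java
public class PipelineSizing {
    // Datanodes needed so that `pipelines` factor-THREE Ratis pipelines
    // fit under a per-DN pipeline limit: each pipeline occupies 3 DN slots.
    static int datanodesNeeded(int pipelines, int pipelinesPerDatanode) {
        int slots = pipelines * 3; // replication factor THREE
        // ceiling division: a partially filled last DN still counts
        return (slots + pipelinesPerDatanode - 1) / pipelinesPerDatanode;
    }

    public static void main(String[] args) {
        // 10K pipelines at 5 pipelines per DN -> 10K * 3 / 5 = 6K DNs
        System.out.println(datanodesNeeded(10_000, 5)); // prints 6000
    }
}
```

The same formula in reverse gives the pipeline capacity of a fixed cluster, which is what the benchmark numbers later in this thread exercise.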
[GitHub] [hadoop-ozone] timmylicheng commented on issue #668: HDDS-3139. Pipeline placement should max out pipeline usage
timmylicheng commented on issue #668: HDDS-3139. Pipeline placement should max out pipeline usage URL: https://github.com/apache/hadoop-ozone/pull/668#issuecomment-613865710

> I think this looks better and I have just a few minor comments. I added a few inline plus these two:
>
> At line 243:
>
> ```
> // First choose an anchor node randomly
> DatanodeDetails anchor = chooseNode(healthyNodes);
> ```
>
> This no longer picks a random node - it just picks the lowest-loaded node. I wonder if this first anchor node should be a random node?
>
> Method getHigherLoadNodes() does not seem to be used any longer, so we can remove it.
>
> This code has changed a lot with this patch, and this is really my first time looking at it. It would be great if @ChenSammi could give this a review too, after addressing the minor points I mentioned here.

I've addressed these comments. Thanks. I will check whether @ChenSammi has the bandwidth to do a review.
[GitHub] [hadoop-ozone] timmylicheng commented on issue #668: HDDS-3139. Pipeline placement should max out pipeline usage
timmylicheng commented on issue #668: HDDS-3139. Pipeline placement should max out pipeline usage URL: https://github.com/apache/hadoop-ozone/pull/668#issuecomment-611927572

> > @sodonnel I rebased the code with minor conflicts in the test class, but the test won't pass. I took a close look and made some changes. But I realized the issue I mentioned in the last comment about how to leverage chooseNodeFromTopology. I'd like to hear your thoughts.
>
> I think one problem is this line:
>
> ```
> datanodeDetails = nodes.stream().findAny().get();
> ```
>
> The findAny method does not seem to return a random entry - so the same node is returned until it uses up its pipeline allocation.
>
> I am also not sure about the limit calculation in getLowerLoadNodes:
>
> ```
> int limit = nodes.size() * heavyNodeCriteria
>     / HddsProtos.ReplicationFactor.THREE.getNumber();
> ```
>
> Adding debug output, I find this method starts to return an empty list while there are still available nodes to handle the pipeline.
>
> Also, in `filterViableNodes()` via the `meetCriteria()` method, nodes above the heavy-load limit are already filtered out, so the healthy node list is guaranteed to contain only nodes with the capacity to take another pipeline. So I wonder why we need to filter the nodes further.
>
> > But I realized the issue I mentioned in the last comment about how to leverage chooseNodeFromTopology.
>
> There seems to be some inconsistency in how we pick the nodes (not just in this PR, but in the wider code). E.g. in `chooseNodeBasedOnRackAwareness()` we don't call into NetworkTopology, but instead use the `getNetworkLocation()` method on the `DatanodeDetails` object to find nodes that do not match the anchor's location. Then later in `chooseNodeFromNetworkTopology()` we try to find a node whose location is equal to the anchor's, and that is where we call into `networkTopology.chooseRandom()`. Could we not avoid that call, and avoid generating a new list of nodes, by doing something similar to `chooseNodeBasedOnRackAwareness()`, using the `getNetworkLocation()` method to find matching nodes? That would probably be more efficient than the current implementation.
>
> As we are also then able to re-use the same list of healthy nodes everywhere without more filtering, maybe we could sort that list once by pipeline count in filterViableNodes or meetCriteria and then always pick the node with the lowest load, filling the nodes up that way.
>
> I hope this comment makes sense as it is very long.

@sodonnel Thanks for the consideration. I've updated the patch according to your example.
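The `findAny()` problem described above can be demonstrated in isolation: on a sequential stream, `findAny()` in practice returns the first element, so repeated calls keep choosing the same node. A sketch of the issue and one hedged alternative (indexing with a `Random`, which is illustrative and not necessarily the fix the PR adopted):

```java
import java.util.List;
import java.util.Random;

public class RandomPick {
    private static final Random RANDOM = new Random();

    // On a sequential stream, findAny() is free to (and in practice does)
    // return the first element, so this is not a random choice at all.
    static <T> T findAnyPick(List<T> nodes) {
        return nodes.stream().findAny().get();
    }

    // A genuinely random choice: index into the list directly.
    static <T> T randomPick(List<T> nodes) {
        return nodes.get(RANDOM.nextInt(nodes.size()));
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("dn1", "dn2", "dn3");
        System.out.println(findAnyPick(nodes)); // "dn1" on sequential streams
        System.out.println(randomPick(nodes));  // any of the three, uniformly
    }
}
```

`findAny()` only behaves nondeterministically on parallel streams; its contract never promises uniform randomness, which is why the same lowest node kept being returned until it exhausted its pipeline allocation.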
[GitHub] [hadoop-ozone] timmylicheng commented on issue #668: HDDS-3139 Pipeline placement should max out pipeline usage
timmylicheng commented on issue #668: HDDS-3139 Pipeline placement should max out pipeline usage URL: https://github.com/apache/hadoop-ozone/pull/668#issuecomment-607065266

> Hi @timmylicheng this patch is giving some conflicts now, probably as we merged the other pipeline-related change. Could you rebase against master and push it again please?

@sodonnel I rebased the code with minor conflicts in the test class, but the test won't pass. I took a close look and made some changes. But I realized the issue I mentioned in the last comment about how to leverage chooseNodeFromTopology. I'd like to hear your thoughts.
[GitHub] [hadoop-ozone] timmylicheng commented on issue #668: HDDS-3139 Pipeline placement should max out pipeline usage
timmylicheng commented on issue #668: HDDS-3139 Pipeline placement should max out pipeline usage URL: https://github.com/apache/hadoop-ozone/pull/668#issuecomment-603840741

Before: pipeline max: 13, pipeline actual: 12. Pipeline counts per node: 1, 5, 5, 5, 5, 5, 5, 5.

After: pipeline max: 13, pipeline actual: 13. Pipeline counts per node: 5, 5, 5, 5, 5, 4, 5, 5.

When the node count gets large, they perform identically.

Before: pipeline max: 63, pipeline actual: 63. Pipeline counts per node: 5 on 37 nodes, 4 on 1 node.

After: pipeline max: 63, pipeline actual: 63. Pipeline counts per node: 5 on 37 nodes, 4 on 1 node.
[GitHub] [hadoop-ozone] timmylicheng commented on issue #668: HDDS-3139 Pipeline placement should max out pipeline usage
timmylicheng commented on issue #668: HDDS-3139 Pipeline placement should max out pipeline usage URL: https://github.com/apache/hadoop-ozone/pull/668#issuecomment-600464593

When there are 5 nodes and each is allowed to have 5 pipelines:

Former pipeline placement: pipeline counts per node: 2, 5, 4, 5, 5 (pipeline max: 8, pipeline actual: 7).

After this change: pipeline counts per node: 4, 5, 5, 5, 5 (pipeline max: 8, pipeline actual: 8).
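The "max out pipeline usage" behavior in these benchmarks can be reproduced with a toy greedy simulation: repeatedly assign each factor-THREE pipeline to the three least-loaded nodes that still have capacity. This is a sketch of the placement idea only, not the PR's actual implementation, and all names are illustrative.

```java
import java.util.Arrays;

public class GreedyPlacementDemo {
    // Greedily place factor-THREE pipelines on the least-loaded nodes,
    // respecting a per-node pipeline limit. Returns pipelines placed.
    static int placePipelines(int nodeCount, int perNodeLimit) {
        int[] load = new int[nodeCount];
        int placed = 0;
        while (true) {
            // sort node indices by current load, ascending
            Integer[] idx = new Integer[nodeCount];
            for (int i = 0; i < nodeCount; i++) idx[i] = i;
            Arrays.sort(idx, (a, b) -> load[a] - load[b]);
            // stop when fewer than three nodes have spare capacity
            if (load[idx[2]] >= perNodeLimit) break;
            for (int k = 0; k < 3; k++) load[idx[k]]++;
            placed++;
        }
        return placed;
    }

    public static void main(String[] args) {
        // 5 nodes with 5 pipelines each: floor(5*5/3) = 8 pipelines,
        // matching the "after" benchmark (pipeline max: 8, actual: 8)
        System.out.println(placePipelines(5, 5)); // prints 8
    }
}
```

With 8 nodes at 5 pipelines each the same greedy fill yields 13 pipelines, matching the larger benchmark above; always picking the lowest-loaded nodes is what closes the gap between "pipeline max" and "pipeline actual".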