GitHub user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2400

@jerrypeng, I am not sure your intuition is right, but this is all still theoretical until we roll it out and see what happens in real life. We have run some simulations, and at least for our simulated load it does not appear to be any worse, and in cases where there are GPU-like resources it is a lot better.

Solving fragmentation should mostly be about matching the ratio of resources in a request to the ratio of resources left on a node, which is a lot of what your initial RAS paper was about. The problem is that when sorting the nodes we prioritize proximity to other parts of the same topology over fixing fragmentation, so fragmentation really only matters when scheduling the first executor on a node. Because of this I don't think the size of the topology matters as much as the size of the individual components.

To really solve this we need a way to balance the desire to co-locate components against how well a request fills what is left on the node. I am hopeful that when we finish https://issues.apache.org/jira/browse/STORM-2684 the scheduler will group the parts of a topology that give it the biggest win into a single "super component", and then if we need to we can look at having a config that controls when to switch from sorting for co-location to sorting to reduce fragmentation, i.e. when do we move on to the next node in the rack, even if this one is not full, because we are starting to run low on resource X.

If you have suggestions or want to collaborate on it that would be great, but you know how hard it is for us to get legal approval to share much more than just code. So for now we want to get this feature rolled out and then monitor it to see how it goes and whether we need to adjust anything.

@govind-menon, do you have some of the simulation results in a human-readable format we can share? Also, if you have the code we used to run the simulation, putting an Apache license on it and posting it would be great so that others can reproduce what we have done.
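
As an illustration of the sorting trade-off described above, here is a minimal Python sketch. It is not Storm's scheduler code: the `Node` class, the function names, the cosine-similarity scoring, and the `low_fraction` threshold are all assumptions made up for this example. It prefers nodes that already host the topology, scores the rest by how closely the request's resource ratio matches what is left on the node, and stops preferring a co-located node once any of its resources drops below the threshold.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    total: dict        # resource name -> total capacity of the node
    available: dict    # resource name -> amount still free
    hosts_topology: bool = False  # already runs part of this topology


def fits(request, node):
    """True if every requested resource is still available on the node."""
    return all(amount <= node.available.get(res, 0.0)
               for res, amount in request.items())


def ratio_match(request, node):
    """Cosine similarity between the request and the free resources,
    both expressed as fractions of the node's capacity (1.0 = same ratio)."""
    req, left = [], []
    for res, cap in node.total.items():
        if cap > 0:
            req.append(request.get(res, 0.0) / cap)
            left.append(node.available.get(res, 0.0) / cap)
    dot = sum(r * a for r, a in zip(req, left))
    norm = math.sqrt(sum(r * r for r in req)) * math.sqrt(sum(a * a for a in left))
    return dot / norm if norm > 0 else 0.0


def running_low(node, low_fraction):
    """True if any resource the node has is below low_fraction of capacity."""
    return any(cap > 0 and node.available.get(res, 0.0) / cap < low_fraction
               for res, cap in node.total.items())


def sort_nodes(nodes, request, low_fraction=0.1):
    """Co-located nodes first, unless they are running low on something;
    everything else is ordered by how well the request fills what is left."""
    def key(node):
        colocate = node.hosts_topology and not running_low(node, low_fraction)
        return (0 if colocate else 1, -ratio_match(request, node))
    return sorted((n for n in nodes if fits(request, n)), key=key)


capacity = {"cpu": 400, "mem": 8000}
nodes = [
    Node("A", capacity, {"cpu": 20, "mem": 6000}, hosts_topology=True),
    Node("B", capacity, {"cpu": 300, "mem": 1000}),
    Node("C", capacity, {"cpu": 100, "mem": 6000}),
]
request = {"cpu": 10, "mem": 500}

print([n.name for n in sort_nodes(nodes, request, low_fraction=0.1)])  # ['C', 'A', 'B']
print([n.name for n in sort_nodes(nodes, request, low_fraction=0.0)])  # ['A', 'C', 'B']
```

With the threshold at 0.1, node A is skipped as the first choice because it is down to 5% of its CPU, even though it already hosts the topology. With the threshold at 0.0, co-location always wins, which corresponds to the current behavior described in the comment.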
---