[ https://issues.apache.org/jira/browse/HDFS-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532779#comment-13532779 ]
Andy Isaacson commented on HDFS-4253:
-------------------------------------

bq. The bug comes later, where you always return 1 if neither Node is on the local rack. This is wrong; it violates anticommutation (see link).

But that's not what the code does. If neither Node is on the local rack, then {{aIsLocalRack == bIsLocalRack}} and we use the shuffle for a total ordering, right here:
{code}
      if (aIsLocalRack == bIsLocalRack) {
        int ai = shuffle.get(a), bi = shuffle.get(b);
        if (ai < bi) {
          return -1;
        } else if (ai > bi) {
          return 1;
        } else {
          return 0;
        }
{code}
The final {{else}} is only reached when {{bIsLocalRack && !aIsLocalRack}}. So I'm pretty sure this implementation does satisfy anticommutation.

> block replica reads get hot-spots due to NetworkTopology#pseudoSortByDistance
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-4253
>                 URL: https://issues.apache.org/jira/browse/HDFS-4253
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.0.2-alpha
>            Reporter: Andy Isaacson
>            Assignee: Andy Isaacson
>         Attachments: hdfs4253-1.txt, hdfs4253.txt
>
>
> When many nodes (10) read from the same block simultaneously, we get
> asymmetric distribution of read load. This can result in slow block reads
> when one replica is serving most of the readers and the other replicas are
> idle. The busy DN bottlenecks on its network link.
> This is especially visible with large block sizes and high replica counts (I
> reproduced the problem with {{-Ddfs.block.size=4294967296}} and replication
> 5), but the same behavior happens on a small scale with normal-sized blocks
> and replication=3.
> The root of the problem is in {{NetworkTopology#pseudoSortByDistance}} which
> explicitly does not try to spread traffic among replicas in a given rack --
> it only randomizes usage for off-rack replicas.
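
For readers who want to check the anticommutation argument from the comment above, here is a minimal, self-contained sketch. It is not the code from the attached patch. The class name {{ReplicaOrder}}, the use of {{String}} node names, and the helper {{isAntisymmetric}} are all hypothetical. The branch for "a is on the local rack, b is not" is not shown in the quoted excerpt and is assumed here to return -1. The sketch brute-forces {{sgn(compare(a, b)) == -sgn(compare(b, a))}} over every pair of nodes.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the comparator discussed in the comment;
// node names are plain strings, and "shuffle" maps each node to a
// distinct random index, as the quoted excerpt suggests.
final class ReplicaOrder {

    static Comparator<String> byRackThenShuffle(Set<String> localRack,
                                                Map<String, Integer> shuffle) {
        return (a, b) -> {
            boolean aIsLocalRack = localRack.contains(a);
            boolean bIsLocalRack = localRack.contains(b);
            if (aIsLocalRack == bIsLocalRack) {
                // Same rack class: fall back to the shuffle index.
                return Integer.compare(shuffle.get(a), shuffle.get(b));
            } else if (aIsLocalRack) {
                // Assumed branch (not in the quoted excerpt): local rack sorts first.
                return -1;
            } else {
                // Only reached when bIsLocalRack && !aIsLocalRack.
                return 1;
            }
        };
    }

    // Brute-force check of sgn(compare(a, b)) == -sgn(compare(b, a)).
    static boolean isAntisymmetric(Comparator<String> cmp, List<String> nodes) {
        for (String a : nodes) {
            for (String b : nodes) {
                int ab = Integer.signum(cmp.compare(a, b));
                int ba = Integer.signum(cmp.compare(b, a));
                if (ab != -ba) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

By contrast, a comparator that returned 1 whenever neither node is on the local rack would fail this check, which is the defect the quoted review comment was worried about.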