Max Mizikar created HDFS-15420:
----------------------------------

             Summary: approx scheduled blocks not resetting over time
                 Key: HDFS-15420
                 URL: https://issues.apache.org/jira/browse/HDFS-15420
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: block placement
    Affects Versions: 3.0.0, 2.6.0
         Environment: Our 2.6.0 environment is a 3 node cluster running cdh5.15.0. Our 3.0.0 environment is a 4 node cluster running cdh6.3.0.
            Reporter: Max Mizikar
         Attachments: Screenshot from 2020-06-18 09-29-57.png, Screenshot from 2020-06-18 09-31-15.png
We have been experiencing large amounts of scheduled blocks that never get cleared out. This prevents blocks from being placed even when there is plenty of space on the system.

Here is an example of the block growth over 24 hours on one of our systems running 2.6.0
!Screenshot from 2020-06-18 09-29-57.png!
Here is an example of the block growth over 24 hours on one of our systems running 3.0.0
!Screenshot from 2020-06-18 09-31-15.png!

https://issues.apache.org/jira/browse/HDFS-1172 appears to be the main issue we were having on 2.6.0, so the growth has decreased since upgrading to 3.0.0. However, there still appears to be a systemic growth in scheduled blocks over time, and our systems still need to restart the namenode on occasion to reset this count. I have not determined what is causing the leaked blocks in 3.0.0.

Looking into the issue, I discovered that the intention is for scheduled blocks to slowly go back down to 0 after errors cause blocks to be leaked.
{code}
  /** Increment the number of blocks scheduled. */
  void incrementBlocksScheduled(StorageType t) {
    currApproxBlocksScheduled.add(t, 1);
  }

  /** Decrement the number of blocks scheduled. */
  void decrementBlocksScheduled(StorageType t) {
    if (prevApproxBlocksScheduled.get(t) > 0) {
      prevApproxBlocksScheduled.subtract(t, 1);
    } else if (currApproxBlocksScheduled.get(t) > 0) {
      currApproxBlocksScheduled.subtract(t, 1);
    }
    // its ok if both counters are zero.
  }

  /** Adjusts curr and prev number of blocks scheduled every few minutes. */
  private void rollBlocksScheduled(long now) {
    if (now - lastBlocksScheduledRollTime > BLOCKS_SCHEDULED_ROLL_INTERVAL) {
      prevApproxBlocksScheduled.set(currApproxBlocksScheduled);
      currApproxBlocksScheduled.reset();
      lastBlocksScheduledRollTime = now;
    }
  }
{code}
However, this code does not do what is intended if the system has a constant flow of written blocks. If leaked blocks make it into prevApproxBlocksScheduled, the next scheduled block increments currApproxBlocksScheduled, and when it completes, it decrements prevApproxBlocksScheduled, preventing the leaked block from being removed from the approx count. So, for errors to be corrected, we have to write no data at all for the 10 minute roll period. The number of blocks we write per 10 minutes is quite high, which allows the error on the approx counts to grow to very large numbers. (A simplified sketch of this behavior is included at the end of this description.)

The comments in the ticket for the original implementation, https://issues.apache.org/jira/browse/HADOOP-3707, suggest this issue was known. However, it's not clear to me if the severity of it was known at the time.
> So if there are some blocks that are not reported back by the datanode, they
> will eventually get adjusted (usually 10 min; bit longer if datanode is
> continuously receiving blocks).
The comment suggests the count will eventually get cleared out, but in our case, it never gets cleared out.
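To make the failure mode concrete, here is a minimal, standalone sketch of the roll behavior quoted above. This is not the actual DatanodeDescriptor code: it assumes a single storage type, collapses the per-StorageType counters into plain longs, and the class and method names are made up for illustration. It assumes one scheduled block is leaked and that every subsequent roll interval sees steady write traffic that always completes.
{code}
// Standalone simulation of the curr/prev approx-scheduled counters.
public class ScheduledBlocksLeakSketch {
  private long curr = 0; // stands in for currApproxBlocksScheduled
  private long prev = 0; // stands in for prevApproxBlocksScheduled

  void incrementBlocksScheduled() {
    curr++;
  }

  void decrementBlocksScheduled() {
    // Same preference order as the real code: drain prev before curr.
    if (prev > 0) {
      prev--;
    } else if (curr > 0) {
      curr--;
    }
  }

  void roll() {
    // Mirrors rollBlocksScheduled(): prev <- curr, curr <- 0.
    prev = curr;
    curr = 0;
  }

  public static void main(String[] args) {
    ScheduledBlocksLeakSketch dn = new ScheduledBlocksLeakSketch();

    // One block is scheduled but its completion is never reported (the leak).
    dn.incrementBlocksScheduled();
    dn.roll(); // the leaked unit now sits in prev

    // Constant write traffic: every interval N blocks are scheduled and all N
    // complete, but each completion drains prev first, so the leaked unit is
    // pushed back into curr and survives every roll.
    for (int interval = 1; interval <= 6; interval++) {
      int n = 100;
      for (int i = 0; i < n; i++) {
        dn.incrementBlocksScheduled();
      }
      for (int i = 0; i < n; i++) {
        dn.decrementBlocksScheduled();
      }
      dn.roll();
      System.out.println("interval " + interval + ": curr=" + dn.curr
          + " prev=" + dn.prev + " approx=" + (dn.curr + dn.prev));
    }
    // Prints approx=1 for every interval; the count only reaches 0 if an
    // entire roll interval passes with no blocks scheduled at all.
  }
}
{code}
Running this prints curr=0 prev=1 approx=1 after every interval: the leaked unit just migrates between prev and curr at each roll and only drains if a full roll interval passes with no new blocks scheduled, which matches the behavior described above.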