[ https://issues.apache.org/jira/browse/HDFS-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654706#comment-14654706 ]
Allen Wittenauer commented on HDFS-8849:
----------------------------------------

bq. If a DN then goes offline that was containing some TeraSort output, then blocks appear missing and users get concerned because they see missing blocks on the NN web UI and via dfsadmin -report/fsck, but it's not obvious that those blocks were in fact set to replication factor 1.

You don't have to explain this issue to me because, see, I've actually supported the same Hadoop systems long term, not just visited every so often. In fact, let me explain something to you: try this experiment with min repl=2 and see what happens. But I'll save you some time: if you think min repl=1 is confusing, min repl=2 is worse. Before the min repl message was added, fsck gave you exactly *zero* direct information. You ended up having to do a lot of funky math and count messages in the fsck missing block output to figure out what was going on, because the summary says all blocks are accounted for and "healthy" but the NN won't come out of safemode. Unless you know the NN is waiting for these blocks to appear, it's pure panic.

Because we actually hit this issue on machines we run and support, I filed the JIRA to get the missing min repl block count message added. So I'm *intimately* familiar with the issue. It's not 3rd hand from a 2nd tier support person or from a random JIRA issue. It's not theoretical.

The message that fsck pumps out (at least in trunk; I don't follow branch-2) gives *exactly* the information an ops person needs: that X blocks are below the minimal replication number, whether it be 1, 2, or 10. They can take that information and know how many blocks they are on the hunt for, and if fsck reports healthy, they know they can force the NN out of safemode and let it do the replication itself.

...
and let's be clear: the vast majority of people who are running fsck are operations people, and they are almost certainly doing it either as part of their maintenance or when stuff breaks. Ignoring the "2 people in a garage" scenario, the vast majority of users are completely ignorant about fsck. They are almost certainly unaware that the tool exists and go running to the ops team if Hadoop is down.

bq. Separately, using phrases like "Meanwhile, back in real life" and calling a proposed improvement a "useless feature" is not an appropriate way to communicate in this forum.

I'm sticking with those comments unless you can give an example that isn't teragen. My real-world experience (not in a lab, but talking with users and operations folks on a regular basis) says a purposefully set repl=1 that isn't teragen is almost always about avoiding quota. teragen has *always* been a bad actor on the system and we're MUCH better off setting the default min repl to 2. Yes, this will likely break QA and single node test systems. We *seriously* need to get past this idea that we expect production people to change our idiotic defaults because it's inconvenient for builds that will only be up for a few hours.

> fsck should report number of missing blocks with replication factor 1
> ---------------------------------------------------------------------
>
>                 Key: HDFS-8849
>                 URL: https://issues.apache.org/jira/browse/HDFS-8849
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: tools
>    Affects Versions: 2.7.1
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>            Priority: Minor
>
> HDFS-7165 supports reporting number of blocks with replication factor 1 in
> {{dfsadmin}} and NN metrics. But it didn't extend {{fsck}} with the same
> support, which is the aim of this JIRA.