[ 
https://issues.apache.org/jira/browse/HDFS-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654706#comment-14654706
 ] 

Allen Wittenauer commented on HDFS-8849:
----------------------------------------

bq.  If a DN then goes offline that was containing some TeraSort output, then 
blocks appear missing and users get concerned because they see missing blocks 
on the NN web UI and via dfsadmin -report/fsck, but it's not obvious that those 
blocks were in fact set to replication factor 1.

You don't have to explain this issue to me because, see, I've actually 
supported the same Hadoop systems long term rather than just visiting 
every so often.  In fact, let me explain something to you: try this experiment 
with min repl=2 and see what happens.

But I'll save you some time: if you think min repl=1 is confusing, min repl=2 
is worse.  Without the min repl message, fsck gives you exactly *zero* direct 
information. You end up having to do a lot of funky math and count messages 
in the fsck missing block output to figure out what is going on, because the 
summary says all blocks are accounted for and "healthy" but the NN won't come 
out of safemode.  Unless you know the NN is waiting for these blocks to appear, 
it's pure panic.
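
To make the min repl=2 case concrete, here is a minimal sketch of the 
arithmetic involved. It assumes the NN counts a block as "safe" only when it 
has at least min repl live replicas, and leaves safemode once the safe fraction 
reaches a threshold. The function names, the sample cluster, and the 0.999 
threshold are illustrative assumptions, not NN code.

```python
def safe_block_fraction(replica_counts, min_repl):
    """Fraction of blocks that have at least min_repl live replicas."""
    if not replica_counts:
        return 1.0
    safe = sum(1 for n in replica_counts if n >= min_repl)
    return safe / len(replica_counts)


def can_leave_safemode(replica_counts, min_repl, threshold=0.999):
    """True when enough blocks meet min_repl (threshold is an assumed value)."""
    return safe_block_fraction(replica_counts, min_repl) >= threshold


def blocks_below_min_repl(replica_counts, min_repl):
    """The number an ops person needs: blocks still short of min_repl."""
    return sum(1 for n in replica_counts if n < min_repl)


# Hypothetical cluster: 1000 blocks, 10 of which have one live replica.
replicas = [3] * 990 + [1] * 10

# No block is missing, so a summary keyed on "missing" looks healthy.
print(blocks_below_min_repl(replicas, 1))   # 0
print(can_leave_safemode(replicas, 1))      # True

# With min repl=2 the same cluster stays in safemode: 990/1000 < 0.999.
print(blocks_below_min_repl(replicas, 2))   # 10
print(can_leave_safemode(replicas, 2))      # False
```

Every block in this example has at least one replica, so nothing shows up as 
missing, yet with min repl=2 the safe fraction never reaches the threshold. 
The count of blocks below min repl is the one number that explains why.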

Now, because we actually hit this issue on machines we run and support, I 
filed the JIRA to get the message reporting the count of blocks below min repl 
added.  So I'm *intimately* familiar with the issue.  It's not third hand from 
a second-tier support person or from a random JIRA issue. It's not theoretical. 
The message that fsck pumps out (at least in trunk; I don't follow branch-2) 
gives *exactly* the information an ops person needs to know that X blocks are 
below that minimal replication number, whether it be 1, 2, or 10.  They can 
take that information and know how many blocks they are on the hunt for, and 
if fsck reports healthy, they know they can force the NN out of safemode and 
let it do the replication itself.

... and let's be clear: the vast majority of people who are running fsck are 
operations people, and they are almost certainly doing it either as part of 
their maintenance or when stuff breaks.  Ignoring the "2 people in a garage" 
scenario, most users know nothing about fsck.  They are almost certainly 
unaware that the tool exists and go running to the ops team if Hadoop is down.

bq.  Separately, using phrases like "Meanwhile, back in real life" and calling 
a proposed improvement a "useless feature" is not an appropriate way to 
communicate in this forum.

I'm sticking with those comments unless you can give an example that isn't 
teragen.  Because my real-world, not-in-a-lab experience of talking with users 
and operations folks on a regular basis says a purposefully set repl=1 that 
isn't teragen is almost always about avoiding quota.  teragen has *always* 
been a bad actor on the system and we're MUCH better off setting the default 
min repl to 2. Yes, this will likely break QA and single node test systems.  We 
*seriously* need to get past this idea that we expect production people to 
change our idiotic defaults just because better ones are inconvenient for 
builds that will only be up for a few hours.

> fsck should report number of missing blocks with replication factor 1
> ---------------------------------------------------------------------
>
>                 Key: HDFS-8849
>                 URL: https://issues.apache.org/jira/browse/HDFS-8849
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: tools
>    Affects Versions: 2.7.1
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>            Priority: Minor
>
> HDFS-7165 supports reporting number of blocks with replication factor 1 in 
> {{dfsadmin}} and NN metrics. But it didn't extend {{fsck}} with the same 
> support, which is the aim of this JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
