Hello Henning,

If the file reported as corrupt was actively open for write by another process
(e.g., HBase) at the time you ran fsck, then it's possible that you're seeing
the effects of bug HDFS-8809.  This bug caused fsck to report the final
under-construction block of an open file as corrupt.  That condition is normal
and expected, so it's incorrect for fsck to report it as corruption.  HDFS-8809
has a fix committed for Apache Hadoop 2.8.0.

https://issues.apache.org/jira/browse/HDFS-8809
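If you want to double-check, fsck's -openforwrite option includes files that
are currently open for write in the report and marks them as such, which would
help confirm whether the file was still being written at the time.  A possible
invocation (the path is just the table directory from your later output; adjust
as needed):

hdfs fsck /hbase/data/default/tt_items -files -blocks -locations -openforwrite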

--Chris Nauroth

From: Henning Blohm <henning.bl...@zfabrik.de>
Date: Tuesday, May 17, 2016 at 8:02 AM
Cc: "mirko.kaempf" <mirko.kae...@gmail.com>, "user@hadoop.apache.org"
<user@hadoop.apache.org>
Subject: Re: AW: Curious: Corrupted HDFS self-healing?

Hi Mirko,

thanks for commenting!

Right, no replication, no healing. My specific problem is that

a) it went corrupt although there was no conceivable cause (no node crash, no
out-of-memory, ...), and
b) it healed itself after having been reported as corrupt.

It is mostly b) that I find irritating.

It is as if the data node forgot about a block that it had no problem finding
again later (after a restart). And all the while, by the way, HBase reports
that everything is fine (both before and after HDFS was reported as corrupt).
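
(A useful test, had it been available to us, would have been to force the
datanode to re-send its block report without a full restart. If I recall
correctly, Hadoop 2.7+ offers this via dfsadmin, e.g. assuming the default
DataNode IPC port 50020:

hdfs dfsadmin -triggerBlockReport 127.0.0.1:50020

We are still on 2.6.0, though, so I could not try it.)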

Henning

On 17.05.2016 16:45, mirko.kaempf wrote:
Hello Henning,
since you reduced the replication level to 1 in your one-node cluster, you do
not have any redundancy and thus you lose the self-healing capabilities of
HDFS.
Try to work with at least 3 worker nodes, which gives you 3-fold replication.
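Once at least 3 DataNodes are available, set dfs.replication back to 3 in
hdfs-site.xml for newly written files, and bring already written files up to
3 replicas with something like (the path is just an example):

hdfs dfs -setrep -w 3 /hbase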
Cheers, Mirko




Sent from Samsung Mobile


-------- Original Message --------
From: Henning Blohm <henning.bl...@zfabrik.de>
Date: 17.05.2016 16:24 (GMT+01:00)
To: user@hadoop.apache.org
Cc:
Subject: Curious: Corrupted HDFS self-healing?

Hi all,

after about 20 hours of loading data into HBase (v1.0 on Hadoop 2.6.0) on a
single node, I noticed that Hadoop reported a corrupt file system. It says:

Status: CORRUPT
   CORRUPT FILES:    1
   CORRUPT BLOCKS:     1
The filesystem under path '/' is CORRUPT


and checking the details it says:

---
FSCK started by hb (auth:SIMPLE) from /127.0.0.1 for path
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
at Tue May 17 15:54:03 CEST 2016
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
2740218577 bytes, 11 block(s):
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38:
CORRUPT blockpool BP-130837870-192.168.178.29-1462900512452 block
blk_1073746166
  MISSING 1 blocks of total size 268435456 B
0. BP-130837870-192.168.178.29-1462900512452:blk_1073746164_5344
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
1. BP-130837870-192.168.178.29-1462900512452:blk_1073746165_5345
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
2. BP-130837870-192.168.178.29-1462900512452:blk_1073746166_5346
len=268435456 MISSING!
3. BP-130837870-192.168.178.29-1462900512452:blk_1073746167_5347
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
4. BP-130837870-192.168.178.29-1462900512452:blk_1073746168_5348
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
5. BP-130837870-192.168.178.29-1462900512452:blk_1073746169_5349
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
6. BP-130837870-192.168.178.29-1462900512452:blk_1073746170_5350
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
7. BP-130837870-192.168.178.29-1462900512452:blk_1073746171_5351
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
8. BP-130837870-192.168.178.29-1462900512452:blk_1073746172_5352
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
9. BP-130837870-192.168.178.29-1462900512452:blk_1073746173_5353
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
10. BP-130837870-192.168.178.29-1462900512452:blk_1073746174_5354
len=55864017 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
---

(Note entry 2 above, the block reported as MISSING.)

I did not try to repair anything using fsck. Instead, restarting the node made
the problem go away:

---
FSCK started by hb (auth:SIMPLE) from /127.0.0.1 for path
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
at Tue May 17 16:10:52 CEST 2016
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
2740218577 bytes, 11 block(s):  OK
0. BP-130837870-192.168.178.29-1462900512452:blk_1073746164_5344
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
1. BP-130837870-192.168.178.29-1462900512452:blk_1073746165_5345
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
2. BP-130837870-192.168.178.29-1462900512452:blk_1073746166_5346
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
3. BP-130837870-192.168.178.29-1462900512452:blk_1073746167_5347
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
4. BP-130837870-192.168.178.29-1462900512452:blk_1073746168_5348
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
5. BP-130837870-192.168.178.29-1462900512452:blk_1073746169_5349
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
6. BP-130837870-192.168.178.29-1462900512452:blk_1073746170_5350
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
7. BP-130837870-192.168.178.29-1462900512452:blk_1073746171_5351
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
8. BP-130837870-192.168.178.29-1462900512452:blk_1073746172_5352
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
9. BP-130837870-192.168.178.29-1462900512452:blk_1073746173_5353
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
10. BP-130837870-192.168.178.29-1462900512452:blk_1073746174_5354
len=55864017 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]

Status: HEALTHY
---

I guess that means the datanode has now reported the missing block.
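
(As far as I understand, a DataNode sends a full block report to the NameNode
only periodically, governed by dfs.blockreport.intervalMsec, which defaults to
21600000 ms, i.e. six hours, while a restart forces a fresh report right away.
I have not changed that setting; shown below is just the stock default, for
illustration.)

     <property>
         <name>dfs.blockreport.intervalMsec</name>
         <value>21600000</value>
     </property>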

How is that possible? Is that acceptable, expected behavior?

Is there anything I can do to prevent this sort of problem?

Here is my HDFS config (substitute ${nosql.home} with the installation
folder and ${nosql.master} with localhost):

Any clarification would be great!

Thanks!
Henning

---
<configuration>

     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>

     <property>
         <name>dfs.namenode.name.dir</name>
         <value>file://${nosql.home}/data/name</value>
     </property>

     <property>
         <name>dfs.datanode.data.dir</name>
         <value>file://${nosql.home}/data/data</value>
     </property>


     <property>
         <name>dfs.datanode.max.transfer.threads</name>
         <value>4096</value>
     </property>

     <property>
         <name>dfs.support.append</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.datanode.synconclose</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.datanode.sync.behind.writes</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.namenode.avoid.read.stale.datanode</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.namenode.avoid.write.stale.datanode</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.namenode.stale.datanode.interval</name>
         <value>3000</value>
     </property>

     <!--
       <property>
         <name>dfs.client.read.shortcircuit</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.domain.socket.path</name>
         <value>/var/lib/seritrack/dn_socket</value>
     </property>

     <property>
         <name>dfs.client.read.shortcircuit.buffer.size</name>
         <value>131072</value>
     </property>
     -->

     <property>
         <name>dfs.block.size</name>
         <value>268435456</value>
     </property>

     <property>
         <name>ipc.server.tcpnodelay</name>
         <value>true</value>
     </property>

     <property>
         <name>ipc.client.tcpnodelay</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.datanode.max.xcievers</name>
         <value>4096</value>
     </property>

     <property>
         <name>dfs.namenode.handler.count</name>
         <value>64</value>
     </property>

     <property>
         <name>dfs.datanode.handler.count</name>
         <value>8</value>
     </property>

</configuration>
---
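
(In case it is relevant: the value a key resolves to from the local config
files can be checked with hdfs getconf, for example

hdfs getconf -confKey dfs.replication

which should print 1 for the configuration above.)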


