Hello Henning,
since you reduced the replication level to 1 in your one-node cluster,
you do not have any redundancy and thus you lose the self-healing
capabilities of HDFS.
Try to work with at least 3 worker nodes, which gives you 3-fold
replication.
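A minimal sketch of what that looks like in hdfs-site.xml (the value is
illustrative):
---
<!-- restore the default 3-fold replication for newly written files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
---
Note that files written while dfs.replication was 1 keep their old
replication factor; once enough datanodes are available you can raise it
after the fact, e.g.:
---
# re-replicate everything under /hbase to 3 copies; -w waits until done
hdfs dfs -setrep -w 3 /hbase
---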
Cheers, Mirko
Sent from Samsung Mobile
-------- Original Message --------
From: Henning Blohm <henning.bl...@zfabrik.de>
Date: 17.05.2016 16:24 (GMT+01:00)
To: user@hadoop.apache.org
Cc:
Subject: Curious: Corrupted HDFS self-healing?
Hi all,
after some 20 hours of loading data into HBase (v1.0 on Hadoop 2.6.0) on a
single node, I noticed that Hadoop reported a corrupt file system. It says:
Status: CORRUPT
CORRUPT FILES: 1
CORRUPT BLOCKS: 1
The filesystem under path '/' is CORRUPT
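For reference, the summary above comes from a plain fsck of '/', and a
detailed per-block report like the one below can be produced with something
along these lines (a guess at the invocations, not necessarily the exact
commands used):
---
hdfs fsck /
hdfs fsck /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38 -files -blocks -locations
---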
and checking the details it says:
---
FSCK started by hb (auth:SIMPLE) from /127.0.0.1 for path
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
at Tue May 17 15:54:03 CEST 2016
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
2740218577 bytes, 11 block(s):
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38:
CORRUPT blockpool BP-130837870-192.168.178.29-1462900512452 block
blk_1073746166
MISSING 1 blocks of total size 268435456 B
0. BP-130837870-192.168.178.29-1462900512452:blk_1073746164_5344
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
1. BP-130837870-192.168.178.29-1462900512452:blk_1073746165_5345
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
2. BP-130837870-192.168.178.29-1462900512452:blk_1073746166_5346
len=268435456 MISSING!
3. BP-130837870-192.168.178.29-1462900512452:blk_1073746167_5347
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
4. BP-130837870-192.168.178.29-1462900512452:blk_1073746168_5348
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
5. BP-130837870-192.168.178.29-1462900512452:blk_1073746169_5349
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
6. BP-130837870-192.168.178.29-1462900512452:blk_1073746170_5350
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
7. BP-130837870-192.168.178.29-1462900512452:blk_1073746171_5351
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
8. BP-130837870-192.168.178.29-1462900512452:blk_1073746172_5352
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
9. BP-130837870-192.168.178.29-1462900512452:blk_1073746173_5353
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
10. BP-130837870-192.168.178.29-1462900512452:blk_1073746174_5354
len=55864017 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
---
(Note block 2 above: it is the missing one.)
I did not try to repair using fsck. Instead I restarted the node, and the
problem went away.
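For completeness, "restarting the node" here means bouncing the HDFS
daemons, roughly as follows (a sketch assuming a stock Hadoop 2.6.0 layout
under $HADOOP_HOME; adapt to your setup):
---
# hypothetical restart sequence for a single-node setup
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh stop namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
---
After the restart, fsck reported: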
---
FSCK started by hb (auth:SIMPLE) from /127.0.0.1 for path
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
at Tue May 17 16:10:52 CEST 2016
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
2740218577 bytes, 11 block(s): OK
0. BP-130837870-192.168.178.29-1462900512452:blk_1073746164_5344
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
1. BP-130837870-192.168.178.29-1462900512452:blk_1073746165_5345
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
2. BP-130837870-192.168.178.29-1462900512452:blk_1073746166_5346
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
3. BP-130837870-192.168.178.29-1462900512452:blk_1073746167_5347
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
4. BP-130837870-192.168.178.29-1462900512452:blk_1073746168_5348
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
5. BP-130837870-192.168.178.29-1462900512452:blk_1073746169_5349
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
6. BP-130837870-192.168.178.29-1462900512452:blk_1073746170_5350
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
7. BP-130837870-192.168.178.29-1462900512452:blk_1073746171_5351
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
8. BP-130837870-192.168.178.29-1462900512452:blk_1073746172_5352
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
9. BP-130837870-192.168.178.29-1462900512452:blk_1073746173_5353
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
10. BP-130837870-192.168.178.29-1462900512452:blk_1073746174_5354
len=55864017 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
Status: HEALTHY
---
I guess that means that the datanode only reported the missing block after
the restart. How is that possible? Is that acceptable, expected behavior?
Is there anything I can do to prevent this sort of problem?
Here is my HDFS config (substitute ${nosql.home} with the installation
folder and ${nosql.master} with localhost):
Any clarification would be great!
Thanks!
Henning
---
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file://${nosql.home}/data/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file://${nosql.home}/data/data</value>
  </property>
  <property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>4096</value>
  </property>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.datanode.synconclose</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.datanode.sync.behind.writes</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.avoid.read.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.avoid.write.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.stale.datanode.interval</name>
    <value>3000</value>
  </property>
  <!--
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/seritrack/dn_socket</value>
  </property>
  <property>
    <name>dfs.client.read.shortcircuit.buffer.size</name>
    <value>131072</value>
  </property>
  -->
  <property>
    <name>dfs.block.size</name>
    <value>268435456</value>
  </property>
  <property>
    <name>ipc.server.tcpnodelay</name>
    <value>true</value>
  </property>
  <property>
    <name>ipc.client.tcpnodelay</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>64</value>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>8</value>
  </property>
</configuration>
---