Hi,

We recently encountered a critical bug in HDFS that can prevent HBase
from starting normally.
The scenario is as follows (a minimal sketch of the write pattern in
steps 1-2 follows the list):
1.  rs1 writes data to the HDFS file f1, and the first block is written
successfully
2.  rs1 successfully asks the NameNode to allocate the second block; at
this moment nn1 (the active NameNode) crashes due to a journal write
timeout
3. nn2 (the standby NameNode) fails to become active because zkfc2 is
in an abnormal state
4. nn1 is restarted and becomes active again
5. While nn1 is restarting, rs1 crashes because it writes to nn1 while
nn1 is still in safe mode
6. As a result, f1 is left in an abnormal state and the HBase cluster
can no longer serve requests
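
For reference, the write pattern in steps 1-2 is just an ordinary
multi-block write. A minimal sketch (the path and sizes here are made
up for illustration, and we assume the default 128 MB block size; rs1
really writes HBase WALs/HFiles):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MultiBlockWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical path, for illustration only
        FSDataOutputStream out = fs.create(new Path("/tmp/f1"));
        byte[] mb = new byte[1024 * 1024];
        // Writing past the 128 MB boundary forces the client to ask
        // the NameNode to allocate a second block (step 2 above)
        for (int i = 0; i < 129; i++) {
            out.write(mb);
        }
        // If the writer dies before close(), the last block stays
        // under construction and the writer's lease remains held
        out.close();
    }
}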

We can list the file with the command-line shell; note that the
reported length is exactly 134217728 bytes, i.e. one full 128 MB
block, so only the first (completed) block is reflected:

-rw-------   3 hbase_srv supergroup  134217728 2014-09-05 11:32
/hbase/lgsrv-push/xxx
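
From the API side one could also confirm whether the NameNode still
considers the file open for write, i.e. whether its last block is
under construction. A sketch (this needs Hadoop 2.1+ client
libraries, where DistributedFileSystem#isFileClosed was added; we
have not run this against the stuck file yet):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class IsFileClosed {
    public static void main(String[] args) throws Exception {
        Path f = new Path("/hbase/lgsrv-push/xxx");
        DistributedFileSystem dfs = (DistributedFileSystem)
                f.getFileSystem(new Configuration());
        // false means the file is still open for write on the
        // NameNode, i.e. the last block is under construction
        System.out.println("closed = " + dfs.isFileClosed(f));
    }
}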

But when we try to download the file from HDFS, the DFS client complains:

14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not
available. Datanodes might not have reported blocks completely. Will
retry for 3 times
14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not
available. Datanodes might not have reported blocks completely. Will
retry for 2 times
14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not
available. Datanodes might not have reported blocks completely. Will
retry for 1 times
get: Could not obtain the last block locations.
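
One workaround we are considering (a sketch only, not yet verified to
fix this exact state) is to force lease recovery on the file through
the DistributedFileSystem API, so that the NameNode finalizes the
under-construction last block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ForceLeaseRecovery {
    public static void main(String[] args) throws Exception {
        Path stuck = new Path("/hbase/lgsrv-push/xxx");
        DistributedFileSystem dfs = (DistributedFileSystem)
                stuck.getFileSystem(new Configuration());
        // Asks the NameNode to start lease recovery; returns true
        // once the file has been closed and is readable again
        boolean closed = dfs.recoverLease(stuck);
        System.out.println("recoverLease returned " + closed);
    }
}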

Can anyone help with this?

-- 
Best Wishes!

Yours, Zesheng
