Hi,

Recently we encountered a critical bug in HDFS which can prevent HBase from starting normally. The scenario is as follows:

1. rs1 writes data to HDFS file f1, and the first block is written successfully.
2. rs1 successfully applies to allocate the second block; at this moment, nn1 (the active NN) crashes due to a journal write timeout.
3. nn2 (the standby NN) cannot become active because zkfc2 is in an abnormal state.
4. nn1 is restarted and becomes active again.
5. While nn1 is restarting, rs1 crashes because it is writing to a NameNode (nn1) that is still in safe mode.
6. As a result, the file f1 is left in an abnormal state and the HBase cluster can no longer serve requests.
We can list the file with the command line shell, and it looks normal:

-rw------- 3 hbase_srv supergroup 134217728 2014-09-05 11:32 /hbase/lgsrv-push/xxx

But when we try to download the file from HDFS, the DFS client complains:

14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times
14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times
14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times
get: Could not obtain the last block locations.

Can anyone help with this?

--
Best Wishes!
Yours, Zesheng
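For anyone investigating a similar state, one way to inspect the file's block and replica status is `hdfs fsck` with the open-files options. This is a sketch only; the path below is the example path from above, and the exact output depends on your Hadoop version and cluster state:

```shell
# Check block/replica state of the affected file, including files
# open for write (the abnormal last block typically shows up here).
hdfs fsck /hbase/lgsrv-push/xxx -files -blocks -locations -openforwrite

# If the file is stuck open because the writer (rs1) died, one can
# attempt to force lease recovery on it (available in newer releases):
hdfs debug recoverLease -path /hbase/lgsrv-push/xxx
```

If fsck reports the last block with no live replicas or shows the file as OPENFORWRITE, that would confirm the last block was never finalized after the NameNode crash.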