That’s great.

Regards,
Yi Liu
From: Zesheng Wu [mailto:wuzeshen...@gmail.com]
Sent: Wednesday, September 10, 2014 8:25 PM
To: user@hadoop.apache.org
Subject: Re: HDFS: Couldn't obtain the locations of the last block

Hi Yi,

I went through HDFS-4516, and it really solves our problem. Thanks very much!

2014-09-10 16:39 GMT+08:00 Zesheng Wu <wuzeshen...@gmail.com>:

Thanks Yi, I will look into HDFS-4516.

2014-09-10 15:03 GMT+08:00 Liu, Yi A <yi.a....@intel.com>:

Hi Zesheng,

I learned from your offline email that your Hadoop version is 2.0.0-alpha, and that “the block is allocated successfully in NN, but isn’t created in DN”. Yes, 2.0.0-alpha can have this issue, and I suspect it is similar to HDFS-4516. Can you try Hadoop 2.4 or later? You should not be able to reproduce it on those versions.

From your description, the second block was allocated successfully: the NN flushed the edit-log entry to the shared journal, and the shared storage may have persisted it, but the RPC acknowledgment back to the NN timed out. So the block exists in the shared edit log, yet no DN ever created it. On restart, the client can fail because, in that Hadoop version, the client retries only when the NN reports a non-zero size for the last block, i.e. when the block was synced (see HDFS-4516 for details).

Regards,
Yi Liu

From: Zesheng Wu [mailto:wuzeshen...@gmail.com]
Sent: Tuesday, September 09, 2014 6:16 PM
To: user@hadoop.apache.org
Subject: HDFS: Couldn't obtain the locations of the last block

Hi,

These days we encountered a critical bug in HDFS that prevents HBase from starting normally. The scenario is as follows:

1. rs1 writes data to HDFS file f1, and the first block is written successfully.
2. rs1 successfully requests allocation of the second block; at this moment nn1 (the active NN) crashes because writing to the journal times out.
3. nn2 (the standby NN) cannot become active because zkfc2 is in an abnormal state.
4. nn1 is restarted and becomes active again.
5. While nn1 is restarting, rs1 crashes because it writes to a NN (nn1) that is still in safe mode.
6. As a result, file f1 is left in an abnormal state and the HBase cluster can no longer serve requests.

We can list the file with the command-line shell:

-rw------- 3 hbase_srv supergroup 134217728 2014-09-05 11:32 /hbase/lgsrv-push/xxx

But when we try to download the file from HDFS, the DFS client complains:

14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times
14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times
14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times
get: Could not obtain the last block locations.

Can anyone help on this?

--
Best Wishes!
Yours, Zesheng
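The behavior described in the thread, a bounded retry loop that only keeps waiting for DataNode block reports when the NameNode reports a non-zero last-block size, can be sketched in a few lines of Python. This is an illustrative sketch only, not Hadoop's actual code: `get_locations` and `last_block_size` are stand-ins for the real NameNode RPC and its reported metadata.

```python
import time

def fetch_last_block_locations(get_locations, last_block_size,
                               retries=3, delay_sec=4):
    """Fetch the last block's locations with a bounded retry loop.

    Illustrative sketch, not Hadoop's API: `get_locations` stands in
    for the NameNode RPC, and `last_block_size` is the size the NN
    reports for the last block (0 if no DataNode ever synced it).
    """
    # Per the explanation in the thread (see HDFS-4516): the client
    # keeps retrying only when the NN reports a non-zero last-block
    # size, i.e. the block was synced. A block that exists only in
    # the shared edit log is reported with size 0, so there is no
    # point waiting for DataNode block reports.
    attempts = retries if last_block_size > 0 else 1
    for remaining in range(attempts, 0, -1):
        locations = get_locations()
        if locations:  # DataNodes have reported the block
            return locations
        print("WARN hdfs.DFSClient: Last block locations not available. "
              "Datanodes might not have reported blocks completely. "
              "Will retry for %d times" % remaining)
        time.sleep(delay_sec)
    raise IOError("Could not obtain the last block locations.")
```

With a non-zero reported size, the loop prints the same countdown warnings seen in the log above before giving up; with a zero size (the broken state in this thread), it fails after a single attempt.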