RE: HDFS: Couldn't obtain the locations of the last block

2014-09-10 Thread Liu, Yi A
Hi Zesheng,

I learned from your offline email that your Hadoop version is 2.0.0-alpha, and 
that “The block is allocated successfully in NN, but isn’t created in DN”.
Yes, 2.0.0-alpha may have this issue. I suspect your issue is similar to 
HDFS-4516. Can you try Hadoop 2.4 or later? You should not be able to 
reproduce it on those versions.

From your description, the second block was allocated successfully: the NN 
flushed the edit log entry to the shared journal, and the shared storage may 
have persisted it, but the RPC from the shared storage back to the NN timed 
out before the NN received the acknowledgement. So the block exists in the 
shared edit log, but no DN ever created a replica of it. After the restart, 
the client can fail because, in that Hadoop version, the client retries only 
when the NN reports a non-zero size for the synced last block (see HDFS-4516 
for details).
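
As a rough illustration of that failing path (a hedged sketch only, not the 
actual DFSClient source; the LastBlock interface and its method are 
hypothetical stand-ins), the client behaves like a bounded retry loop that 
can never succeed for a block that exists only in the edit log:

    import java.io.IOException;
    import java.util.List;

    public class LastBlockRetrySketch {
        /** Hypothetical stand-in for the real LocatedBlock; not Hadoop's API. */
        interface LastBlock {
            List<String> refreshLocations() throws IOException; // re-ask the NN
        }

        static List<String> obtainLastBlockLocations(LastBlock last, int maxRetries)
                throws IOException, InterruptedException {
            for (int retriesLeft = maxRetries; retriesLeft > 0; retriesLeft--) {
                List<String> locs = last.refreshLocations();
                if (!locs.isEmpty()) {
                    return locs;
                }
                // Matches the client log in the quoted mail below: a few
                // attempts, seconds apart. For a block that exists only in
                // the shared edit log and on no DN, the refresh can never
                // succeed, so the loop always exhausts its retries.
                System.err.printf("Last block locations not available. Datanodes "
                        + "might not have reported blocks completely. "
                        + "Will retry for %d times%n", retriesLeft);
                Thread.sleep(4000);
            }
            throw new IOException("Could not obtain the last block locations.");
        }
    }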

Regards,
Yi Liu

From: Zesheng Wu [mailto:wuzeshen...@gmail.com]
Sent: Tuesday, September 09, 2014 6:16 PM
To: user@hadoop.apache.org
Subject: HDFS: Couldn't obtain the locations of the last block

Hi,

We recently encountered a critical bug in HDFS that can prevent HBase from 
starting normally.
The scenario is as follows:
1. rs1 writes data to HDFS file f1, and the first block is written successfully
2. rs1 successfully asks the NN to allocate the second block; at this point, 
nn1 (the active NN) crashes due to a journal write timeout
3. nn2 (the standby NN) cannot become active because zkfc2 is in an abnormal state
4. nn1 is restarted and becomes active
5. While nn1 is restarting, rs1 crashes because it writes to a NameNode (nn1) 
that is still in safe mode (see the sketch after this list)
6. As a result, file f1 is left in an abnormal state and the HBase cluster 
can no longer serve requests
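
A side note on step 5: a client can detect that a NameNode is still in safe 
mode and wait instead of crashing. A minimal sketch, assuming the Hadoop 2.x 
client API (DistributedFileSystem.setSafeMode with SAFEMODE_GET queries the 
state without changing it; the wait loop itself is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

    public class WaitForSafeModeExit {
        public static void main(String[] args)
                throws IOException, InterruptedException {
            // Picks up fs.defaultFS from core-site.xml on the classpath.
            DistributedFileSystem dfs =
                    (DistributedFileSystem) FileSystem.get(new Configuration());
            // SAFEMODE_GET only asks whether the NN is in safe mode.
            while (dfs.setSafeMode(SafeModeAction.SAFEMODE_GET)) {
                System.out.println("NameNode still in safe mode, waiting...");
                Thread.sleep(5000);
            }
            System.out.println("NameNode left safe mode; writes can proceed.");
        }
    }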

We can list the file with the command-line shell; the output looks like the following:

-rw---   3 hbase_srv supergroup  134217728 2014-09-05 11:32 /hbase/lgsrv-push/xxx
But when we try to download the file from HDFS, the DFS client complains:

14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. 
Datanodes might not have reported blocks completely. Will retry for 3 times

14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. 
Datanodes might not have reported blocks completely. Will retry for 2 times

14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. 
Datanodes might not have reported blocks completely. Will retry for 1 times

get: Could not obtain the last block locations.
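
The same failure reproduces from the Java client API. A minimal sketch 
(assuming the cluster configuration is on the classpath; the local output 
file name is arbitrary) that fails on the damaged last block just as 
hadoop fs -get does:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class GetBrokenFile {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            Path src = new Path("/hbase/lgsrv-push/xxx"); // the file listed above
            try (InputStream in = fs.open(src);
                 OutputStream out = Files.newOutputStream(Paths.get("xxx.local"))) {
                IOUtils.copyBytes(in, out, 4096, false);
            } catch (IOException e) {
                // Surfaces the same error as the shell:
                // "Could not obtain the last block locations."
                System.err.println("read failed: " + e.getMessage());
                throw e;
            }
        }
    }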

Can anyone help with this?
--
Best Wishes!

Yours, Zesheng


Re: HDFS: Couldn't obtain the locations of the last block

2014-09-10 Thread Zesheng Wu
Thanks Yi, I will look into HDFS-4516.


-- 
Best Wishes!

Yours, Zesheng


Re: HDFS: Couldn't obtain the locations of the last block

2014-09-10 Thread Zesheng Wu
Hi Yi,

I went through HDFS-4516, and it really solves our problem, thanks very
much!

-- 
Best Wishes!

Yours, Zesheng


RE: HDFS: Couldn't obtain the locations of the last block

2014-09-10 Thread Liu, Yi A
That’s great.

Regards,
Yi Liu
