Re: Re: Could not initialize all stores for the region

2016-03-31 Thread Zheng Shen
Yes, I mean after nodeA came back online.

I cannot find any special events happening around 19:19:20, except for some 
memstore flush and compaction events in the HMaster log.

However, I did notice one thing: the namenode on nodeB exited several times 
during nodeA's absence because it failed to communicate with enough journal 
nodes (we have 3 journal nodes, and 1 was down along with nodeA). I manually 
restarted the namenode each time it shut down, using the command-line 
interface, because the only instance of Cloudera Manager is on nodeA, which 
was still offline at that time.

2016-03-31 19:27:15,870 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=2 ms) for a response for sendEdits. Succeeded so far: [192.168.1.15:8485]
2016-03-31 19:27:16,538 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.1.15:50010 is added to blk_1075206720_1466791 size 43694020
2016-03-31 19:27:16,855 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.1.17:8485, 192.168.1.24:8485, 192.168.1.15:8485], stream=QuorumOutputStream starting at txid 15010187))
java.io.IOException: Timed out waiting 2ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        ...
2016-03-31 19:27:16,856 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 15010187
2016-03-31 19:27:16,857 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2016-03-31 19:27:16,860 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NameNode at hp1.server/192.168.1.106
/

I'm not sure whether this is related to the missing file (or its blocks)? 
(However, the NN on nodeB did successfully connect to one of the 3 journal 
nodes before it died.)
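
In case it is useful, a quick reachability check against the journal nodes could look roughly like this (8485 and 8480 are the default JournalNode RPC and HTTP ports; nc and curl are assumed to be available):

# check that each journal node from the QJM log line above answers on its RPC port
for jn in 192.168.1.17 192.168.1.24 192.168.1.15; do
  nc -z -w 5 "$jn" 8485 && echo "$jn: reachable" || echo "$jn: NOT reachable"
done

# the JournalNode JMX endpoint (default HTTP port 8480) also reports the last written txid
curl -s http://192.168.1.15:8480/jmx | grep -i txid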

Thanks,
Zheng


zhengshe...@outlook.com

Re: Re: Could not initialize all stores for the region

2016-03-31 Thread Ted Yu
Can you check the server logs on node 106 around 19:19:20 to see if there is
any more of a clue?
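
Something along these lines might help pull the relevant lines (the log locations below are the usual CDH defaults under /var/log; adjust them to your layout):

# on node 106: namenode, region server and master log lines around 19:19
grep '2016-03-31 19:19' /var/log/hadoop-hdfs/*NAMENODE*.log.out
grep '2016-03-31 19:19' /var/log/hbase/*REGIONSERVER*.log.out
grep '2016-03-31 19:19' /var/log/hbase/*MASTER*.log.out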

bq. somehow be informed of the events that happened during their absence?

Did you mean after nodeA came back online?

Cheers

Re: Re: Could not initialize all stores for the region

2016-03-31 Thread Zheng Shen
Hi Ted,

Thank you very much for your reply!

We do have multiple HMaster nodes; one of them is on the offline node (let's 
call it nodeA). Another is on a node which is always online (nodeB).

I scanned the audit log and found that, while nodeA was offline, the HDFS 
audit log on nodeB shows:

hdfs-audit.log:2016-03-31 19:19:24,158 INFO FSNamesystem.audit: allowed=true ugi=hbase (auth:SIMPLE) ip=/192.168.1.106 cmd=delete src=/hbase/archive/data/default/vocabulary/2639c4d082646bb4a4fa2d8119f9aaef/cnt/2dc367d0e1c24a3b848c68d3b171b06d dst=null perm=null proto=rpc

where (192.168.1.106) is the IP of nodeB.
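
For reference, a search along these lines over the namenode audit log should turn up that record (the audit log path below is the CDH default and may differ on other setups):

# look for deletes of the missing store file in the namenode audit log
grep 'cmd=delete' /var/log/hadoop-hdfs/hdfs-audit.log | grep 2dc367d0e1c24a3b848c68d3b171b06d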

So it looks like nodeB deleted this file while nodeA was offline. However, 
shouldn't services on nodeA (like the HMaster and namenode) somehow be 
informed of the events that happened during their absence?

Although we have only 5 nodes in this cluster, we do have HA at every level of 
the HBase service stack. So yes, there are multiple instances of every service 
wherever possible or necessary (e.g. we have 3 HMasters, 2 namenodes, 3 
journal nodes).

Thanks,
Zheng


zhengshe...@outlook.com



Re: Could not initialize all stores for the region

2016-03-31 Thread Ted Yu
bq. File does not exist: /hbase/data/default/vocabulary/2639c4d082646bb4a4fa2d8119f9aaef/cnt/2dc367d0e1c24a3b848c68d3b171b06d

Can you search the namenode audit log to see which node initiated the delete
request for the above file?
Then you can search that node's region server log for more clues.
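
For example, something like the following (the log directory below is the CDH default; adjust as needed):

# on the node that issued the delete: search the HBase logs for the store file name
grep -r 2dc367d0e1c24a3b848c68d3b171b06d /var/log/hbase/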

bq. hosts the HDFS namenode and datanode, Cloudera Manager, as well as
HBase master and region server

Can you move some daemons off this node (e.g. the HBase master)?
I assume you have a second HBase master running somewhere else; otherwise
this node becomes the weak point of the cluster.



Re: Could not initialize all stores for the region

2016-03-31 Thread Zheng Shen
By disabling the table "vocabulary" and then creating a new table, HBase has 
recovered. Now write operations (not only on the new table but also on other 
tables) can be performed without any issue.
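
For anyone hitting the same thing, the recovery was roughly the following in the HBase shell (the new table name and schema are placeholders; 'cnt' is the column family from the error messages):

# run the shell commands non-interactively; adjust names and column families to your schema
hbase shell <<'EOF'
disable 'vocabulary'
create 'vocabulary_new', 'cnt'
EOF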

But I still don't understand what the root cause is, and how HBase lost data 
(given its strong consistency feature)?

Thanks,
Zheng


zhengshe...@outlook.com



Could not initialize all stores for the region

2016-03-31 Thread Zheng Shen
Hi,

Our HBase cannot perform any write operations, while read operations are 
fine. I found the following error in the region server log:


Could not initialize all stores for the region=vocabulary,576206_6513944,1459420417369.19faeb6e4da0b1873f68da271b0f5788.

Failed open of region=vocabulary,576206_6513944,1459420417369.19faeb6e4da0b1873f68da271b0f5788., starting to roll back the global memstore size.
java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /hbase/data/default/vocabulary/2639c4d082646bb4a4fa2d8119f9aaef/cnt/2dc367d0e1c24a3b848c68d3b171b06d
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1932)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1853)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1825)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:559)
        at


Opening of region {ENCODED => 19faeb6e4da0b1873f68da271b0f5788, NAME => 'vocabulary,576206_6513944,1459420417369.19faeb6e4da0b1873f68da271b0f5788.', STARTKEY => '576206_6513944', ENDKEY => '599122_6739914'} failed, transitioning from OPENING to FAILED_OPEN in ZK, expecting version 22
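
As a side note, region-level consistency can also be checked with hbck (a minimal diagnostic sketch, run as the hbase user; it only reports inconsistencies such as regions not deployed on any server):

# report region/assignment inconsistencies (read-only check)
sudo -u hbase hbase hbck -details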


We are using Cloudera CDH 5.4.7 (HBase 1.0.0-cdh5.4.7) with HDFS HA enabled 
(one of the namenodes was running on the server that was shut down). Our HBase 
cluster experienced an unexpected node shutdown today for about 4 hours. The 
node that was shut down hosts an HDFS namenode and datanode, Cloudera Manager, 
as well as an HBase master and region server (5 nodes in total in our small 
cluster). While that node was down, besides the services running on it, the 
other HDFS namenode, the failover controller, and 2 of the 3 journal nodes 
were also down. After the node was recovered, we restarted the whole CDH 
cluster, and then it ended up like this...

The HDFS check "hdfs fsck" does not report any corrupted blocks.
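
The fsck invocation was roughly as follows (scoping it to the table directory and adding the block and location flags is optional):

# check the table's files, blocks and replica locations
sudo -u hdfs hdfs fsck /hbase/data/default/vocabulary -files -blocks -locations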

Any suggestions about where we should look for this problem?

Thanks!
Zheng


zhengshe...@outlook.com