On Thu, May 10, 2012 at 1:17 AM, Eran Kutner <e...@gigya.com> wrote:
> Here is an example of the HBase log (showing only errors):
>
> 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-8928911185099340956_5189425
> java.io.IOException: Bad response 1 for block blk_-8928911185099340956_5189425
> from datanode 10.1.104.6:50010
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2986)
>
> 2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> Exception: java.io.InterruptedIOException: Interruped while waiting for IO
> on channel java.nio.channels.SocketChannel[connected
> local=/10.1.104.9:59642 remote=/10.1.104.9:50010]. 0 millis timeout left.
>        at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2848)
>
> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8928911185099340956_5189425 bad datanode[2]
> 10.1.104.6:50010
> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8928911185099340956_5189425 in pipeline
> 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode
> 10.1.104.6:50010

Above is a complaint about a DN in a write pipeline.  Is there anything
else around the above logging?  Are you sure the write didn't go through
after the dfsclient purged the 'bad datanode'?
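If you want to check, grepping the datanode logs on each node in the
pipeline for that block id usually shows whether the recovery went
through.  A rough sketch (the log path below is an assumption; point it
at wherever your DN logs actually live):

  # run on each pipeline member: 10.1.104.9, .8 and .6
  # the path is a guess -- adjust for your install
  grep 'blk_-8928911185099340956_5189425' /var/log/hadoop/*datanode*.log*

If one of the surviving datanodes logs something like 'Received block'
for that block after the recovery, the write made it.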

A few minutes pass and then you get the below...

> 2012-05-10 03:48:30,174 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422,
> load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983):
> regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired

Says your session expired with zk.   You think there was a big GC
pause here?  You collecting GC logging?  Can you check it?
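If GC logging isn't enabled, it's cheap to turn on; a sketch for
conf/hbase-env.sh below (standard JDK flags; the log location is an
example only, put it wherever suits you) -- restart the regionserver
after:

  # usual JDK6/7 GC logging flags; gc log path is an example
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-hbase.log"

A session expiry like the above usually lines up with a long
stop-the-world pause in that log; if the pause runs longer than your zk
session timeout, the regionserver gets declared dead out from under you.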


> This is from 10.1.104.9 (same machine running the region server that
> crashed):

You probably want to look at .6 and see why it went sour.  It was
reported as the bad DN in the pipeline.

What version of hbase?  Do you have ganglia or tsdb up and running on
your cluster so you can dig in across these times of failure?

St.Ack
