Hi Eran,
Do you have dfs.datanode.socket.write.timeout set in hdfs-site.xml?
(We have set it to zero in our cluster, which means the client will wait
as long as necessary for the write to complete.)
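
For reference, this is roughly what the entry looks like in hdfs-site.xml
(a sketch of our setup, not a general recommendation; a value of 0
disables the datanode write timeout entirely):

  <property>
    <!-- 0 = no timeout: wait indefinitely for datanode writes -->
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>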

Igal.

On Thu, May 10, 2012 at 11:17 AM, Eran Kutner <e...@gigya.com> wrote:

> Hi,
> We're seeing occasional regionserver crashes during heavy write operations
> to HBase (in the reduce phase of large M/R jobs). I have increased the file
> descriptors, HDFS xceivers, and HDFS thread counts to the recommended
> settings, and in fact well above them.
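> For reference, these are the kinds of hdfs-site.xml entries involved (the
> values here are illustrative, not my exact ones; the file descriptor limit
> itself is raised via ulimit / limits.conf rather than in this file):
>
>   <property>
>     <!-- note: the property name really is misspelled in 0.20/1.x -->
>     <name>dfs.datanode.max.xcievers</name>
>     <value>4096</value>
>   </property>
>   <property>
>     <!-- number of datanode server threads -->
>     <name>dfs.datanode.handler.count</name>
>     <value>10</value>
>   </property>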
>
> Here is an example of the HBase log (showing only errors):
>
> 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception for block
> blk_-8928911185099340956_5189425
> java.io.IOException: Bad response 1 for block
> blk_-8928911185099340956_5189425 from datanode 10.1.104.6:50010
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2986)
>
> 2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> Exception: java.io.InterruptedIOException: Interruped while waiting for IO
> on channel java.nio.channels.SocketChannel[connected
> local=/10.1.104.9:59642 remote=/10.1.104.9:50010]. 0 millis timeout left.
>        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2848)
>
> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8928911185099340956_5189425 bad datanode[2]
> 10.1.104.6:50010
> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-8928911185099340956_5189425 in pipeline
> 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode
> 10.1.104.6:50010
> 2012-05-10 03:48:30,174 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422,
> load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983):
> regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> java.io.InterruptedIOException: Aborting compaction of store properties in
> region gs_users,6155551|QoCW/euBIKuMat/nRC5Xtw==,1334983658004.878522ea91f41cd76b903ea06ccd17f9.
> because user requested stop.
>        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
>        at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
>        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
>        at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
>        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
>
>
> This is from 10.1.104.9 (same machine running the region server that
> crashed):
> 2012-05-10 03:31:16,785 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-8928911185099340956_5189425 src: /10.1.104.9:59642 dest: /10.1.104.9:50010
> 2012-05-10 03:35:39,000 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Connection reset
> 2012-05-10 03:35:39,052 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> for block blk_-8928911185099340956_5189425
> java.nio.channels.ClosedByInterruptException
> 2012-05-10 03:35:39,053 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_-8928911185099340956_5189425 received exception java.io.IOException:
> Interrupted receiveBlock
> 2012-05-10 03:35:39,055 ERROR
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
> as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
> blk_-8928911185099340956_5189425 length is 24384000 does not match block
> file length 24449024
> 2012-05-10 03:35:39,055 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 3 on 50020, call
> startBlockRecovery(blk_-8928911185099340956_5189425) from 10.1.104.8:50251:
> error: java.io.IOException: Block blk_-8928911185099340956_5189425 length
> is 24384000 does not match block file length 24449024
> java.io.IOException: Block blk_-8928911185099340956_5189425 length is
> 24384000 does not match block file length 24449024
> 2012-05-10 03:35:39,077 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Broken pipe
> 2012-05-10 03:35:39,077 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Socket closed
> 2012-05-10 03:35:39,108 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Socket closed
> 2012-05-10 03:35:39,136 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Socket closed
> 2012-05-10 03:35:39,165 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Socket closed
> 2012-05-10 03:35:39,196 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Socket closed
> 2012-05-10 03:35:39,221 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Socket closed
> 2012-05-10 03:35:39,246 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Socket closed
> 2012-05-10 03:35:39,271 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Socket closed
> 2012-05-10 03:35:39,296 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
> Socket closed
>
> This is the log from 10.1.104.6 datanode for
> "blk_-8928911185099340956_5189425":
> 2012-05-10 03:31:16,772 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-8928911185099340956_5189425 src: /10.1.104.8:43828 dest: /10.1.104.6:50010
> 2012-05-10 03:35:39,041 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> for block blk_-8928911185099340956_5189425 java.net.SocketException:
> Connection reset
> 2012-05-10 03:35:39,043 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
> block blk_-8928911185099340956_5189425 Interrupted.
> 2012-05-10 03:35:39,043 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
> block blk_-8928911185099340956_5189425 terminating
> 2012-05-10 03:35:39,043 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_-8928911185099340956_5189425 received exception
> java.net.SocketException: Connection reset
>
>
> Any idea why this is happening?
>
> Thanks.
>
> -eran
>
