Hi Eran,

Do you have dfs.datanode.socket.write.timeout set in hdfs-site.xml? (We have set it to zero on our cluster, which means waiting as long as necessary for the write to complete.)
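For reference, this is roughly what the stanza looks like in our hdfs-site.xml (the value is in milliseconds; 0 disables the write timeout entirely, so a slow pipeline blocks instead of failing):

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- 0 = never time out; the writer waits as long as the pipeline needs -->
    <value>0</value>
  </property>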
Igal.

On Thu, May 10, 2012 at 11:17 AM, Eran Kutner <e...@gigya.com> wrote:
> Hi,
> We're seeing occasional regionserver crashes during heavy write operations
> to HBase (at the reduce phase of large M/R jobs). I have increased the file
> descriptors, HDFS xceivers, and HDFS threads to the recommended settings,
> and actually way above them.
>
> Here is an example of the HBase log (showing only errors):
>
> 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-8928911185099340956_5189425
> java.io.IOException: Bad response 1 for block blk_-8928911185099340956_5189425 from datanode 10.1.104.6:50010
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2986)
>
> 2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.1.104.9:59642 remote=/10.1.104.9:50010]. 0 millis timeout left.
>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2848)
>
> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 bad datanode[2] 10.1.104.6:50010
> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 in pipeline 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode 10.1.104.6:50010
> 2012-05-10 03:48:30,174 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422, load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983): regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008 received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> java.io.InterruptedIOException: Aborting compaction of store properties in region gs_users,6155551|QoCW/euBIKuMat/nRC5Xtw==,1334983658004.878522ea91f41cd76b903ea06ccd17f9. because user requested stop.
>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
>
> This is from 10.1.104.9 (the same machine running the region server that crashed):
> 2012-05-10 03:31:16,785 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-8928911185099340956_5189425 src: /10.1.104.9:59642 dest: /10.1.104.9:50010
> 2012-05-10 03:35:39,000 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Connection reset
> 2012-05-10 03:35:39,052 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-8928911185099340956_5189425 java.nio.channels.ClosedByInterruptException
> 2012-05-10 03:35:39,053 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8928911185099340956_5189425 received exception java.io.IOException: Interrupted receiveBlock
> 2012-05-10 03:35:39,055 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> 2012-05-10 03:35:39,055 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 50020, call startBlockRecovery(blk_-8928911185099340956_5189425) from 10.1.104.8:50251: error: java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> 2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Broken pipe
> 2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> 2012-05-10 03:35:39,108 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> 2012-05-10 03:35:39,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> 2012-05-10 03:35:39,165 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> 2012-05-10 03:35:39,196 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> 2012-05-10 03:35:39,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> 2012-05-10 03:35:39,246 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> 2012-05-10 03:35:39,271 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> 2012-05-10 03:35:39,296 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
>
> This is the log from 10.1.104.6 datanode for "blk_-8928911185099340956_5189425":
> 2012-05-10 03:31:16,772 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-8928911185099340956_5189425 src: /10.1.104.8:43828 dest: /10.1.104.6:50010
> 2012-05-10 03:35:39,041 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-8928911185099340956_5189425 java.net.SocketException: Connection reset
> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-8928911185099340956_5189425 Interrupted.
> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-8928911185099340956_5189425 terminating
> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8928911185099340956_5189425 received exception java.net.SocketException: Connection reset
>
> Any idea why this is happening?
>
> Thanks.
>
> -eran
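P.S. Since you mention raising the HDFS xceiver count: on 0.20/1.x that is the (historically misspelled) dfs.datanode.max.xcievers property in hdfs-site.xml. A sketch of how it's typically set for HBase clusters (4096 is the value the HBase book recommends; your number may well be higher):

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <!-- max concurrent DataXceiver threads per datanode; note the misspelled key -->
    <value>4096</value>
  </property>

The datanodes need a restart for either of these settings to take effect.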