Hi Mike,

Not sure I understand the question about the reducer. I'm using a reducer because my M/R jobs require one, and I want to write the results to HBase.

I have two tables I'm writing to: one uses the default max file size (256MB, if I remember correctly) and the other is set to 512MB. There are ~700 regions on each server.

I didn't know there was a bandwidth limit. Is it in HDFS or HBase, and how can it be configured?
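The only bandwidth setting I can think of is the HDFS balancer throttle; is that the one you mean? If so, my guess is it's set in hdfs-site.xml with something like the below. The property name and value here are from memory, so treat this as a sketch, not a verified answer:

<!-- hdfs-site.xml: caps the bandwidth each datanode may use for balancer
     traffic, in bytes per second (10485760 = 10MB/s, the default Mike
     mentioned). Property name is my assumption (Hadoop 1.x era), unverified. -->
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>
</property>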
-eran


On Thu, May 10, 2012 at 2:53 PM, Michel Segel <michael_se...@hotmail.com> wrote:

> Silly question...
> Why are you using a reducer when working w HBase?
>
> Second silly question... What is the max file size of your table that you
> are writing to?
>
> Third silly question... How many regions are on each of your region servers?
>
> Fourth silly question... There is this bandwidth setting... Default is
> 10MB... Did you modify it?
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On May 10, 2012, at 6:33 AM, Eran Kutner <e...@gigya.com> wrote:
>
> > Thanks Igal, but we already have that setting. These are the relevant
> > settings from hdfs-site.xml:
> >
> > <property>
> >   <name>dfs.datanode.max.xcievers</name>
> >   <value>65536</value>
> > </property>
> > <property>
> >   <name>dfs.datanode.handler.count</name>
> >   <value>10</value>
> > </property>
> > <property>
> >   <name>dfs.datanode.socket.write.timeout</name>
> >   <value>0</value>
> > </property>
> >
> > Other ideas?
> >
> > -eran
> >
> > On Thu, May 10, 2012 at 12:25 PM, Igal Shilman <ig...@wix.com> wrote:
> >
> >> Hi Eran,
> >> Do you have dfs.datanode.socket.write.timeout set in hdfs-site.xml?
> >> (We have set this to zero in our cluster, which means waiting as long as
> >> necessary for the write to complete.)
> >>
> >> Igal.
> >>
> >> On Thu, May 10, 2012 at 11:17 AM, Eran Kutner <e...@gigya.com> wrote:
> >>
> >>> Hi,
> >>> We're seeing occasional regionserver crashes during heavy write operations
> >>> to HBase (at the reduce phase of large M/R jobs). I have increased the file
> >>> descriptors, HDFS xceivers, and HDFS threads to the recommended settings
> >>> and actually way above.
> >>>
> >>> Here is an example of the HBase log (showing only errors):
> >>>
> >>> 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-8928911185099340956_5189425
> >>> java.io.IOException: Bad response 1 for block blk_-8928911185099340956_5189425 from datanode 10.1.104.6:50010
> >>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2986)
> >>>
> >>> 2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.1.104.9:59642 remote=/10.1.104.9:50010]. 0 millis timeout left.
> >>>     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >>>     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> >>>     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> >>>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> >>>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2848)
> >>>
> >>> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 bad datanode[2] 10.1.104.6:50010
> >>> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 in pipeline 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode 10.1.104.6:50010
> >>> 2012-05-10 03:48:30,174 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422, load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983): regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008 received expired from ZooKeeper, aborting
> >>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
> >>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
> >>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
> >>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> >>> java.io.InterruptedIOException: Aborting compaction of store properties in region gs_users,6155551|QoCW/euBIKuMat/nRC5Xtw==,1334983658004.878522ea91f41cd76b903ea06ccd17f9. because user requested stop.
> >>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
> >>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
> >>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
> >>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
> >>>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> >>>
> >>> This is from 10.1.104.9 (the same machine running the region server that crashed):
> >>>
> >>> 2012-05-10 03:31:16,785 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-8928911185099340956_5189425 src: /10.1.104.9:59642 dest: /10.1.104.9:50010
> >>> 2012-05-10 03:35:39,000 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Connection reset
> >>> 2012-05-10 03:35:39,052 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-8928911185099340956_5189425 java.nio.channels.ClosedByInterruptException
> >>> 2012-05-10 03:35:39,053 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8928911185099340956_5189425 received exception java.io.IOException: Interrupted receiveBlock
> >>> 2012-05-10 03:35:39,055 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> >>> 2012-05-10 03:35:39,055 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 50020, call startBlockRecovery(blk_-8928911185099340956_5189425) from 10.1.104.8:50251: error: java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> >>> java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> >>> 2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Broken pipe
> >>> 2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>> 2012-05-10 03:35:39,108 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>> 2012-05-10 03:35:39,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>> 2012-05-10 03:35:39,165 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>> 2012-05-10 03:35:39,196 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>> 2012-05-10 03:35:39,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>> 2012-05-10 03:35:39,246 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>> 2012-05-10 03:35:39,271 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>> 2012-05-10 03:35:39,296 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>
> >>> This is the log from the 10.1.104.6 datanode for "blk_-8928911185099340956_5189425":
> >>>
> >>> 2012-05-10 03:31:16,772 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-8928911185099340956_5189425 src: /10.1.104.8:43828 dest: /10.1.104.6:50010
> >>> 2012-05-10 03:35:39,041 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-8928911185099340956_5189425 java.net.SocketException: Connection reset
> >>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-8928911185099340956_5189425 Interrupted.
> >>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-8928911185099340956_5189425 terminating
> >>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8928911185099340956_5189425 received exception java.net.SocketException: Connection reset
> >>>
> >>> Any idea why this is happening?
> >>>
> >>> Thanks.
> >>>
> >>> -eran