So that's about 1,500 regions per region server.  It should be more like
200-250, I think, because as a practical matter the recovery time for
redeploying 1,500 regions after a hardware failure will block clients for a
long time.  It may not even be possible if all region servers are already
close to overcommitment when that large chunk of regions suddenly comes up
for reassignment.  AFAIK, Google's BigTable is tuned for about 100 regions
per server at roughly 200 MB per region.  That minimizes churn and region
unavailability should a server fail, and it also lets each node serve its
data mostly out of RAM.
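
Back of the envelope, assuming regions near the default 256 MB max size:
1,500 regions is up to ~375 GB of store data per node that would have to be
reassigned and reopened after a failure, versus roughly 20 GB per node at
BigTable's 100 x 200 MB.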

This is with the default max file size (256 MB)?  What happens if you up that
to 1 GB?  Set the hbase.hregion.max.filesize property in hbase-site.xml to
make the change globally.
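
For example, something like this in hbase-site.xml (the value is in bytes;
region servers need a restart to pick it up):

  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>1073741824</value>
  </property>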

    - Andy




________________________________
From: elsif <elsif.t...@gmail.com>
To: hbase-user@hadoop.apache.org
Sent: Wed, November 11, 2009 6:44:34 AM
Subject: Re: HBase Exceptions on version 0.20.1

On my last set of tests I used two instances of the test program, 60
threads total, generating rows containing 400 kB of data.  The highest
load appeared to be on the region serving META.

On my cluster, things started to break down at about 20,000 regions with
a heap size of 6 GB.  You can always lower the heap size to reproduce the
issues more quickly.

Thanks,
elsif

stack wrote:
> This second run has been going for > 24 hours now.  It's loaded up near 2k
> regions.  I'm using the defaults from your script, which is 30 threads and
> 400k cells.  For 400k cells I should probably be adjusting the flush and
> region sizes upward.  I'm also noticing that all load seems to land on a
> single regionserver.  Did you notice that?  Maybe I should use something
> other than java Random?  Maybe I should md5 the Random output?
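>
> Something like this is what I have in mind (just a sketch, not your code):
>
>     // Sketch only: md5 the Random output so row keys spread evenly over
>     // the keyspace instead of piling onto a single region.
>     static byte [] hashedRowKey(java.util.Random random) throws Exception {
>       byte [] raw = new byte[16];
>       random.nextBytes(raw);
>       return java.security.MessageDigest.getInstance("MD5").digest(raw);
>     }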
>
> But I just noticed that you ran with 4MB cells.  4MB cells are pretty big;
> HBase's defaults assume much smaller cells.
>
> So I killed the above test and am now trying with 4MB cells.  I see that
> we're flushing every 16 or so edits.  We're spending most of our time just
> flushing and rolling WAL logs because the cells are all so large.  Are 4MB
> cells your actual use case?  If so, I'll try to get you tunings that work
> better for cells this size.
>
> I've been watching ZooKeeper over the last 24 hours.  The average latency
> serving a zk response is about 5ms and the longest response is 120ms (I get
> these stats by sending the quorum members the 'stat' command, as in "echo
> stat|nc HOST PORT").  This time around I gave zk its own disk.  My guess is
> that when I saw the TIMED OUT yesterday, it was probably a zk client-side
> issue, e.g. a long GC pause.  I have GC logging running at the moment so I
> will be able to correlate should it happen again.
>
> As for things slowing down, on this third load I've added ganglia logging.
> I'll let you know what it shows.
>
> As to the exceptions you were seeing, here's some more on those:
>
> On the "INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block...", I
> see those in my testing.  The logs are at INFO level.  They should probably
> be WARN-level instead.  See the code below from DFSClient.  The map of nodes
> carrying block replicas can shift while DFSClient is open (Looking at one
> particular block from uploading I see that the local block was removed from
> local datanode because the block was over-replicated).  When we try to get
> the local block, we can fail.  Its logged as "No live nodes contain current
> block".  The current DFSClient map doesn't contain a node for the block.  As
> you can see in the code, we'll then ask NameNode for new map of blocks for
> this file.  As long as we get the block before cycle > failures, we'll see
> INFO-level loggings (For some reason I see 3 loggings often before we'll
> succeed and get the block).  Its a failure if DFSClient throws an IOE out of
> here.  I'm just seeing the INFO loggings in my testing.  On this second test
> I added logging to DFSClient to see if I can figure why it takes 3 goes
> often to find the right block.
>
>     private DNAddrPair chooseDataNode(LocatedBlock block)
>       throws IOException {
>       while (true) {
>         DatanodeInfo[] nodes = block.getLocations();
>         try {
>           DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
>           InetSocketAddress targetAddr =
>                             NetUtils.createSocketAddr(chosenNode.getName());
>           return new DNAddrPair(chosenNode, targetAddr);
>         } catch (IOException ie) {
>           String blockInfo = block.getBlock() + " file=" + src;
>           if (failures >= maxBlockAcquireFailures) {
>             throw new IOException("Could not obtain block: " + blockInfo);
>           }
>
>           if (nodes == null || nodes.length == 0) {
>             LOG.info("No node available for block: " + blockInfo);
>           }
>           LOG.info("Could not obtain block " + block.getBlock() + " from any
> node:  " + ie);
>           try {
>             Thread.sleep(3000);
>           } catch (InterruptedException iex) {
>           }
>           deadNodes.clear(); //2nd option is to remove only nodes[blockId]
>           openInfo();
>           failures++;
>           continue;
>         }
>       }
>     }
>
> St.Ack
>
>
> On Tue, Oct 20, 2009 at 2:12 PM, elsif <elsif.t...@gmail.com> wrote:
>
>  
>> Hello Everyone,
>>
>> We are experiencing many HBase exceptions while running the attached
>> test program using
>> HBase 0.20.1 and Hadoop 0.20.1 (r810220) releases on a Hadoop cluster
>> with 14 data nodes.
>>
>> HBase operational performance degrades quickly once certain error
>> conditions are encountered, especially if there are issues with the META
>> region.
>>
>> Has anyone else seen these exceptions listed below?
>>
>> We ran the test with 3 region servers and then with 14 region servers.
>> In both cases, we still got exceptions.
>>
>> In each test run we shut down HBase, removed the main /hbase folder,
>> restarted HBase, and then ran the test program.
>> The test program uses a table named "fc_test". The program is started by
>> the following command:
>>
>> /opt/hbase/bin/hbase org.apache.hadoop.hbase.util.TableTest  30  4000000
>>
>> The attached HBase test program starts 30 threads. Each thread loops,
>> inserting randomly generated keys
>> along with a fixed data pattern of 4,000,000 bytes.
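>>
>> Roughly, each thread does something like the following (simplified sketch
>> against the 0.20 client API; the family/qualifier names and key width are
>> placeholders, see the attached program for the real details):
>>
>>     // Simplified sketch of one loader thread (placeholder names).
>>     static void loadLoop() throws Exception {
>>       HTable table = new HTable(new HBaseConfiguration(), "fc_test");
>>       byte [] value = new byte[4000000];     // fixed 4,000,000-byte pattern
>>       java.util.Random random = new java.util.Random();
>>       while (true) {
>>         byte [] row = new byte[16];          // randomly generated key
>>         random.nextBytes(row);
>>         Put put = new Put(row);
>>         put.add(Bytes.toBytes("data"), Bytes.toBytes("value"), value);
>>         table.put(put);
>>       }
>>     }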
>>
>> After running the test for 2-12 hours, scanning of the "fc_test" table is
>> broken and only a limited number of put operations still succeed.
>>
>>
>>
>> EXCEPTION LIST
>>
>> ------------------------------
>> 2009-10-15 03:28:39,005 DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: Major compaction triggered
>> on store data; time since last major compaction 89677908ms
>> 2009-10-15 03:28:39,012 DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: Started compaction of 4
>> file(s)  into /hbase/fc_test/compaction.dir/1073627844, seqid=396374
>> 2009-10-15 03:28:39,245 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>> obtain block blk_-6054029382496843671_324379 from any node:
>> java.io.IOException: No live nodes contain current block
>> ------------------------------
>> 2009-10-15 07:41:43,426 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
>> response for sessionid:0x2244f6194a5000d after 1ms
>> 2009-10-15 07:41:53,993 WARN org.apache.zookeeper.ClientCnxn: Exception
>> closing session 0x2244f6194a5000d to sun.nio.ch.selectionkeyi...@139f5ca
>> java.io.IOException: TIMED OUT
>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>> 2009-10-15 07:41:54,344 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
>> state: Disconnected, type: None, path: null
>> 2009-10-15 07:41:55,084 INFO org.apache.zookeeper.ClientCnxn: Attempting
>> connection to server hfs-029014/10.1.29.14:2181
>> 2009-10-15 07:41:55,085 INFO org.apache.zookeeper.ClientCnxn: Priming
>> connection to java.nio.channels.SocketChannel[connected
>> local=/10.1.29.19:39421 remote=hfs-029014/10.1.29.14:2181]
>> ------------------------------
>> 2009-10-15 08:37:40,982 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server handler 6 on 60020 caught: java.nio.channels.ClosedChannelException
>>    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
>>    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1125)
>>    at
>>
>> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:615)
>>    at
>>
>> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:679)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:943)
>> ------------------------------
>> 2009-10-15 07:57:28,025 DEBUG
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
>> Total=4.652237MB (4878224), Free=987.6603MB (1035636848), Max=992.3125MB
>> (1040515072), Counts: Blocks=0, Access=51003, Hit=0, Miss=51003,
>> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%,
>> Evicted/Run=NaN
>> 2009-10-15 07:57:28,189 INFO org.apache.hadoop.hdfs.DFSClient: Exception
>> in createBlockOutputStream java.io.EOFException
>> 2009-10-15 07:57:28,190 INFO org.apache.hadoop.hdfs.DFSClient:
>> Abandoning block blk_7166000434569842394_332632
>> 2009-10-15 07:57:28,306 WARN org.apache.hadoop.hbase.regionserver.HLog:
>> IPC Server handler 5 on 60020 took 2297ms appending an edit to hlog;
>> editcount=146
>> -----------------------------
>> 2009-10-15 09:10:00,559 INFO org.apache.zookeeper.ClientCnxn: Exception
>> while closing send thread for session 0x1244f6194780008 : Read error rc
>> = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
>> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
>> Disconnecting ClientCnxn for session: 0x1244f6194780008
>> 2009-10-15 09:10:00,739 INFO org.apache.zookeeper.ClientCnxn:
>> EventThread shut down
>> ------------------------------
>> 2009-10-13 12:30:58,207 WARN
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to read
>> master address from ZooKeeper. Retrying. Error was:
>> java.io.IOException:
>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
>> NoNode for /hbase/master
>>    at
>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:331)
>>    at
>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readMasterAddressOrThrow(ZooKeeperWrapper.java:240)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1338)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1370)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:427)
>>    at java.lang.Thread.run(Unknown Source)
>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode for /hbase/master
>>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>>    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:892)
>>    at
>>
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:327)
>> ------------------------------
>> 2009-10-14 15:12:42,664 WARN
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
>> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>>    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>>    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
>> Source)
>>    at java.lang.reflect.Constructor.newInstance(Unknown Source)
>>    at
>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>>    at
>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>>    at
>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>>    at java.lang.Thread.run(Unknown Source)
>> ------------------------------
>> 2009-10-14 15:12:43,566 WARN org.apache.zookeeper.ClientCnxn: Exception
>> closing session 0x1244f619478000d to sun.nio.ch.selectionkeyi...@15e32c4
>> java.io.IOException: Session Expired
>>    at
>>
>> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>>    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>>    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>> 2009-10-14 15:12:43,567 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
>> state: Expired, type: None, path: null
>> 2009-10-14 15:12:43,568 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
>> expired
>> ------------------------------
>> 2009-10-14 15:12:51,563 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> java.io.IOException: Cannot append; log is closed
>>    at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:584)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1445)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1244)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1831)
>>    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> ------------------------------
>> 2009-10-14 15:13:51,852 INFO org.apache.hadoop.fs.FileSystem: Could not
>> cancel cleanup thread, though no FileSystems are open
>> 2009-10-14 15:13:51,864 FATAL
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
>> HRegionServer: file system not available
>> java.io.IOException: File system is not available
>>    at
>>
>> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:125)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.checkFileSystem(HRegionServer.java:901)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:849)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:832)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1839)
>>    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> Caused by: java.io.IOException: Call to hfs-029010/10.1.29.10:8020
>> failed on local exception: java.nio.channels.ClosedByInterruptException
>>    at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
>>    at org.apache.hadoop.ipc.Client.call(Client.java:742)
>>    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>>    at $Proxy1.getFileInfo(Unknown Source)
>>    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at
>>
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>>    at
>>
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>>    at $Proxy1.getFileInfo(Unknown Source)
>>    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:615)
>>    at
>>
>> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>>    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
>>    at
>>
>> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:114)
>>    ... 9 more
>> Caused by: java.nio.channels.ClosedByInterruptException
>>    at java.nio.channels.spi.AbstractInterruptibleChannel.end(Unknown
>> Source)
>>    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
>>    at
>>
>> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>>    at
>>
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>    at
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>    at
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>>    at java.io.BufferedOutputStream.flush(Unknown Source)
>>    at java.io.DataOutputStream.flush(Unknown Source)
>>    at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:480)
>>    at org.apache.hadoop.ipc.Client.call(Client.java:720)
>> ------------------------------
>>
>> PREVIOUS EXCEPTIONS FROM OTHER TESTS INCLUDE
>>
>> 2009-10-13 00:00:00,508 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row
>> out of range for ...
>>    ....
>>    at
>> org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:1527)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1559)
>>    at
>> org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:1627)
>>    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2277)
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1768)
>>    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>    at java.lang.reflect.Method.invoke(Unknown Source)
>>    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>>    at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> ------------------------------
>> 2009-10-12 10:06:57,809 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server handler 7 on 60020, call put([...@1f5b92a,
>> [Lorg.apache.hadoop.hbase.client.Put;@14f8e05) from 10.1.29.10:36017:
>> error: java.io.IOException: Server not running, aborting
>> java.io.IOException: Server not running, aborting
>>  at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>>   at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>  at java.lang.reflect.Method.invoke(Unknown Source)
>>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>>  at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> 2009-10-12 10:06:57,812 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>> request=0.0, regions=12, stores=24, storefiles=15, storefileIndexSize=8,
>> memstoreSize=378, usedHeap=441, maxHeap=992, blockCacheSize=4878224,
>> blockCacheFree=1035636848, blockCacheCount=0, blockCacheHitRatio=0
>> 2009-10-12 10:06:57,813 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>> Server handler 3 on 60020, call put([...@1f52982,
>> [Lorg.apache.hadoop.hbase.client.Put;@78283f) from 10.1.29.10:36017:
>> error: java.io.IOException: Server not running, aborting
>> java.io.IOException: Server not running, aborting
>>    at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2299)
>>   at
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1806)
>>  at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>  at java.lang.reflect.Method.invoke(Unknown Source)
>>  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
>>  at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>> ------------------------------
>>
>>
>>
>>
>>
>> .
>>
>>
>>    
>
>  


      
