Region Servers

ServerName                        Start time                    Load
app-hbase-1,60020,1398141516916   Tue Apr 22 12:38:36 CST 2014  requestsPerSecond=448799, numberOfOnlineRegions=8, usedHeapMB=1241, maxHeapMB=7948
app-hbase-2,60020,1398141516914   Tue Apr 22 12:38:36 CST 2014  requestsPerSecond=0, numberOfOnlineRegions=3, usedHeapMB=829, maxHeapMB=7948
app-hbase-4,60020,1398141525533   Tue Apr 22 12:38:45 CST 2014  requestsPerSecond=217349, numberOfOnlineRegions=6, usedHeapMB=1356, maxHeapMB=7948
app-hbase-5,60020,1398141524870   Tue Apr 22 12:38:44 CST 2014  requestsPerSecond=844, numberOfOnlineRegions=3, usedHeapMB=285, maxHeapMB=7948

Total: servers: 4, requestsPerSecond=666992, numberOfOnlineRegions=20

app-hbase-2 and app-hbase-5 have only 3 regions each, while app-hbase-1 has 8. How can I balance them?
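[One way to approach this from the HBase shell (an editor's sketch, not from the thread; the encoded region name in the move example is a placeholder, not a real region from this cluster):

    hbase(main):001:0> balance_switch true   # make sure the balancer is enabled; returns the previous state
    hbase(main):002:0> balancer              # ask the master for a balance pass; returns true if it ran

    # if the balancer declines to move anything, a region can be moved by hand:
    hbase(main):003:0> move 'ENCODED_REGION_NAME', 'app-hbase-5,60020,1398141524870'

Note that the default balancer in 0.94 balances by region count within a slop factor (hbase.regions.slop, default 0.2), so small deviations from the per-server average are deliberately left alone.]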
On Tue, Apr 22, 2014 at 3:53 PM, Li Li <fancye...@gmail.com> wrote:
> hbase current statistics:
>
> Region Servers
>
> ServerName                        Start time                    Load
> app-hbase-1,60020,1398141516916   Tue Apr 22 12:38:36 CST 2014  requestsPerSecond=6100, numberOfOnlineRegions=7, usedHeapMB=1201, maxHeapMB=7948
> app-hbase-2,60020,1398141516914   Tue Apr 22 12:38:36 CST 2014  requestsPerSecond=1770, numberOfOnlineRegions=4, usedHeapMB=224, maxHeapMB=7948
> app-hbase-4,60020,1398141525533   Tue Apr 22 12:38:45 CST 2014  requestsPerSecond=3445, numberOfOnlineRegions=5, usedHeapMB=798, maxHeapMB=7948
> app-hbase-5,60020,1398141524870   Tue Apr 22 12:38:44 CST 2014  requestsPerSecond=57, numberOfOnlineRegions=2, usedHeapMB=328, maxHeapMB=7948
>
> Total: servers: 4, requestsPerSecond=11372, numberOfOnlineRegions=18
>
> On Tue, Apr 22, 2014 at 3:40 PM, Li Li <fancye...@gmail.com> wrote:
>> I have restarted the server and it is running now. The load will probably become high again in an hour or so.
>>
>> On Tue, Apr 22, 2014 at 3:02 PM, Azuryy Yu <azury...@gmail.com> wrote:
>>> Do you still have the same issue?
>>>
>>> Also, regarding:
>>>   -Xmx8000m -server -XX:NewSize=512m -XX:MaxNewSize=512m
>>>
>>> the Eden size is too small.
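[A larger young generation along those lines might look like the following in conf/hbase-env.sh (an editor's sketch only; 2g is an illustrative assumption, not a value recommended anywhere in this thread, and should be sized against the 8 GB heap and actual GC logs):

    # conf/hbase-env.sh - illustrative values, assuming the 8 GB region server heap above
    export HBASE_REGIONSERVER_OPTS="-Xmx8000m -server \
      -XX:NewSize=2g -XX:MaxNewSize=2g \
      -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
      -XX:+PrintGCDetails -Xloggc:/tmp/gc-regionserver.log"  # GC log path illustrative; use it to verify the effect]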
>>>
>>> On Tue, Apr 22, 2014 at 2:55 PM, Li Li <fancye...@gmail.com> wrote:
>>>> <property>
>>>>   <name>dfs.datanode.handler.count</name>
>>>>   <value>100</value>
>>>>   <description>The number of server threads for the datanode.</description>
>>>> </property>
>>>>
>>>> 1. namenode/master 192.168.10.48
>>>> http://pastebin.com/7M0zzAAc
>>>>
>>>> $ free -m (these are the values now, just after restarting hadoop and hbase, not the values when it crashed)
>>>>                      total   used    free  shared  buffers  cached
>>>> Mem:                 15951   3819   12131       0      509    1990
>>>> -/+ buffers/cache:           1319   14631
>>>> Swap:                 8191      0    8191
>>>>
>>>> 2. datanode/region 192.168.10.45
>>>> http://pastebin.com/FiAw1yju
>>>>
>>>> $ free -m
>>>>                      total   used    free  shared  buffers  cached
>>>> Mem:                 15951   3627   12324       0     1516     641
>>>> -/+ buffers/cache:           1469   14482
>>>> Swap:                 8191      8    8183
>>>>
>>>> On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu <azury...@gmail.com> wrote:
>>>> > One big possible issue is a high volume of concurrent requests against HDFS or HBase: all the datanode handlers become busy, further requests queue up, and they eventually time out. You can try increasing dfs.datanode.handler.count and dfs.namenode.handler.count in hdfs-site.xml, then restart HDFS.
>>>> >
>>>> > Also, what are the JVM options for your datanodes, namenode, and region servers? If they are all at the defaults, that can also cause this issue.
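[The datanode-side property already appears at 100 earlier in the thread; the namenode counterpart of that suggestion might look like this in hdfs-site.xml (a sketch; 30 is an illustrative value, the Hadoop 1.x default is 10):

    <property>
      <name>dfs.namenode.handler.count</name>
      <value>30</value>
      <description>The number of server threads for the namenode.</description>
    </property>]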
>>>> >
>>>> > On Tue, Apr 22, 2014 at 2:20 PM, Li Li <fancye...@gmail.com> wrote:
>>>> >> my cluster setup: all 6 machines are virtual machines.
>>>> >> each machine: 4 CPUs (Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz), 16GB memory
>>>> >> 192.168.10.48 namenode/jobtracker
>>>> >> 192.168.10.47 secondary namenode
>>>> >> 192.168.10.45 datanode/tasktracker
>>>> >> 192.168.10.46 datanode/tasktracker
>>>> >> 192.168.10.49 datanode/tasktracker
>>>> >> 192.168.10.50 datanode/tasktracker
>>>> >>
>>>> >> hdfs logs around 20:33
>>>> >> 192.168.10.48 namenode log http://pastebin.com/rwgmPEXR
>>>> >> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found that this datanode crashed first)
>>>> >> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
>>>> >> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
>>>> >> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
>>>> >>
>>>> >> hbase logs around 20:33
>>>> >> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
>>>> >> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
>>>> >> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
>>>> >> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
>>>> >> 192.168.10.50 region log http://pastebin.com/bD6h5T6p (very strange: no log at 20:33, but there are logs at 20:32 and 20:34)
>>>> >>
>>>> >> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>> >> > Can you post more of the data node log, around 20:33?
>>>> >> >
>>>> >> > Cheers
>>>> >> >
>>>> >> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fancye...@gmail.com> wrote:
>>>> >> >> hadoop 1.0
>>>> >> >> hbase 0.94.11
>>>> >> >>
>>>> >> >> datanode log from 192.168.10.45. Why did it shut itself down?
>>>> >> >>
>>>> >> >> 2014-04-21 20:33:59,309 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-7969006819959471805_202154 received exception java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
>>>> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.10.45:50010, storageID=DS-1676697306-192.168.10.45-50010-1392029190949, infoPort=50075, ipcPort=50020):DataXceiver
>>>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
>>>> >> >>     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>>>> >> >>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>>> >> >>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>>> >> >>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>>> >> >>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>>> >> >>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>>> >> >>     at java.io.DataInputStream.read(DataInputStream.java:149)
>>>> >> >>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>>>> >> >>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
>>>> >> >>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
>>>> >> >>     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>>>> >> >>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>>>> >> >>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>>>> >> >>     at java.lang.Thread.run(Thread.java:722)
>>>> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.10.45:50010, storageID=DS-1676697306-192.168.10.45-50010-1392029190949, infoPort=50075, ipcPort=50020):DataXceiver
>>>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 466924 millis timeout left.
>>>> >> >>     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>>>> >> >>     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>>>> >> >>     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>>> >> >>     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>>> >> >>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>>> >> >>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>>> >> >>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>>> >> >>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>>>> >> >>     at java.lang.Thread.run(Thread.java:722)
>>>> >> >> 2014-04-21 20:34:00,291 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 0
>>>> >> >> 2014-04-21 20:34:00,404 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: Shutting down all async disk service threads...
>>>> >> >> 2014-04-21 20:34:00,405 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All async disk service threads have been shut down.
>>>> >> >> 2014-04-21 20:34:00,413 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
>>>> >> >> 2014-04-21 20:34:00,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
>>>> >> >> /************************************************************
>>>> >> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
>>>> >> >> ************************************************************/
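[An editor's aside, since the errors above all come from DataXceiver threads (its relevance to this particular crash is an assumption): Hadoop 1.x also caps the number of concurrent DataXceiver threads per datanode with dfs.datanode.max.xcievers (the misspelling is historical; the default is 256), and the HBase documentation recommends raising it for HBase workloads, e.g. in hdfs-site.xml:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>]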
>>>> >> >>
>>>> >> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>> >> >> > bq. one datanode failed
>>>> >> >> >
>>>> >> >> > Was the crash due to an out of memory error?
>>>> >> >> > Can you post the tail of the data node log on pastebin?
>>>> >> >> >
>>>> >> >> > Giving us the versions of hadoop and hbase would be helpful.
>>>> >> >> >
>>>> >> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fancye...@gmail.com> wrote:
>>>> >> >> >> I have a small hbase cluster with 1 namenode, 1 secondary namenode, and 4 datanodes.
>>>> >> >> >> The hbase master is on the same machine as the namenode, and the 4 hbase region servers are on the datanode machines.
>>>> >> >> >> The average requests per second were about 10,000, and the cluster crashed. I found the reason was that one datanode failed.
>>>> >> >> >>
>>>> >> >> >> Each datanode machine has about 4 CPU cores and 10GB of memory.
>>>> >> >> >> Is my cluster overloaded?