Definitely smells like JDK 1.6.0_18. Downgrade that back to 1.6.0_16 or _17 and you should be good to go. _18 is a botched release if I ever saw one.
-Todd

On Wed, Apr 28, 2010 at 10:54 PM, Chris Tarnas <c...@email.com> wrote:
> Hi Stack,
>
> Thanks for looking. I checked the ganglia charts: no server was at more
> than ~20% CPU utilization at any time during the load test, and swap was
> never used. Network traffic was light - just running a count through the hbase
> shell generates much higher use. On the server hosting meta specifically,
> it was at about 15-20% CPU, and IO wait never went above 3%, usually
> near 0.
>
> The load also died with a thrift timeout on every single node (each node
> connecting to localhost for its thrift server). It looks like a datanode
> just died and caused every thrift connection to time out - I'll have to up
> that limit to handle a node death.
>
> Checking logs, this appears in the log of the regionserver hosting meta;
> it looks like the dead datanode caused this error:
>
> 2010-04-29 01:01:38,948 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception for block
> blk_508630839844593817_11180 java.io.IOException: Bad response 1 for block
> blk_508630839844593817_11180 from datanode 10.195.150.255:50010
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)
>
> The regionserver log on the dead node, 10.195.150.255, has some more errors
> in it:
>
> http://pastebin.com/EFH9jz0w
>
> I found this in the .out file on the datanode:
>
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode
> linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x62263c]
> #
> # An error report file with more information is saved as:
> # /usr/local/hadoop-0.20.1/hs_err_pid1364.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
>
> There is not a single error in the datanode's log, though. Also of note -
> this happened well into the test, so the node dying caused the load to abort
> but not the prior poor performance.
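The cluster-wide thrift timeout Chris describes (every client dying when one datanode failed) is the kind of transient fault a client-side retry can absorb instead of aborting the whole load. A minimal Python sketch; the helper name and retry policy are illustrative assumptions, not anything from the thread:

```python
import socket


def call_with_retry(fn, retries=3):
    """Retry a zero-argument call that may hit a thrift socket timeout.

    When a datanode dies, every in-flight request against the local
    thrift server can time out at once; retrying (rather than letting
    the mapper die) rides out the recovery.
    """
    last_err = None
    for _ in range(retries):
        try:
            return fn()
        except socket.timeout as err:
            last_err = err  # the node may still be recovering; try again
    raise last_err
```

A real loader would wrap each per-row mutation call this way and likely back off between attempts rather than retrying immediately.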
> Looking through the mailing list it looks like java 1.6.0_18 has a bad rep,
> so I'll update the AMI (although I'm using the same JVM on other servers in
> the office w/o issue, with decent single-node performance and never dying...).
>
> Thanks for any help!
> -chris
>
>
> On Apr 28, 2010, at 10:10 PM, Stack wrote:
>
> > What is load on the server hosting meta like? Higher than others?
> >
> > On Apr 28, 2010, at 8:42 PM, Chris Tarnas <c...@email.com> wrote:
> >
> >> Hi JG,
> >>
> >> Speed is now down to 18 rows/sec/table per process.
> >>
> >> Here is a regionserver log that is serving two of the regions:
> >>
> >> http://pastebin.com/Hx5se0hz
> >>
> >> Here is the GC log from the same server:
> >>
> >> http://pastebin.com/ChrRvxCx
> >>
> >> Here is the master log:
> >>
> >> http://pastebin.com/L1Kn66qU
> >>
> >> The thrift server logs have nothing in them in the same time period.
> >>
> >> Thanks in advance!
> >>
> >> -chris
> >>
> >> On Apr 28, 2010, at 7:32 PM, Jonathan Gray wrote:
> >>
> >>> Hey Chris,
> >>>
> >>> That's a really significant slowdown. I can't think of anything
> >>> obvious that would cause that in your setup.
> >>>
> >>> Any chance of some regionserver and master logs from the time it was
> >>> going slow? Is there any activity in the logs of the regionservers
> >>> hosting the regions of the table being written to?
> >>>
> >>> JG
> >>>
> >>>> -----Original Message-----
> >>>> From: Christopher Tarnas [mailto:c...@tarnas.org] On Behalf Of Chris Tarnas
> >>>> Sent: Wednesday, April 28, 2010 6:27 PM
> >>>> To: hbase-user@hadoop.apache.org
> >>>> Subject: EC2 + Thrift inserts
> >>>>
> >>>> Hello all,
> >>>>
> >>>> First, thanks to all the HBase developers for producing this, it's a
> >>>> great project and I'm glad to be able to use it.
> >>>>
> >>>> I'm looking for some help and hints here with insert performance.
> >>>> I'm doing some benchmarking, testing how I can scale up using HBase,
> >>>> not really looking at raw speed.
> >>>> The testing is happening on EC2, using
> >>>> Andrew's scripts (thanks - those were very helpful) to set them up,
> >>>> with a slightly customized version of the default AMIs (added my
> >>>> application modules). I'm using HBase 0.20.3 and Hadoop 0.20.1. I've looked
> >>>> at the tips in the wiki and it looks like Andrew's scripts are already
> >>>> set up that way.
> >>>>
> >>>> I'm inserting into HBase from a hadoop streaming job that runs perl and
> >>>> uses the thrift gateway. I'm also using the transactional tables, so
> >>>> that alone could be the cause, but from what I can tell I don't think
> >>>> so. LZO compression is also enabled for the column families (much of
> >>>> the data is highly compressible). My cluster has 7 nodes: 5
> >>>> regionservers, 1 master and 1 zookeeper. The regionservers and master
> >>>> are c1.xlarges. Each regionserver hosts the tasktrackers that run the
> >>>> hadoop streaming jobs, and each regionserver also runs its own thrift
> >>>> server. Each mapper that does the load talks to the localhost's thrift
> >>>> server.
> >>>>
> >>>> The row keys are a fixed string + an incremental number, with the order
> >>>> of the bytes then reversed, so runA123 becomes 321Anur. I thought of using
> >>>> murmur hash but was worried about collisions.
> >>>>
> >>>> As I add more insert jobs, each job's throughput goes down. Way down. I
> >>>> went from about 200 rows/sec/table per job with one job to about 24
> >>>> rows/sec/table per job with 25 running jobs. The servers are mostly
> >>>> idle. I'm loading into two tables: one has several indexes and I'm
> >>>> loading into three column families; the other has no indexes and one
> >>>> column family. Both tables currently have only two regions each.
> >>>>
> >>>> The regionserver that serves the indexed table's regions is using the
> >>>> most CPU but is 87% idle. The other servers are all at ~90% idle. There
> >>>> is no IO wait. The perl processes are barely ticking over.
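The key-reversal scheme described above (runA123 becomes 321Anur) can be sketched in a few lines; this is an illustrative Python sketch, not code from the thread:

```python
def reverse_key(key: str) -> str:
    # Reversing the bytes moves the fast-changing digits to the front,
    # so sequentially generated keys spread across regions instead of
    # all landing on the last (hottest) region.
    return key[::-1]


reverse_key("runA123")  # "321Anur", as in the thread
```

The trade-off versus hashing is that reversal is collision-free and cheap, but the original key order is still recoverable, and the spread depends on how the trailing digits are distributed.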
> >>>> Java on the most "loaded" server is using about 50-60% of one CPU.
> >>>>
> >>>> Normally when I do a load in a pseudo-distributed hbase (my development
> >>>> platform), perl's speed is the limiting factor and it uses about 85% of a
> >>>> CPU. In this cluster the perl processes are using only 5-10% of a CPU as
> >>>> they are all waiting on thrift (hbase). When I run only 1 process on the
> >>>> cluster, perl uses much more of a CPU, maybe 70%.
> >>>>
> >>>> Any tips or help in getting the speed/scalability up would be great.
> >>>> Please let me know if you need any other info.
> >>>>
> >>>> As I send this - it looks like the main table has split again and is
> >>>> being served by three regionservers. My performance is going up a bit
> >>>> (now 35 rows/sec/table per process), but it still seems like I'm not
> >>>> using the full potential of even the limited EC2 system: no IO wait and
> >>>> lots of idle CPU.
> >>>>
> >>>> many thanks
> >>>> -chris

-- 
Todd Lipcon
Software Engineer, Cloudera
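Throughput rises at the end of the thread as soon as the table splits onto a third regionserver, which points at the real bottleneck: both tables start with only two regions. Pre-splitting at table creation avoids that warm-up. A hedged sketch of generating split points for the reversed-key scheme; the function name and digit-based splits are assumptions, since with reversed keys the leading byte is the low-order digit:

```python
def split_points(n_regions: int) -> list:
    """Split keys for reversed numeric row keys, whose first byte is a digit.

    Splitting the digit space '0'..'9' evenly gives n_regions - 1 split
    keys, so the table starts with n_regions regions instead of one.
    """
    digits = "0123456789"
    step = max(1, len(digits) // n_regions)
    return [digits[i] for i in range(step, len(digits), step)][: n_regions - 1]
```

For 5 regions this yields ['2', '4', '6', '8']; those strings would be handed to the table-creation call as region start keys, one region per regionserver from the first insert.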