Hi Stack,

Thanks for looking. I checked the ganglia charts: no server was at more than ~20% CPU utilization at any time during the load test, and swap was never used. Network traffic was light - just running a count through the hbase shell generates much higher usage. On the server hosting meta specifically, CPU was at about 15-20%, IO wait never went above 3%, and it was usually near 0.

The load also died with a thrift timeout on every single node (each node connects to its localhost thrift server). It looks like a datanode died and caused every thrift connection to time out - I'll have to raise that timeout so the load can ride over a node death.
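For reference, here is a minimal (untested) sketch of opening the connection with longer socket timeouts. The module and class names assume the stock Perl Thrift runtime plus bindings generated from Hbase.thrift into an Hbase package - adjust to whatever the generated code is actually called:

    use strict;
    use warnings;
    use Thrift::Socket;
    use Thrift::BufferedTransport;
    use Thrift::BinaryProtocol;
    use Hbase::Hbase;    # generated HBase thrift bindings (name is an assumption)

    # Connect to the thrift server on localhost (default port 9090).
    my $socket = Thrift::Socket->new('localhost', 9090);

    # Raise the send/recv timeouts (milliseconds) so a datanode death and the
    # resulting recovery on the server side don't time out every client.
    $socket->setSendTimeout(60000);
    $socket->setRecvTimeout(60000);

    my $transport = Thrift::BufferedTransport->new($socket);
    my $protocol  = Thrift::BinaryProtocol->new($transport);
    my $client    = Hbase::HbaseClient->new($protocol);

    $transport->open();
    # ... inserts go here ...
    $transport->close();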
Checking the logs, this appears in the log of the regionserver hosting meta; it looks like the dead datanode caused this error:

2010-04-29 01:01:38,948 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_508630839844593817_11180
java.io.IOException: Bad response 1 for block blk_508630839844593817_11180 from datanode 10.195.150.255:50010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)

The regionserver log on the dead node, 10.195.150.255, has some more errors in it: http://pastebin.com/EFH9jz0w

I found this in the .out file on the datanode:

# Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x62263c]
#
# An error report file with more information is saved as:
# /usr/local/hadoop-0.20.1/hs_err_pid1364.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

There is not a single error in the datanode's log, though. Also of note - this happened well into the test, so the dying node caused the load to abort but was not responsible for the prior poor performance.

Looking through the mailing list, it seems java 1.6.0_18 has a bad reputation, so I'll update the AMI (although I'm using the same JVM on other servers in the office without issue - decent single-node performance and no crashes...).

Thanks for any help!

-chris

On Apr 28, 2010, at 10:10 PM, Stack wrote:

> What is load on the server hosting meta like? Higher than others?
>
> On Apr 28, 2010, at 8:42 PM, Chris Tarnas <c...@email.com> wrote:
>
>> Hi JG,
>>
>> Speed is now down to 18 rows/sec/table per process.
>>
>> Here is a regionserver log that is serving two of the regions:
>>
>> http://pastebin.com/Hx5se0hz
>>
>> Here is the GC log from the same server:
>>
>> http://pastebin.com/ChrRvxCx
>>
>> Here is the master log:
>>
>> http://pastebin.com/L1Kn66qU
>>
>> The thrift server logs have nothing in them in the same time period.
>>
>> Thanks in advance!
>>
>> -chris
>>
>> On Apr 28, 2010, at 7:32 PM, Jonathan Gray wrote:
>>
>>> Hey Chris,
>>>
>>> That's a really significant slowdown. I can't think of anything obvious
>>> that would cause that in your setup.
>>>
>>> Any chance of some regionserver and master logs from the time it was
>>> going slow? Is there any activity in the logs of the regionservers
>>> hosting the regions of the table being written to?
>>>
>>> JG
>>>
>>>> -----Original Message-----
>>>> From: Christopher Tarnas [mailto:c...@tarnas.org] On Behalf Of Chris Tarnas
>>>> Sent: Wednesday, April 28, 2010 6:27 PM
>>>> To: hbase-user@hadoop.apache.org
>>>> Subject: EC2 + Thrift inserts
>>>>
>>>> Hello all,
>>>>
>>>> First, thanks to all the HBase developers for producing this, it's a
>>>> great project and I'm glad to be able to use it.
>>>>
>>>> I'm looking for some help and hints here with insert performance. I'm
>>>> doing some benchmarking, testing how I can scale up using HBase, not
>>>> really looking at raw speed. The testing is happening on EC2, using
>>>> Andrew's scripts (thanks - those were very helpful) to set them up,
>>>> with a slightly customized version of the default AMIs (added my
>>>> application modules). I'm using HBase 0.20.3 and Hadoop 0.20.1.
>>>> I've looked at the tips in the Wiki, and it looks like Andrew's scripts
>>>> are already set up that way.
>>>>
>>>> I'm inserting into HBase from a hadoop streaming job that runs perl and
>>>> uses the thrift gateway. I'm also using the Transactional tables, so
>>>> that alone could be the cause, but from what I can tell I don't think
>>>> so. LZO compression is also enabled for the column families (much of
>>>> the data is highly compressible). My cluster has 7 nodes: 5
>>>> regionservers, 1 master and 1 zookeeper. The regionservers and master
>>>> are c1.xlarges. Each regionserver hosts the tasktracker that runs the
>>>> hadoop streaming jobs, and each regionserver also runs its own thrift
>>>> server. Each mapper that does the load talks to its localhost's thrift
>>>> server.
>>>>
>>>> The row keys are a fixed string + an incremental number, with the byte
>>>> order then reversed, so runA123 becomes 321Anur. I thought of using a
>>>> murmur hash but was worried about collisions.
>>>>
>>>> As I add more insert jobs, each job's throughput goes down. Way down. I
>>>> went from about 200 rows/sec/table per job with one job to about 24
>>>> rows/sec/table per job with 25 running jobs. The servers are mostly
>>>> idle. I'm loading into two tables: one has several indexes and I'm
>>>> loading into three column families; the other has no indexes and one
>>>> column family. Both tables currently have only two regions each.
>>>>
>>>> The regionserver that serves the indexed table's regions is using the
>>>> most CPU but is 87% idle. The other servers are all at ~90% idle. There
>>>> is no IO wait. The perl processes are barely ticking over. Java on the
>>>> most "loaded" server is using about 50-60% of one CPU.
>>>>
>>>> Normally when I do a load in a pseudo-distributed hbase (my development
>>>> platform), perl's speed is the limiting factor and it uses about 85% of
>>>> a CPU. In this cluster the perl processes are using only 5-10% of a CPU
>>>> as they are all waiting on thrift (hbase). When I run only 1 process on
>>>> the cluster, perl uses much more of a CPU, maybe 70%.
>>>>
>>>> Any tips or help in getting the speed/scalability up would be great.
>>>> Please let me know if you need any other info.
>>>>
>>>> As I send this - it looks like the main table has split again and is
>>>> being served by three regionservers. My performance is going up a bit
>>>> (now 35 rows/sec/table per process), but it still seems like I'm not
>>>> using the full potential of even the limited EC2 system: no IO wait and
>>>> lots of idle CPU.
>>>>
>>>> many thanks
>>>> -chris
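For reference, a quick (untested) sketch of the key-reversal scheme described in the quoted message - fixed prefix plus an incrementing counter, with the resulting string reversed so consecutive counters don't all land on the same region. The prefix and counter values here are just illustrative:

    use strict;
    use warnings;

    # Build a row key from a fixed prefix and a counter, then reverse it.
    sub make_row_key {
        my ($prefix, $n) = @_;                 # e.g. ("runA", 123)
        return scalar reverse($prefix . $n);   # "runA123" -> "321Anur"
    }

    print make_row_key('runA', 123), "\n";     # prints 321Anur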
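And a rough sketch of what each mapper's per-row insert might look like against the local thrift gateway, reusing the $client and make_row_key from the sketches above. The table and column names are made up rather than the real schema, and this shows only the plain mutateRow path, nothing specific to the transactional/indexed tables:

    # One Mutation per cell; columns are addressed as 'family:qualifier'.
    my $row_key   = make_row_key('runA', 123);
    my $mutations = [
        Hbase::Mutation->new({ column => 'cf1:data', value => 'some value' }),
    ];

    # One RPC per row. If per-call overhead turns out to be the bottleneck,
    # batching many rows into a single mutateRows() call would cut round trips.
    $client->mutateRow('mytable', $row_key, $mutations);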