Every insert on an indexed table would require, at the very least, an RPC to a different regionserver. If the regionservers are busy, your request could wait in the queue for a while.

One parameter to tune would be the handler thread count; set it to at least 100.
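That setting is hbase.regionserver.handler.count in hbase-site.xml. A minimal snippet (the regionservers need a restart to pick it up):

  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>100</value>
  </property>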
On Thu, Apr 29, 2010 at 2:16 AM, Chris Tarnas <c...@email.com> wrote:

> I just finished some testing with JDK 1.6 u17 - so far no performance improvements with just changing that. Disabling LZO compression did gain a little bit (up to about 30/sec from 25/sec per thread). Turning off indexes helped the most - that brought me up to 115/sec @ 2875 total rows a second. A single perl/thrift process can load at over 350 rows/sec, so it's not scaling as well as I would have expected, even without the indexes.
>
> Are the transactional indexes that costly? What is the bottleneck there? CPU utilization and network packets went up when I disabled the indexes, so I don't think those are the bottlenecks for the indexes. I was even able to add another 15 insert processes (total of 40) and only lost about 10% of per-process throughput. I could probably go even higher; none of the nodes are above 60% CPU utilization and IO wait was at most 3.5%.
>
> Each rowkey is unique, so there should not be any blocking on the row locks. I'll do more indexed tests tomorrow.
>
> thanks,
> -chris
>
> On Apr 29, 2010, at 12:18 AM, Todd Lipcon wrote:
>
>> Definitely smells like JDK 1.6.0_18. Downgrade that back to 16 or 17 and you should be good to go. _18 is a botched release if I ever saw one.
>>
>> -Todd
>>
>> On Wed, Apr 28, 2010 at 10:54 PM, Chris Tarnas <c...@email.com> wrote:
>>
>>> Hi Stack,
>>>
>>> Thanks for looking. I checked the ganglia charts: no server was at more than ~20% CPU utilization at any time during the load test, and swap was never used. Network traffic was light - just running a count through the hbase shell generates much higher use. On the server hosting meta specifically, CPU was at about 15-20%, and IO wait never went above 3%, usually staying near 0.
>>>
>>> The load also died with a thrift timeout on every single node (each node connects to localhost for its thrift server). It looks like a datanode just died and caused every thrift connection to time out - I'll have to raise that limit to handle a node death.
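Raising that timeout on the client side is straightforward. A rough sketch, assuming the stock Thrift perl libraries and a client package generated from Hbase.thrift (the package name, port, and 60-second value are illustrative, not taken from Chris's loader):

  use Thrift::Socket;
  use Thrift::BufferedTransport;
  use Thrift::BinaryProtocol;
  use Hbase::Hbase;   # generated by the Thrift compiler; your package name may differ

  # Generous send/recv timeouts (in milliseconds) so one slow or dying
  # node doesn't kill the whole load with a client-side timeout.
  my $socket = Thrift::Socket->new('localhost', 9090);
  $socket->setSendTimeout(60000);
  $socket->setRecvTimeout(60000);

  my $transport = Thrift::BufferedTransport->new($socket);
  my $protocol  = Thrift::BinaryProtocol->new($transport);
  my $client    = Hbase::HbaseClient->new($protocol);
  $transport->open();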
>>> Checking the logs, this appears in the log of the regionserver hosting meta; it looks like the dead datanode caused the error:
>>>
>>> 2010-04-29 01:01:38,948 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_508630839844593817_11180
>>> java.io.IOException: Bad response 1 for block blk_508630839844593817_11180 from datanode 10.195.150.255:50010
>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)
>>>
>>> The regionserver log on the dead node, 10.195.150.255, has some more errors in it:
>>>
>>> http://pastebin.com/EFH9jz0w
>>>
>>> I found this in the .out file on the datanode:
>>>
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode linux-amd64)
>>> # Problematic frame:
>>> # V  [libjvm.so+0x62263c]
>>> #
>>> # An error report file with more information is saved as:
>>> # /usr/local/hadoop-0.20.1/hs_err_pid1364.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> #   http://java.sun.com/webapps/bugreport/crash.jsp
>>> #
>>>
>>> There is not a single error in the datanode's log, though. Also of note: this happened well into the test, so the node dying caused the load to abort, but it doesn't explain the prior poor performance. Looking through the mailing list, it looks like Java 1.6.0_18 has a bad rep, so I'll update the AMI (although I'm using the same JVM on other servers in the office without issue - decent single-node performance and no crashes...).
>>>
>>> Thanks for any help!
>>> -chris
>>>
>>> On Apr 28, 2010, at 10:10 PM, Stack wrote:
>>>
>>>> What is the load like on the server hosting meta? Higher than on the others?
>>>>
>>>> On Apr 28, 2010, at 8:42 PM, Chris Tarnas <c...@email.com> wrote:
>>>>
>>>>> Hi JG,
>>>>>
>>>>> Speed is now down to 18 rows/sec/table per process.
>>>>>
>>>>> Here is the log from a regionserver that is serving two of the regions:
>>>>>
>>>>> http://pastebin.com/Hx5se0hz
>>>>>
>>>>> Here is the GC log from the same server:
>>>>>
>>>>> http://pastebin.com/ChrRvxCx
>>>>>
>>>>> Here is the master log:
>>>>>
>>>>> http://pastebin.com/L1Kn66qU
>>>>>
>>>>> The thrift server logs have nothing in them from the same time period.
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> -chris
>>>>>
>>>>> On Apr 28, 2010, at 7:32 PM, Jonathan Gray wrote:
>>>>>
>>>>>> Hey Chris,
>>>>>>
>>>>>> That's a really significant slowdown. I can't think of anything obvious that would cause that in your setup.
>>>>>>
>>>>>> Any chance of some regionserver and master logs from the time it was going slow? Is there any activity in the logs of the regionservers hosting the regions of the table being written to?
>>>>>>
>>>>>> JG
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Christopher Tarnas [mailto:c...@tarnas.org] On Behalf Of Chris Tarnas
>>>>>>> Sent: Wednesday, April 28, 2010 6:27 PM
>>>>>>> To: hbase-user@hadoop.apache.org
>>>>>>> Subject: EC2 + Thrift inserts
>>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> First, thanks to all the HBase developers for producing this; it's a great project and I'm glad to be able to use it.
>>>>>>>
>>>>>>> I'm looking for some help and hints with insert performance. I'm doing some benchmarking, testing how I can scale up using HBase, not really looking at raw speed. The testing is happening on EC2, using Andrew's scripts (thanks - those were very helpful) to set up the cluster, with a slightly customized version of the default AMIs (I added my application modules). I'm using HBase 0.20.3 and Hadoop 0.20.1. I've looked at the tips in the wiki, and it looks like Andrew's scripts already set things up that way.
>>>>>>>
>>>>>>> I'm inserting into HBase from a hadoop streaming job that runs perl and uses the thrift gateway. I'm also using the transactional tables, so that alone could be the cause, but from what I can tell I don't think so. LZO compression is also enabled for the column families (much of the data is highly compressible). My cluster has 7 nodes: 5 regionservers, 1 master, and 1 zookeeper. The regionservers and master are c1.xlarges. Each regionserver hosts a tasktracker that runs the hadoop streaming jobs, and each regionserver also runs its own thrift server. Each mapper that does the load talks to the localhost's thrift server.
>>>>>>>
>>>>>>> The row keys are a fixed string plus an incremental number, with the byte order then reversed, so runA123 becomes 321Anur. I thought of using murmur hash but was worried about collisions.
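That reversal is cheap to do in perl, and unlike a hash it is bijective, so collisions are impossible and the original key is always recoverable. A sketch with hypothetical values (the prefix and counter are made up for illustration):

  # Reversing the bytes puts the fast-changing digits first, so
  # consecutive keys scatter across regions instead of piling onto one.
  my $run_prefix = 'runA';    # hypothetical fixed string
  my $seq        = 123;       # hypothetical incremental counter
  my $rowkey     = scalar reverse($run_prefix . $seq);   # "321Anur"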
>>>>>>> As I add more insert jobs, each job's throughput goes down. Way down. I went from about 200 rows/sec/table per job with one job to about 24 rows/sec/table per job with 25 running jobs. The servers are mostly idle. I'm loading into two tables: one has several indexes and I'm loading into three column families; the other has no indexes and one column family. Both tables currently have only two regions each.
>>>>>>>
>>>>>>> The regionserver that serves the indexed table's regions is using the most CPU but is still 87% idle. The other servers are all at ~90% idle. There is no IO wait. The perl processes are barely ticking over. Java on the most "loaded" server is using about 50-60% of one CPU.
>>>>>>>
>>>>>>> Normally, when I do a load on a pseudo-distributed hbase (my development platform), perl's speed is the limiting factor and it uses about 85% of a CPU. In this cluster the perl processes are using only 5-10% of a CPU, as they are all waiting on thrift (hbase). When I run only 1 process on the cluster, perl uses much more of a CPU, maybe 70%.
>>>>>>>
>>>>>>> Any tips or help in getting the speed/scalability up would be great. Please let me know if you need any other info.
>>>>>>>
>>>>>>> As I send this, it looks like the main table has split again and is being served by three regionservers. My performance is going up a bit (now 35 rows/sec/table per process), but it still seems like I'm not using the full potential of even this limited EC2 setup: no IO wait and lots of idle CPU.
>>>>>>>
>>>>>>> many thanks
>>>>>>> -chris
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera