Every insert on an indexed table would require at the very least an extra RPC to a
different regionserver to update the index.  If the regionservers are busy, your
request could wait in the queue for a while.

One param to tune would be the handler thread count (hbase.regionserver.handler.count).  Set it to at least 100.
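
A minimal example of that setting in hbase-site.xml (100 is a starting point rather
than a tuned value, and the regionservers need a restart to pick it up):

  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>100</value>
  </property>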

On Thu, Apr 29, 2010 at 2:16 AM, Chris Tarnas <c...@email.com> wrote:
> I just finished some testing with JDK 1.6 u17 - so far no performance
> improvements from just changing that. Disabling LZO compression gained a
> little bit (up to about 30/sec from 25/sec per thread). Turning off indexes
> helped the most - that brought me up to 115/sec @ 2875 total rows a second. A
> single perl/thrift process can load at over 350 rows/sec, so it's not scaling
> as well as I would have expected, even without the indexes.
>
> Are the transactional indexes that costly? What is the bottleneck there? CPU
> utilization and network packets went up when I disabled the indexes, so I don't
> think CPU or network is the bottleneck for the indexed case. I was even able to add
> another 15 insert processes (total of 40) and only lost about 10% in per-process
> throughput. I could probably go even higher; none of the nodes are above 60% CPU
> utilization and IO wait was at most 3.5%.
>
> Each rowkey is unique, so there should not be any blocking on the row locks. 
> I'll do more indexed tests tomorrow.
>
> thanks,
> -chris
>
>
>
>
>
>
>
> On Apr 29, 2010, at 12:18 AM, Todd Lipcon wrote:
>
>> Definitely smells like JDK 1.6.0_18. Downgrade that back to 16 or 17 and you
>> should be good to go. _18 is a botched release if I ever saw one.
>>
>> -Todd
>>
>> On Wed, Apr 28, 2010 at 10:54 PM, Chris Tarnas <c...@email.com> wrote:
>>
>>> Hi Stack,
>>>
>>> Thanks for looking. I checked the ganglia charts: no server was at more
>>> than ~20% CPU utilization at any time during the load test, and swap was
>>> never used. Network traffic was light - just running a count through the hbase
>>> shell generates much higher usage. On the server hosting meta specifically,
>>> CPU was at about 15-20%, and IO wait never went above 3% and was usually
>>> near 0.
>>>
>>> The load also died with a thrift timeout on every single node (each node
>>> connects to localhost for its thrift server). It looks like a datanode
>>> just died and caused every thrift connection to time out - I'll have to up
>>> that timeout to handle a node death.
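>>>
>>> On the Perl side that should just be a matter of bumping the timeouts on the
>>> Thrift::Socket before opening the transport, something like this (values are in
>>> milliseconds; 60s is a guess here, not a tuned number):
>>>
>>>   $socket->setSendTimeout(60000);   # up from the default
>>>   $socket->setRecvTimeout(60000);   # long enough to ride out a node death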
>>>
>>> Checking the logs, this appears in the log of the regionserver hosting meta;
>>> it looks like the dead datanode is causing this error:
>>>
>>> 2010-04-29 01:01:38,948 WARN org.apache.hadoop.hdfs.DFSClient:
>>> DFSOutputStream ResponseProcessor exception  for block
>>> blk_508630839844593817_11180java.io.IOException: Bad response 1 for block
>>> blk_508630839844593817_11180 from datanode 10.195.150.255:50010
>>>       at
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)
>>>
>>> The regionserver log on the dead node, 10.195.150.255, has some more errors
>>> in it:
>>>
>>> http://pastebin.com/EFH9jz0w
>>>
>>> I found this in the .out file on the datanode:
>>>
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode
>>> linux-amd64 )
>>> # Problematic frame:
>>> # V  [libjvm.so+0x62263c]
>>> #
>>> # An error report file with more information is saved as:
>>> # /usr/local/hadoop-0.20.1/hs_err_pid1364.log
>>> #
>>> # If you would like to submit a bug report, please visit:
>>> #   http://java.sun.com/webapps/bugreport/crash.jsp
>>> #
>>>
>>>
>>> There is not a single error in the datanode's log, though. Also of note -
>>> this happened well into the test, so the node dying caused the load to abort
>>> but wasn't responsible for the earlier poor performance. Looking through the
>>> mailing list it looks like java 1.6.0_18 has a bad rep, so I'll update the AMI
>>> (although I'm using the same JVM on other servers in the office w/o issue,
>>> with decent single-node performance and no crashes...).
>>>
>>> Thanks for any help!
>>> -chris
>>>
>>>
>>>
>>>
>>> On Apr 28, 2010, at 10:10 PM, Stack wrote:
>>>
>>>> What is load on the server hosting meta like?  Higher than others?
>>>>
>>>>
>>>>
>>>> On Apr 28, 2010, at 8:42 PM, Chris Tarnas <c...@email.com> wrote:
>>>>
>>>>> Hi JG,
>>>>>
>>>>> Speed is now down to 18 rows/sec/table per process.
>>>>>
>>>>> Here is a regionserver log that is serving two of the regions:
>>>>>
>>>>> http://pastebin.com/Hx5se0hz
>>>>>
>>>>> Here is the GC Log from the same server:
>>>>>
>>>>> http://pastebin.com/ChrRvxCx
>>>>>
>>>>> Here is the master log:
>>>>>
>>>>> http://pastebin.com/L1Kn66qU
>>>>>
>>>>> The thrift server logs have nothing in them in the same time period.
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> -chris
>>>>>
>>>>> On Apr 28, 2010, at 7:32 PM, Jonathan Gray wrote:
>>>>>
>>>>>> Hey Chris,
>>>>>>
>>>>>> That's a really significant slowdown.  I can't think of anything
>>>>>> obvious that would cause that in your setup.
>>>>>>
>>>>>> Any chance of some regionserver and master logs from the time it was
>>>>>> going slow?  Is there any activity in the logs of the regionservers hosting
>>>>>> the regions of the table being written to?
>>>>>>
>>>>>> JG
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Christopher Tarnas [mailto:c...@tarnas.org] On Behalf Of Chris
>>>>>>> Tarnas
>>>>>>> Sent: Wednesday, April 28, 2010 6:27 PM
>>>>>>> To: hbase-user@hadoop.apache.org
>>>>>>> Subject: EC2 + Thrift inserts
>>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> First, thanks to all the HBase developers for producing this, it's a
>>>>>>> great project and I'm glad to be able to use it.
>>>>>>>
>>>>>>> I'm looking for some help and hints here with insert performance.
>>>>>>> I'm doing some benchmarking, testing how I can scale up using HBase,
>>>>>>> not really looking at raw speed. The testing is happening on EC2, using
>>>>>>> Andrew's scripts (thanks - those were very helpful) to set them up and
>>>>>>> with a slightly customized version of the default AMIs (added my
>>>>>>> application modules). I'm using HBase 0.20.3 and Hadoop 0.20.1. I've looked
>>>>>>> at the tips in the Wiki and it looks like Andrew's scripts are already
>>>>>>> set up that way.
>>>>>>>
>>>>>>> I'm inserting into HBase from a hadoop streaming job that runs perl and
>>>>>>> uses the thrift gateway. I'm also using the transactional tables, so
>>>>>>> that alone could be the cause, but from what I can tell I don't think
>>>>>>> so. LZO compression is also enabled for the column families (much of
>>>>>>> the data is highly compressible). My cluster has 7 nodes: 5
>>>>>>> regionservers, 1 master and 1 zookeeper. The regionservers and master
>>>>>>> are c1.xlarges. Each regionserver node has a tasktracker that runs the
>>>>>>> hadoop streaming jobs, and each regionserver also runs its own thrift
>>>>>>> server. Each mapper that does the load talks to the localhost's thrift
>>>>>>> server.
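>>>>>>>
>>>>>>> For reference, each mapper's insert path boils down to roughly the
>>>>>>> following (simplified sketch; the Hbase::* modules are the Perl bindings
>>>>>>> generated from the bundled Hbase.thrift, 9090 is the default thrift port,
>>>>>>> and the table/column names here are placeholders, not my real schema):
>>>>>>>
>>>>>>>   use Thrift::Socket;
>>>>>>>   use Thrift::BufferedTransport;
>>>>>>>   use Thrift::BinaryProtocol;
>>>>>>>   use Hbase::Hbase;
>>>>>>>
>>>>>>>   # one connection per mapper, to the thrift server on this node
>>>>>>>   my $socket    = Thrift::Socket->new('localhost', 9090);
>>>>>>>   my $transport = Thrift::BufferedTransport->new($socket);
>>>>>>>   my $client    = Hbase::HbaseClient->new(Thrift::BinaryProtocol->new($transport));
>>>>>>>   $transport->open();
>>>>>>>
>>>>>>>   # one mutateRow per input record from the streaming job
>>>>>>>   while (my $line = <STDIN>) {
>>>>>>>       chomp $line;
>>>>>>>       my ($rowkey, $value) = split /\t/, $line;
>>>>>>>       my $mutations = [ Hbase::Mutation->new({ column => 'fam:col', value => $value }) ];
>>>>>>>       $client->mutateRow('mytable', $rowkey, $mutations);
>>>>>>>   }
>>>>>>>
>>>>>>>   $transport->close();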
>>>>>>>
>>>>>>> The row keys are a fixed string + an incremental number, with the order of
>>>>>>> the bytes then reversed, so runA123 becomes 321Anur. I thought of using a
>>>>>>> murmur hash but was worried about collisions.
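>>>>>>>
>>>>>>> In Perl terms the key is built and flipped like this ("runA" standing in
>>>>>>> for the real prefix):
>>>>>>>
>>>>>>>   my $rowkey = scalar reverse('runA' . $counter);   # runA123 -> 321Anur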
>>>>>>>
>>>>>>> As I add more insert jobs, each job's throughput goes down. Way down. I
>>>>>>> went from about 200 rows/sec/table per job with one job to about 24
>>>>>>> rows/sec/table per job with 25 running jobs. The servers are mostly
>>>>>>> idle. I'm loading into two tables: one has several indexes and I'm
>>>>>>> loading into three column families, the other has no indexes and one
>>>>>>> column family. Both tables currently have only two regions each.
>>>>>>>
>>>>>>> The regionserver that serves the indexed table's regions is using the
>>>>>>> most CPU but is 87% idle. The other servers are all at ~90% idle. There
>>>>>>> is no IO wait. The perl processes are barely ticking over. Java on the
>>>>>>> most "loaded" server is using about 50-60% of one CPU.
>>>>>>>
>>>>>>> Normally when I do a load in a pseudo-distributed hbase (my development
>>>>>>> platform), perl's speed is the limiting factor and it uses about 85% of a
>>>>>>> CPU. In this cluster the perl processes are using only 5-10% of a CPU as
>>>>>>> they are all waiting on thrift (hbase). When I run only 1 process on the
>>>>>>> cluster, perl uses much more of a CPU, maybe 70%.
>>>>>>>
>>>>>>> Any tips or help in getting the speed/scalability up would be great.
>>>>>>> Please let me know if you need any other info.
>>>>>>>
>>>>>>> As I send this - it looks like the main table has split again and is
>>>>>>> being served by three regionservers. My performance is going up a bit
>>>>>>> (now 35 rows/sec/table per process), but it still seems like I'm not
>>>>>>> using the full potential of even this limited EC2 setup: no IO wait and
>>>>>>> lots of idle CPU.
>>>>>>>
>>>>>>>
>>>>>>> many thanks
>>>>>>> -chris
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>
>
