Re: EC2 + Thrift inserts

Chris Tarnas Thu, 29 Apr 2010 21:12:59 -0700

They are all at 100, but none of the regionservers are loaded - most are less than 20% CPU. Is this all network latency?


-chris


On Apr 29, 2010, at 8:29 PM, Ryan Rawson <ryano...@gmail.com> wrote:

Every insert on an indexed table would require at the very least an RPC to a
different regionserver.  If the regionservers are busy, your request
could wait in the queue for a moment.
One param to tune would be the handler thread count. Set it to 100 at least.
On Thu, Apr 29, 2010 at 2:16 AM, Chris Tarnas <c...@email.com> wrote:
I just finished some testing with JDK 1.6 u17 - so far no performance improvements with just changing that. Disabling LZO compression did gain a little bit (up to about 30/sec from 25/sec per thread). Turning off indexes helped the most - that brought me up to 115/sec @ 2875 total rows a second. A single perl/thrift process can load at over 350 rows/sec so it's not scaling as well as I would have expected, even without the indexes.
Are the transactional indexes that costly? What is the bottleneck there? CPU utilization and network packets went up when I disabled the indexes, so I don't think those are the bottlenecks for the indexes. I was even able to add another 15 insert processes (total of 40) and only lost about 10% on per-process throughput. I probably could go even higher; none of the nodes are above 60% CPU utilization and IO wait was at most 3.5%.
Each rowkey is unique, so there should not be any blocking on the row locks. I'll do more indexed tests tomorrow.
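
The throughput figures quoted in this message can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the latency estimate assumes each insert process issues one blocking Thrift call at a time, which the thread does not state explicitly.

```python
# Figures quoted in the message above (rows/sec per insert process).
SINGLE_PROCESS_RATE = 350.0   # one perl/thrift process running alone
PER_PROCESS_NO_INDEX = 115.0  # per process with indexes disabled
TOTAL_NO_INDEX = 2875.0       # aggregate rows/sec with indexes disabled


def process_count(total_rate, per_process_rate):
    """Number of concurrent processes implied by aggregate and per-process rates."""
    return round(total_rate / per_process_rate)


def scaling_efficiency(per_process_rate, single_process_rate):
    """Fraction of the single-process rate that each process keeps under load."""
    return per_process_rate / single_process_rate


def implied_latency_ms(per_process_rate):
    """Mean time per insert in milliseconds.

    ASSUMPTION: each process sends one blocking insert at a time, so its
    rate is simply the inverse of the mean round-trip time.
    """
    return 1000.0 / per_process_rate


if __name__ == "__main__":
    print("processes:", process_count(TOTAL_NO_INDEX, PER_PROCESS_NO_INDEX))
    eff = scaling_efficiency(PER_PROCESS_NO_INDEX, SINGLE_PROCESS_RATE)
    print("per-process efficiency: %.1f%%" % (100 * eff))
    for label, rate in [("baseline", 25.0),
                        ("LZO disabled", 30.0),
                        ("indexes disabled", PER_PROCESS_NO_INDEX)]:
        print("%s: %.1f ms per insert" % (label, implied_latency_ms(rate)))
```

Under that blocking-client assumption, low per-process rates combined with idle CPUs suggest that the time goes into waiting on each round trip rather than into computation, which fits the question about RPC queueing and latency raised later in the thread.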
thanks,
-chris







On Apr 29, 2010, at 12:18 AM, Todd Lipcon wrote:
Definitely smells like JDK 1.6.0_18. Downgrade that back to 16 or 17 and you
should be good to go. _18 is a botched release if I ever saw one.

-Todd
On Wed, Apr 28, 2010 at 10:54 PM, Chris Tarnas <c...@email.com> wrote:
Hi Stack,
Thanks for looking. I checked the ganglia charts; no server was at more than ~20% CPU utilization at any time during the load test and swap was never used. Network traffic was light - just running a count through hbase shell generates a much higher use. On the server hosting meta specifically, it was at about 15-20% CPU, and IO wait never went above 3% and was usually down near 0.
The load also died with a thrift timeout on every single node (each node connecting to localhost for its thrift server). It looks like a datanode just died and caused every thrift connection to time out - I'll have to up that limit to handle a node death.
Checking the logs, this appears in the log of the region server hosting meta;
it looks like the dead datanode is causing this error:

2010-04-29 01:01:38,948 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception  for block
blk_508630839844593817_11180java.io.IOException: Bad response 1 for block
blk_508630839844593817_11180 from datanode 10.195.150.255:50010
      at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)
The regionserver log on the dead node, 10.195.150.255, has some more errors
in it:

http://pastebin.com/EFH9jz0w

I found this in the .out file on the datanode:

# Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode
linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0x62263c]
#
# An error report file with more information is saved as:
# /usr/local/hadoop-0.20.1/hs_err_pid1364.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#
There is not a single error in the datanode's log though. Also of note - this happened well into the test, so the node dying caused the load to abort but not the prior poor performance. Looking through the mailing list it looks like java 1.6.0_18 has a bad rep so I'll update the AMI (although I'm using the same JVM on other servers in the office w/o issue and with decent single node performance and never dying...).

Thanks for any help!
-chris




On Apr 28, 2010, at 10:10 PM, Stack wrote:
What is load on the server hosting meta like?  Higher than others?



On Apr 28, 2010, at 8:42 PM, Chris Tarnas <c...@email.com> wrote:
Hi JG,

Speed is now down to 18 rows/sec/table per process.

Here is a regionserver log that is serving two of the regions:

http://pastebin.com/Hx5se0hz

Here is the GC Log from the same server:

http://pastebin.com/ChrRvxCx

Here is the master log:

http://pastebin.com/L1Kn66qU
The thrift server logs have nothing in them in the same time period.
Thanks in advance!

-chris

On Apr 28, 2010, at 7:32 PM, Jonathan Gray wrote:
Hey Chris,

That's a really significant slowdown.  I can't think of anything
obvious that would cause that in your setup.
Any chance of some regionserver and master logs from the time it was
going slow? Is there any activity in the logs of the regionservers hosting
the regions of the table being written to?
JG
-----Original Message-----
From: Christopher Tarnas [mailto:c...@tarnas.org] On Behalf Of Chris Tarnas
Sent: Wednesday, April 28, 2010 6:27 PM
To: hbase-user@hadoop.apache.org
Subject: EC2 + Thrift inserts

Hello all,
First, thanks to all the HBase developers for producing this, it's a
great project and I'm glad to be able to use it.
I'm looking for some help and hints here with insert performance. I'm doing some benchmarking, testing how I can scale up using HBase, not really looking at raw speed. The testing is happening on EC2, using Andrew's scripts (thanks - those were very helpful) to set them up and with a slightly customized version of the default AMIs (added my application modules). I'm using HBase 0.20.3 and Hadoop 0.20.1. I've looked at the tips in the Wiki and it looks like Andrew's scripts are already set up that way.
I'm inserting into HBase from a hadoop streaming job that runs perl and uses the thrift gateway. I'm also using the Transactional tables so that alone could be the cause, but from what I can tell I don't think so. LZO compression is also enabled for the column families (much of the data is highly compressible). My cluster has 7 nodes: 5 regionservers, 1 master and 1 zookeeper. The regionservers and master are c1.xlarges. Each regionserver has the tasktrackers that run the hadoop streaming jobs, and each regionserver also runs its own thrift server. Each mapper that does the load talks to the localhost's thrift server.
The row keys are a fixed string + an incremental number, then the order of the bytes is reversed, so runA123 becomes 321Anur. I thought of using murmur hash but was worried about collisions.
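
As an illustration of the key scheme just described, here is a minimal sketch. The helper names are hypothetical and it assumes ASCII keys; the actual loader in this thread is written in perl and is not shown.

```python
def make_row_key(prefix, sequence):
    """Build 'prefix + sequence' and reverse its bytes (hypothetical helper).

    ASSUMPTION: keys are plain ASCII, as in the runA123 example.
    """
    forward = (prefix + str(sequence)).encode("ascii")
    return forward[::-1]


def recover_parts(row_key, prefix):
    """Undo the reversal and split the key back into (prefix, sequence)."""
    forward = row_key[::-1].decode("ascii")
    if not forward.startswith(prefix):
        raise ValueError("row key does not carry the expected prefix")
    return prefix, int(forward[len(prefix):])


if __name__ == "__main__":
    print(make_row_key("runA", 123))
    # Consecutive sequence numbers differ in their first byte after reversal.
    for n in range(120, 125):
        print(n, make_row_key("runA", n))
```

Reversal is a one-to-one mapping, so distinct forward keys can never collide, which is the concern raised about hashing. It also moves the fastest-changing digit to the front of the key, so consecutive inserts do not all sort next to each other.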
As I add more insert jobs, each job's throughput goes down. Way down. I went from about 200 rows/sec/table per job with one job to about 24 rows/sec/table per job with 25 running jobs. The servers are mostly idle. I'm loading into two tables: one has several indexes and I'm loading into three column families, the other has no indexes and one column family. Both tables currently have only two regions each.
The regionserver that serves the indexed table's regions is using the most CPU but is 87% idle. The other servers are all at ~90% idle. There is no IO wait. The perl processes are barely ticking over. Java on the most "loaded" server is using about 50-60% of one CPU.
Normally when I do a load in a pseudo-distributed hbase (my development platform) perl's speed is the limiting factor and uses about 85% of a CPU. In this cluster they are using only 5-10% of a CPU as they are all waiting on thrift (hbase). When I run only 1 process on the cluster, perl uses much more of a CPU, maybe 70%.
Any tips or help in getting the speed/scalability up would be great.
Please let me know if you need any other info.
As I send this - it looks like the main table has split again and is being served by three regionservers. My performance is going up a bit (now 35 rows/sec/table per process), but it still seems like I'm not using the full potential of even the limited EC2 system: no IO wait and lots of idle CPU.


many thanks
-chris
--
Todd Lipcon
Software Engineer, Cloudera
