Hello all, 

First, thanks to all the HBase developers for producing this, it's a great 
project and I'm glad to be able to use it.

I'm looking for some help and hints with insert performance. I'm doing some 
benchmarking, testing how well I can scale up with HBase, not really looking at 
raw speed. The testing is happening on EC2, using Andrew's scripts (thanks - 
those were very helpful) to set the instances up, with a slightly customized 
version of the default AMIs (I added my application modules). I'm using HBase 
0.20.3 and Hadoop 0.20.1. I've looked at the tips in the wiki and it looks like 
Andrew's scripts are already set up that way.

I'm inserting into HBase from a hadoop streaming job that runs perl and uses 
the thrift gateway. I'm also using the transactional tables, so that alone 
could be the cause, but from what I can tell I don't think so. LZO compression 
is also enabled for the column families (much of the data is highly 
compressible). My cluster has 7 nodes: 5 regionservers, 1 master and 1 
zookeeper. The regionservers and master are c1.xlarges. Each regionserver also 
runs the tasktracker that executes the hadoop streaming jobs, and each 
regionserver runs its own thrift server. Each mapper that does the load talks 
to its localhost's thrift server.

The row keys are a fixed string + an incremental number, with the byte order 
then reversed, so runA123 becomes 321Anur (to spread the writes out instead of 
always hitting the last region). I thought of using a murmur hash but was 
worried about collisions.
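
For reference, the key generation is just this (a Python sketch; the prefix 
and counter values are illustrative):

    def make_row_key(prefix, n):
        """Build 'prefix + counter' and reverse the bytes so sequential
        inserts don't all land on the same region."""
        return (prefix + str(n))[::-1]

    assert make_row_key('runA', 123) == '321Anur'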

As I add more insert jobs, each job's throughput goes down. Way down. I went 
from about 200 rows/sec/table per job with one job to about 24 rows/sec/table 
per job with 25 jobs running. The servers are mostly idle. I'm loading into two 
tables: one has several indexes and I'm loading into three column families, the 
other has no indexes and one column family. Both tables currently have only two 
regions each.

The regionserver that serves the indexed table's regions is using the most CPU 
but is still 87% idle. The other servers are all at ~90% idle. There is no IO 
wait, and the perl processes are barely ticking over. Java on the most "loaded" 
server is using about 50-60% of one CPU.

Normally when I do a load in a pseudo-distributed hbase (my development 
platform), perl's speed is the limiting factor and it uses about 85% of a CPU. 
On this cluster the perl processes are using only 5-10% of a CPU, as they are 
all waiting on thrift (hbase). When I run only 1 process on the cluster, perl 
uses much more of a CPU, maybe 70%.

Any tips or help in getting the speed/scalability up would be great. Please let 
me know if you need any other info.

As I send this, it looks like the main table has split again and is being 
served by three regionservers. My performance is going up a bit (now 35 
rows/sec/table per process), but it still seems like I'm not using the full 
potential of even this limited EC2 setup: no IO wait and lots of idle CPU.


many thanks
-chris





