Looking at the counters below, yeah, it looks like ~1,700 reads/sec. The
PE job prints numbers on STDOUT too; you could check that these jibe with
the MapReduce stats reporting.
Thanks for doing these interesting comparisons,
St.Ack
P.S. I was going to write more on why MySQL will always trounce us while
we're slinging data in formats and sizes that are manageable by a single
MySQL instance -- MySQL is able to directly exploit OS features such as
memory-mapping, caching, etc., features that may be frustrated by hbase
access patterns atop our HDFS indirection -- but I gave up on it since I
was only stating the obvious. (You've seen this page,
http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation, and how the
graphs now include measuring pure MapFile accesses?)
Joost Ouwerkerk wrote:
Current setup is three machines, one of which doubles as the master,
using distributed HDFS. One million rows / 1 column was just a test --
we definitely need to scale way beyond that, at which point MySQL will
break down as a viable option. Besides the appeal of MapReduce for
offline processing, multi-column is also definitely a requirement, and
an obvious next step for benchmarking.
Actually looking now at how to bulk load data properly, since it took
hours to load 1 million rows from a client doing lock/put/commit for
every row, whereas PerformanceEvaluation can do this in about 15
minutes with a single client.
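The gap is easy to see as a back-of-envelope calculation: with one commit round trip per row, a single client's load rate is bounded by per-row latency. A minimal sketch of that arithmetic in Java (the 2-hour figure is an assumption for illustration; above I only said "hours"):

```java
// Back-of-envelope: why per-row lock/put/commit from one client is slow.
// Each row pays a full client/server round trip, so the load rate is
// bounded by 1/latency. The 2-hour load time is assumed for illustration.
public class LoadRate {
    public static void main(String[] args) {
        long rows = 1_000_000L;
        double hours = 2.0;  // assumed; the thread only says "hours"
        double perRowMs = hours * 3_600_000 / rows;
        System.out.printf("single client: %.1f ms/row, %.0f rows/sec%n",
                perRowMs, rows / (hours * 3600));  // 7.2 ms/row, 139 rows/sec
        double peMinutes = 15.0;  // PerformanceEvaluation's reported time
        System.out.printf("PE: %.0f rows/sec%n",
                rows / (peMinutes * 60));  // 1111 rows/sec
    }
}
```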
BTW, running PerformanceEvaluation randomRead with 5 clients (via MR), I
get 1,687 reads/sec, if I'm reading the results correctly:
08/02/08 01:14:40 INFO mapred.JobClient: Job complete: job_200802042127_0001
08/02/08 01:14:40 INFO mapred.JobClient: Counters: 12
08/02/08 01:14:40 INFO mapred.JobClient: HBase Performance Evaluation
08/02/08 01:14:40 INFO mapred.JobClient: Elapsed time in milliseconds=3107646
08/02/08 01:14:40 INFO mapred.JobClient: Row count=5242850
08/02/08 01:14:40 INFO mapred.JobClient: Job Counters
08/02/08 01:14:40 INFO mapred.JobClient: Launched map tasks=54
08/02/08 01:14:40 INFO mapred.JobClient: Launched reduce tasks=1
08/02/08 01:14:40 INFO mapred.JobClient: Data-local map tasks=51
08/02/08 01:14:40 INFO mapred.JobClient: Map-Reduce Framework
08/02/08 01:14:40 INFO mapred.JobClient: Map input records=50
08/02/08 01:14:40 INFO mapred.JobClient: Map output records=50
08/02/08 01:14:40 INFO mapred.JobClient: Map input bytes=3634
08/02/08 01:14:40 INFO mapred.JobClient: Map output bytes=700
08/02/08 01:14:40 INFO mapred.JobClient: Reduce input groups=50
08/02/08 01:14:40 INFO mapred.JobClient: Reduce input records=50
08/02/08 01:14:40 INFO mapred.JobClient: Reduce output records=50
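For what it's worth, the 1,687 figure falls straight out of two of the counters above: Row count divided by aggregate elapsed seconds. A quick sketch of that arithmetic:

```java
// Recomputing the aggregate read throughput from the job counters above:
// total rows read divided by total elapsed seconds across the clients.
public class Throughput {
    public static void main(String[] args) {
        long rowCount = 5_242_850L;   // "Row count" counter
        long elapsedMs = 3_107_646L;  // "Elapsed time in milliseconds" counter
        double readsPerSec = rowCount / (elapsedMs / 1000.0);
        System.out.printf("%.0f reads/sec%n", readsPerSec);  // prints "1687 reads/sec"
    }
}
```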
Joost.
On 8-Feb-08, at 12:05 AM, stack wrote:
The test described can only favor MySQL (single column, just a
million rows). Do you need Hbase?
You might also tell us more about your hbase setup. Is it using
localfs or hdfs? Is it a distributed hdfs or all on a single server?
Thanks,
St.Ack
Joost Ouwerkerk wrote:
I'm working on a web application with primarily read-oriented
performance requirements. I've been running some benchmarking tests
that include our application layer, to get a sense of what is
possible with Hbase. In a variation on the Bigtable test reproduced
by org.apache.hadoop.hbase.PerformanceEvaluation, I'm randomly
reading 1 column from a table with 1 million rows. In our case, the
contents of that column need to be deserialized by our application
(which adds some overhead that I'm also trying to measure); the
deserialized contents represent a little over 1K of data.
Although a single thread can only achieve 125 reads per second, with
12 client threads (from 3 different machines) I'm able to read as
many as 500 objects per second. Now, I've replicated my test on a
basic MySQL table and am able to get a throughput of 2,300
reads/sec -- roughly 5 times what I'm seeing with Hbase. Besides the
obvious code-maturity factor, is the discrepancy related to random
reads not actually being served from memcache by Hbase, but rather
from disk? The HBase performance page
(http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation) shows
random reads (mem) as "Not implemented."
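As an aside on methodology, the multi-threaded numbers above come from a harness along these lines; the Reader interface, the dummy 2 ms reader, and the counts here are hypothetical stand-ins, not our actual application code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Minimal multi-threaded random-read benchmark harness. Reader is a
// hypothetical stand-in for the real HBase or MySQL access layer.
public class ReadBench {
    interface Reader { void readRandomRow() throws Exception; }

    static double readsPerSec(Reader reader, int threads, int readsPerThread)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong done = new AtomicLong();
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < readsPerThread; i++) {
                    try {
                        reader.readRandomRow();  // one random read per iteration
                        done.incrementAndGet();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        double secs = (System.nanoTime() - start) / 1e9;
        return done.get() / secs;  // aggregate reads/sec across all threads
    }

    public static void main(String[] args) throws Exception {
        // Dummy reader sleeping 2 ms to mimic a fixed per-read latency.
        Reader dummy = () -> Thread.sleep(2);
        System.out.printf("%.0f reads/sec%n", readsPerSec(dummy, 12, 50));
    }
}
```

The point of the harness is that aggregate throughput, not single-thread latency, is what matters for a read-heavy web tier.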
Can anyone shed some light on the state of HBase's memcaching?
Cheers,
Joost.