Current setup is three machines, one of which doubles as the master,
using distributed HDFS. One million rows / 1 column was just a test --
we definitely need to scale well beyond that, at which point MySQL
breaks down as a viable option. Besides the appeal of MapReduce for
offline processing, multi-column support is also definitely a
requirement, and an obvious next step for benchmarking.
I'm actually looking now at how to bulk load data properly, since it
took hours to load 1 million rows from a client doing lock/put/commit
for every row, whereas PerformanceEvaluation can do this in about 15
minutes with a single client.
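For reference, the win comes from replacing the per-row
lock/put/commit round trip with committing in batches. A minimal
sketch of the idea, independent of the actual client API -- the
commit_batch callable here is a hypothetical stand-in for whatever
bulk-write call the client exposes:

```python
# Sketch of batched loading: instead of one commit per row,
# buffer rows and commit once per batch. commit_batch() is a
# hypothetical stand-in for the client's bulk-write call.

def load_rows(rows, commit_batch, batch_size=1000):
    """Commit rows in batches of batch_size; return number of commits."""
    commits = 0
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            commit_batch(batch)
            commits += 1
            batch = []
    if batch:  # flush the final partial batch
        commit_batch(batch)
        commits += 1
    return commits
```

With one million rows and a batch size of 1,000, this issues 1,000
commits instead of 1,000,000 round trips.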
BTW, running PerformanceEvaluation randomRead with 5 clients (MR) I
get 1,687 reads/sec if I'm reading the results correctly:
08/02/08 01:14:40 INFO mapred.JobClient: Job complete: job_200802042127_0001
08/02/08 01:14:40 INFO mapred.JobClient: Counters: 12
08/02/08 01:14:40 INFO mapred.JobClient: HBase Performance Evaluation
08/02/08 01:14:40 INFO mapred.JobClient: Elapsed time in milliseconds=3107646
08/02/08 01:14:40 INFO mapred.JobClient: Row count=5242850
08/02/08 01:14:40 INFO mapred.JobClient: Job Counters
08/02/08 01:14:40 INFO mapred.JobClient: Launched map tasks=54
08/02/08 01:14:40 INFO mapred.JobClient: Launched reduce tasks=1
08/02/08 01:14:40 INFO mapred.JobClient: Data-local map tasks=51
08/02/08 01:14:40 INFO mapred.JobClient: Map-Reduce Framework
08/02/08 01:14:40 INFO mapred.JobClient: Map input records=50
08/02/08 01:14:40 INFO mapred.JobClient: Map output records=50
08/02/08 01:14:40 INFO mapred.JobClient: Map input bytes=3634
08/02/08 01:14:40 INFO mapred.JobClient: Map output bytes=700
08/02/08 01:14:40 INFO mapred.JobClient: Reduce input groups=50
08/02/08 01:14:40 INFO mapred.JobClient: Reduce input records=50
08/02/08 01:14:40 INFO mapred.JobClient: Reduce output records=50
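Sanity-checking that figure from the counters above (row count over
elapsed time):

```python
# Derive reads/sec from the job counters quoted above.
row_count = 5_242_850    # "Row count" counter
elapsed_ms = 3_107_646   # "Elapsed time in milliseconds" counter

reads_per_sec = row_count / (elapsed_ms / 1000.0)
print(round(reads_per_sec))  # 1687, matching the quoted figure
```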
Joost.
On 8-Feb-08, at 12:05 AM, stack wrote:
The test described can only favor MySQL (single column, just a
million rows). Do you need HBase?
You might also tell us more about your HBase setup. Is it using the
local FS or HDFS? Is it a distributed HDFS or all on a single server?
Thanks,
St.Ack
Joost Ouwerkerk wrote:
I'm working on a web application with primarily read-oriented
performance requirements. I've been running some benchmarking
tests that include our application layer, to get a sense of what is
possible with HBase. In a variation on the Bigtable test reproduced
by org.apache.hadoop.hbase.PerformanceEvaluation, I'm randomly
reading 1 column from a table with 1 million rows. In our case, the
contents of that column need to be deserialized by our application
(which adds some overhead that I'm also trying to measure); the
deserialized contents represent a little over 1K of data.
Although a single thread can only achieve 125 reads per second,
with 12 client threads (from 3 different machines) I'm able to read
as many as 500 objects per second. I've also replicated my test on
a basic MySQL table and am able to get a throughput of 2,300 reads/
sec; roughly 5 times what I'm seeing with HBase. Besides the
obvious code-maturity gap, is the discrepancy related to random
reads not actually being served from the memcache, but rather from
disk, by HBase? The HBase performance page (http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation)
shows random reads (mem) as "Not implemented."
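For what it's worth, those numbers also show the client threads are
far from scaling linearly. A quick back-of-the-envelope check, using
only the figures quoted above:

```python
# Scaling and comparison arithmetic from the figures above.
single_thread = 125    # reads/sec, 1 client thread
twelve_threads = 500   # reads/sec, 12 client threads
mysql = 2300           # reads/sec, MySQL comparison

speedup = twelve_threads / single_thread  # 4.0x from 12x the threads
per_thread = twelve_threads / 12          # ~41.7 reads/sec per thread
gap = mysql / twelve_threads              # ~4.6x, the "roughly 5 times"
print(speedup, round(per_thread, 1), round(gap, 1))
```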
Can anyone shed some light on the state of HBase's memcaching?
Cheers,
Joost.