Thanks a lot St.Ack for the time you spend answering user questions and
for developing this nice piece of software (HBase).
stack wrote:
The amount of replication should have no effect on either access mode.
Whether scanning or random-accessing, only one of the N replicas is
accessed. We'll only go to the other versions if there is trouble accessing
the first.
So, more replicas will not change the performance profile.
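OK, so a given row is served by exactly one region server at a time, and the
read RPC always goes to that server no matter how many HDFS replicas sit
underneath. Just to illustrate how I understand it, something like this shows
which server owns a row (a minimal sketch assuming a 0.90-era Java client; the
table name and row key are made up, and getRegionLocation may differ in other
versions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WhoServesThisRow {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // hypothetical table name
        // A row lives in exactly one region, hosted by one region server;
        // the client sends its read there regardless of dfs.replication.
        HRegionLocation where = table.getRegionLocation(Bytes.toBytes("some-row-key"));
        System.out.println(where);
        table.close();
      }
    }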
I am not sure whether HBase or Hadoop is responsible for choosing the
location of the replicas. Having more replicas may not get around the
random-read disk access limitations, but shouldn't it at least reduce network latency?
If I have a web application with N clients accessing HBase, and one of
those clients has to get the value for a key, shouldn't it be faster to
read that value if a copy of it is stored on that same node (since we avoid
a network call)? But you are right, it does not seem I can get around the
disk random-read performance limitations.
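To check whether a node really does have a local copy, I suppose one can ask
the namenode where the blocks of the underlying store files live, roughly like
this (a minimal sketch against the plain Hadoop FileSystem API; the path
argument is just a placeholder for a store file under the HBase root directory):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // Ask the namenode which datanodes hold each block of the file.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("offset " + block.getOffset() + " -> "
              + java.util.Arrays.toString(block.getHosts()));
        }
      }
    }

If the region server (or the client) runs on one of the listed hosts, reading
that block is local; otherwise it crosses the network.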
What do you need to improve? Are both scans and random-reads slow for you?
You've seen the performance page up on the wiki (I'm sure you have).
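Just so we mean the same thing by the two access patterns, here is roughly how
I picture a random read versus a scan with the Java client (again a minimal
sketch assuming a 0.90-era API; table name and row key are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetVersusScan {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // hypothetical table name

        // Random read: one round trip to the region server holding this row.
        Result single = table.get(new Get(Bytes.toBytes("some-row-key")));
        System.out.println("random read: " + single);

        // Sequential read: the scanner streams rows back in batches.
        Scan scan = new Scan();
        scan.setCaching(100);                         // rows fetched per RPC
        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
          // process each row here
        }
        scanner.close();
        table.close();
      }
    }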
Unfortunately I am not in a position to really benchmark my application,
as I currently can't run it on a true cluster (using a cluster of
virtual machines would lead to obviously misleading results ;). At this stage
I am just trying to understand how HBase and Hadoop work so that I avoid big
mistakes in the design of the architecture. My application currently
runs in production on a PostgreSQL database: I replicate it over several
nodes, and read access performs better when I have more replicas because
each node connects to a local database.
Thanks
TuX