Thanks a lot, St.Ack, for the time you spend answering user questions and for developing this nice piece of software (HBase).

stack wrote:
The amount of replication should have no effect on either access mode.
 Whether scanning or random-accessing, only one of the N replicas is
accessed.  We'll only go to the other versions if there is trouble accessing
the first.
So, more replicas will not change the performance profile.
I am not sure whether HBase or Hadoop is responsible for choosing the location of a replica. Having more replicas may not avoid the random-read limitations of the disks, but shouldn't it at least avoid some network latency? If I have a web application with N clients accessing HBase, and one of those clients has to get the value for a key, shouldn't that read be faster when the value for that key is stored on that same node (since we avoid a network call)? But you are right that it does not look like I can get around the disks' random-read performance limitations.
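
To make sure I understand the read path, here is a minimal sketch of what I think a read looks like with the HTable-based HBase Java client (the table name "mytable", column family "cf", qualifier "q" and row key "row1" are just made-up placeholders). My understanding is that the client locates the single region server hosting the row's region and sends the Get there, so which DataNodes hold the HDFS block replicas is invisible to the application client:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ReadPathSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          // The client looks up the one region server currently serving the
          // region that contains "row1" (via the catalog tables) and sends
          // the Get to that server, wherever it sits in the cluster.
          HTable table = new HTable(conf, "mytable");   // placeholder table name
          Get get = new Get(Bytes.toBytes("row1"));     // placeholder row key
          Result result = table.get(get);
          byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
          System.out.println(Bytes.toString(value));
          // Which of the N HDFS replicas of the underlying store file is read
          // is decided between that region server and HDFS, not by this client.
          table.close();
      }
  }

So, if I read this correctly, running my web application on the same node as a DataNode does not by itself make the Get local; only the region server serving that region benefits from having a replica on its own node.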
What do you need to improve?  Are both scans and random-reads slow for you?
  You've seen the performance page up on the wiki (I'm sure you have).
Unfortunately, I am not in a position to really benchmark my application, as I currently can't run it on a true cluster (using a cluster of virtual machines would lead to obviously wrong results ;). At this stage I am just trying to understand how HBase/Hadoop works, so as to avoid big mistakes in the design of the architecture. My application currently runs in production on a PostgreSQL database: I replicate it over several nodes, and read access performs better with more replicas because each application node connects to a local database.

Thanks
TuX
