Ryan, it makes sense, so the order randomization, or in other words the load balancing, probably has to be handled at the block/region level rather than at the single-row level to reduce the number of RPC calls.
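
To make that concrete, here is a minimal sketch of one way to spread writes at the region level: prefix each row key with a small salt bucket so that rows adjacent in time land in different regions instead of one hot region. Everything here is illustrative rather than anything from this thread: the "metrics" table name, the "d" column family and the bucket count are made up, and it assumes the plain HBase Java client (HTable/Put).

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SaltedWrite {
    // roughly the number of regions/servers you want writes spread across
    static final int BUCKETS = 16;

    // prepend a one-byte bucket id to the timestamp key; consecutive
    // timestamps then map to different buckets, hence different regions
    static byte[] saltedKey(long ts) {
      byte bucket = (byte) (ts % BUCKETS);
      return Bytes.add(new byte[] { bucket }, Bytes.toBytes(ts));
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "metrics");   // hypothetical table
      long ts = System.currentTimeMillis();
      Put put = new Put(saltedKey(ts));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("reading"));
      table.put(put);
      table.close();
    }
  }

The trade-off is exactly what Ryan points out below: a time-range read now has to touch every bucket, so it costs BUCKETS scans instead of one.
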
so then the question is: does HBase load balancing work as described in the
original BigTable paper ("Each tablet is assigned to one tablet server at a
time", I guess with some kind of round-robin partitioning), or does it
preserve data locality by storing each region's data (a collection of HFiles,
if I understand correctly) as a contiguous sequence of data blocks on the HDFS
datanodes? I'm looking at the documentation but I don't see it specifically
addressed there.

Thanks
Alex

On Sat, Apr 24, 2010 at 5:37 PM, Ryan Rawson <ryano...@gmail.com> wrote:
> While that sounds right, the issue is the overhead of multiple RPC calls.
> If your data was spread out, so would be your RPC pattern.
>
> The advantage of HBase is that you can have multiple concurrent scans on
> different servers and they won't share resources. Thus you can scale.
>
> Underlying HBase is HDFS, which allows us to use more disk spindles on both
> local and remote machines. This also allows a single machine to scale well,
> especially when you use 4 or more disks.
>
> On Apr 24, 2010 2:28 PM, "alex kamil" <alex.ka...@gmail.com> wrote:
>
> Ryan,
>
> wouldn't storing time series data in chronological order be sub-optimal for
> sequential scans and range queries?
> Let's say there is a large chunk of data (e.g. 10M rows) representing 1 hr
> of recordings, stored in multiple regions on a single node/regionserver.
> Then if we run a range query for that time period, we will not utilize the
> entire cluster and will be largely IO bound and limited by a single node's
> read throughput.
> I'm thinking of randomizing the input sequence order during insertion to
> improve access time.
>
> thanks
> Alex
>
>
>
> On Sat, Apr 24, 2010 at 4:45 PM, Ryan Rawson <ryano...@gmail.com> wrote:
> >
> > Hey,
> >
> > So in my cas...
> >
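
(For completeness, the read side under the same made-up bucketing scheme from the sketch above: a time-range query turns into one scan per bucket, and because the buckets live in different regions the scans can be served by different region servers, which is the multi-server scaling Ryan describes. Shown sequentially for brevity; a thread pool would issue them in parallel.)

  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SaltedScan {
    // scan [startTs, endTs) in every salt bucket; BUCKETS and the key layout
    // are the ones assumed in the write-side sketch earlier in this mail
    static void scanRange(HTable table, long startTs, long endTs) throws Exception {
      for (int b = 0; b < SaltedWrite.BUCKETS; b++) {
        byte[] prefix = new byte[] { (byte) b };
        Scan scan = new Scan(Bytes.add(prefix, Bytes.toBytes(startTs)),
                             Bytes.add(prefix, Bytes.toBytes(endTs)));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          // merge/aggregate r with the results from the other buckets ...
        }
        scanner.close();
      }
    }
  }
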