Thank you all for your considered responses. I would like to make a few comments and clarifications (inline, below).
Jim Kellerman wrote:
>
> Andrew Purtell wrote:
> > First: How invasive to the HBase master/region model is
> > the concept of specifying constraints on data mobility?
>
> It would be very disruptive. The current model is that
> you run one or more HBase clusters per HDFS cluster. An
> HBase cluster does not span HDFS clusters.
> As far as I know HDFS clusters do not span data centers.
> Latency and network partitioning would be big problems
> for a system that requires sub-second response times.

I was not suggesting spanning HDFS, only spanning HBase across several HDFS clusters. The configuration, for example, might look like:

    master in US
    region server and local HDFS cluster in US
    region server and local HDFS cluster in EU
    region server and local HDFS cluster in APAC
    etc.

where each region server is backed by a local HDFS cluster on a gigabit backplane, and in each region globally distributed map-reduce jobs execute with data-driven regional differences. Yet, at the same time, jobs in any given region can query rows generated in another via globally distributed/available table(s).

I have set up this configuration in the lab using 0.15.1 (compiled by hadoopqa from revision 596497), even with artificial latency introduced to simulate international links, and I can say that it works for me. It may only work by accident. Also, my testing thus far has been rather limited: e.g. create a table on one cluster, then insert on another, then select from a third, etc.

Fault tolerance considerations due to an elevated risk of network partition are of course an issue. Allowing modified region servers to continue serving explicitly partitioned tables in the extended absence of communication with the master might be a first-cut option, but I suspect you'd take a dim view of this, perhaps as "pollution" of a clean model with hacks.
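To make the idea of data-mobility constraints over the US/EU/APAC layout above a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the names Site, TableConstraint and Placement are hypothetical, nothing here is HBase API, and it does not describe how 0.15.1 assigns regions. "Site" is used for a geographic location to avoid confusion with an HBase region.

```python
from dataclasses import dataclass


# Hypothetical model only: none of these types exist in HBase.
@dataclass(frozen=True)
class Site:
    name: str          # geographic site, e.g. "US", "EU", "APAC"
    hdfs_cluster: str  # label for the local HDFS cluster backing this site


@dataclass(frozen=True)
class TableConstraint:
    table: str
    allowed_sites: frozenset  # sites allowed to host this table's data


class Placement:
    """Tracks which sites may host, and must be queried for, each table."""

    def __init__(self, sites, constraints):
        self.sites = {s.name: s for s in sites}
        self.constraints = {c.table: c for c in constraints}

    def may_assign(self, table, site_name):
        """Return True if data for `table` may be placed at `site_name`."""
        if site_name not in self.sites:
            return False
        constraint = self.constraints.get(table)
        # A table with no constraint may live at any known site.
        return constraint is None or site_name in constraint.allowed_sites

    def query_scope(self, table, requested=None):
        """Return the sorted list of sites a query on `table` needs to contact.

        `requested` optionally narrows the scope further, mirroring the
        idea of a query extension that limits a query to given sites.
        """
        constraint = self.constraints.get(table)
        if constraint is None:
            candidates = set(self.sites)
        else:
            candidates = set(constraint.allowed_sites) & set(self.sites)
        if requested is not None:
            candidates &= set(requested)
        return sorted(candidates)


# Example layout: three sites, one table pinned to EU, others unconstrained.
placement = Placement(
    sites=[
        Site("US", "hdfs-us"),
        Site("EU", "hdfs-eu"),
        Site("APAC", "hdfs-apac"),
    ],
    constraints=[TableConstraint("eu_events", frozenset({"EU"}))],
)
```

In this sketch a map-reduce job running in EU would call placement.query_scope("eu_events") and only contact the EU site, while a query on an unconstrained table would fan out to every site unless the caller narrows it with `requested`.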
Sub-second response times should not be a problem because, in addition to constraints on data mobility, we'd use query extensions to limit query scope to the region(s) where the data is known to reside for the bulk of map-reduce operations.

> A change such as this would require major changes to the
> architecture and our vision of the model going forward.
> (replication between data centers and a single table
> residing in multiple data centers being served by
> separate HBase instances running on separate HDFS
> clusters).

And I thank you for this, and also for the -1 from Edward, as it is instructive as to how divergent our ideas for using HBase might be from the community's, at least with respect to what amounts to cluster federation. Anyway, at this time, we are only considering these things.

Best regards,

Andrew Purtell
Advanced Threats Research
Trend Micro, Inc, Pasadena, CA USA
(personal mail)