Hi,

I work at Mendeley.com, and we are currently looking into ways to store a large
collection of document metadata that we need to process and then use to update
our live site. HBase is one of the systems we are considering, as it integrates
with Hadoop, which we will definitely be using in the near future.

My first question is: if we initially run the standalone mode on a single
server whilst we set up and test our systems, is it possible to migrate from
that to a fully distributed setup later without having to rebuild the data
store?
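
To make that concrete, my understanding is that the switch would mostly be a
case of repointing hbase.rootdir from the local filesystem to HDFS and turning
on distributed mode, something like the hbase-site.xml sketch below (the
hostname and paths are just placeholders, so please correct me if I have this
wrong):

    <!-- standalone: data kept on the local filesystem -->
    <property>
      <name>hbase.rootdir</name>
      <value>file:///var/hbase</value>
    </property>

    <!-- distributed: root directory moved into HDFS, cluster mode enabled -->
    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://namenode.example.com:9000/hbase</value>
    </property>
    <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
    </property>
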
My second question is more about trying to understand how to model data in
HBase. If we have documents with many authors, each of which has a varying
amount of metadata, what is a good approach to storing this? From what I have
read about HBase it could be done with a column family on the document row,
say author_name: and author_email:, but if there is an unknown number of
author properties that probably isn't the best way. Would it be better to
store the author data in a separate table?
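
To show what I mean, here is a rough sketch of the per-document layout I was
imagining, using the HBase Java client (the table, family and qualifier names
are just made up for illustration, and I may well have the client API slightly
wrong):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AuthorColumnsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table documents = conn.getTable(TableName.valueOf("documents"))) {
                // One row per document; each author's fields sit under an
                // "author" column family with a per-author index in the qualifier.
                Put put = new Put(Bytes.toBytes("doc-12345"));
                put.addColumn(Bytes.toBytes("author"), Bytes.toBytes("0:name"),
                        Bytes.toBytes("A. N. Author"));
                put.addColumn(Bytes.toBytes("author"), Bytes.toBytes("0:email"),
                        Bytes.toBytes("a.author@example.org"));
                put.addColumn(Bytes.toBytes("author"), Bytes.toBytes("1:name"),
                        Bytes.toBytes("B. Other"));
                documents.put(put);
            }
        }
    }

The alternative I had in mind was a separate "authors" table keyed by an author
id, with the document row only storing the list of author ids.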

My last question is about running Map/Reduce on top of HBase: is the
Map/Reduce code still location-aware with respect to where the data is stored
in HDFS, or does going through HBase create a larger I/O bottleneck than
reading from HDFS directly?
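
For context, this is roughly the kind of job I have in mind, based on the
TableMapper / TableMapReduceUtil classes I've seen in the docs (again just a
sketch, a map-only row count over a hypothetical "documents" table, so the
details may well be off):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class DocumentScanJob {
        // Map-only job: as I understand it each map task is handed one slice
        // of the table to scan; here it just counts the rows it sees.
        static class CountMapper extends TableMapper<NullWritable, NullWritable> {
            @Override
            protected void map(ImmutableBytesWritable row, Result result, Context context)
                    throws IOException, InterruptedException {
                context.getCounter("documents", "rows").increment(1);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "document-scan");
            job.setJarByClass(DocumentScanJob.class);

            Scan scan = new Scan();
            scan.setCaching(500);        // batch more rows per RPC during the scan
            scan.setCacheBlocks(false);  // don't fill the block cache from a full scan

            TableMapReduceUtil.initTableMapperJob(
                    "documents", scan, CountMapper.class,
                    NullWritable.class, NullWritable.class, job);
            job.setNumReduceTasks(0);
            job.setOutputFormatClass(NullOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

What I'm really wondering is whether the map tasks for a job like this get
scheduled close to the servers holding their part of the table, or whether all
of the reads effectively go over the network.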

If we choose to use HBase, I hope to become more active in the community here
soon!

Thanks,
Dan Harvey
