I think what you're saying is that you are mostly interested in data locality. I don't think it's done yet, but it would be pretty easy to make HBase provide start keys as well as region locations for the splits of a MapReduce job. In theory, that would give you all the pieces you need to run locality-aware processing.
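To make the idea concrete, here is a rough sketch in plain Java. The names (`TableSplit`, `schedule`) are hypothetical, not the actual Hadoop/HBase API: each split would carry a region's start key plus the hostname of the region server holding it, and the scheduler would prefer to place the map task for that split on that same host.

```java
import java.util.ArrayList;
import java.util.List;

public class LocalityDemo {
    // Hypothetical model of one table split: a region's key range
    // plus the location (hostname) of the region server holding it.
    static class TableSplit {
        final String startKey;   // first row key served by this region
        final String location;   // hostname of the region server
        TableSplit(String startKey, String location) {
            this.startKey = startKey;
            this.location = location;
        }
    }

    // Prefer to run each split on the host that serves its region;
    // fall back to any free host when that one has no open slot.
    static String schedule(TableSplit split, List<String> hostsWithFreeSlots) {
        if (hostsWithFreeSlots.contains(split.location)) {
            return split.location;          // data-local assignment
        }
        return hostsWithFreeSlots.get(0);   // non-local fallback
    }

    public static void main(String[] args) {
        List<TableSplit> splits = new ArrayList<>();
        splits.add(new TableSplit("", "regionserver1"));
        splits.add(new TableSplit("row5000", "regionserver2"));

        List<String> free = List.of("regionserver1", "regionserver2");
        for (TableSplit s : splits) {
            System.out.println("[" + s.startKey + "] -> " + schedule(s, free));
        }
    }
}
```

This is the same trick MapReduce already uses for HDFS blocks: the split only has to *report* its preferred locations, and the JobTracker does the rest.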
-Bryan
On Apr 24, 2008, at 10:16 AM, Leon Mergen wrote:
Hello,
I'm sorry if a question like this has been asked before, but I was unable to find an answer for this anywhere on Google; if it is off-topic, I apologize in advance.
I'm trying to look a bit into the future and predict scalability problems for the company I work for: we're using PostgreSQL and processing many writes per second (access logs, currently around 250, but this will increase significantly in the future). Furthermore, we perform data mining on this data and ideally need to have it stored in a structured form (the data is searched in various ways). In other words: a very interesting problem.
Now, I'm trying to understand a bit of the Hadoop/HBase architecture: as I understand it, HDFS, MapReduce and HBase are sufficiently decoupled that the use case I was hoping for is not available; however, I'm still going to ask:

Is it possible to store this data in HBase, and thus have all access logs distributed amongst many different servers, and start MapReduce jobs on those actual servers, which process all the data on those servers? In other words, the data never leaves the actual servers?
If this isn't possible, is it because someone simply never took the time to implement such a thing, or is it hard to fit into the design (for example, because the JobTracker needs to be aware of the physical locations of all the data, since you don't want to analyze the same (replicated) data twice)?
From what I understand from playing with Hadoop for the past few days, the idea is that you fetch your MapReduce data from HDFS rather than BigTable, or am I mistaken?
Thanks for your time!
Regards,
Leon Mergen