I think what you're saying is that you are mostly interested in data locality. I don't think it's done yet, but it would be pretty easy to make HBase provide start keys as well as region locations as the splits for a MapReduce job. In theory, that would give you all the pieces you need to run locality-aware processing.
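
To make that concrete, here's a rough sketch of what such a split might carry. This is hypothetical code, not anything in HBase today; the class and field names are made up. The key part is getLocations(): Hadoop's scheduler uses it as a hint to run each map task on the host that holds the data.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;

/**
 * Hypothetical split for one HBase region: the region's start row plus
 * the host serving it. The JobTracker consults getLocations() when
 * placing map tasks, which is the hook locality-aware processing needs.
 */
public class RegionSplit implements InputSplit {
  private Text startRow = new Text();
  private Text regionServerHost = new Text();

  public RegionSplit() {}  // no-arg constructor required for Writable

  public RegionSplit(String startRow, String host) {
    this.startRow = new Text(startRow);
    this.regionServerHost = new Text(host);
  }

  public Text getStartRow() { return startRow; }

  public long getLength() { return 0; }  // region size unknown in this sketch

  public String[] getLocations() {       // scheduling hint for the JobTracker
    return new String[] { regionServerHost.toString() };
  }

  public void write(DataOutput out) throws IOException {
    startRow.write(out);
    regionServerHost.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    startRow.readFields(in);
    regionServerHost.readFields(in);
  }
}

An InputFormat would then emit one such split per region, and a map task scheduled on that host could scan its region without the rows ever leaving the machine (modulo HDFS replication underneath).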

-Bryan

On Apr 24, 2008, at 10:16 AM, Leon Mergen wrote:

Hello,

I'm sorry if a question like this has been asked before, but I was unable to find an answer anywhere on Google; if it is off-topic, I apologize in advance.

I'm trying to look a bit into the future and predict scalability problems for the company I work for: we're using PostgreSQL and processing many writes per second (access logs, currently around 250, but this will only increase significantly in the future). Furthermore, we perform data mining on this data and ideally need to have it stored in a structured form (the data is searched in various ways). In other words: a very interesting problem.

Now, I'm trying to understand a bit of the Hadoop/HBase architecture: as I understand it, HDFS, MapReduce and HBase are sufficiently decoupled that the use case I was hoping for is not available; however, I'm still going to ask:


Is it possible to store this data in HBase, and thus have all access logs distributed amongst many different servers, and start MapReduce jobs on those actual servers, so that all the data is processed in place? In other words, the data never leaves the servers it is stored on?

If this isn't possible, is it because someone simply never took the time to implement such a thing, or is it hard to fit into the design (for example, because the JobTracker would need to be aware of the physical locations of all the data, since you don't want to analyze the same (replicated) data twice)?

From what I understand from playing with Hadoop for the past few days, the idea is that you fetch your MapReduce data from HDFS rather than BigTable, or am I mistaken?
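
For illustration, this is the kind of minimal HDFS-fed job I've been experimenting with, written against the old org.apache.hadoop.mapred API; the input/output paths and the log-field index are just placeholders:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class LogCount {
  // Map: one access-log line in, (requested URL, 1) out.
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter r)
        throws IOException {
      String[] fields = line.toString().split(" ");
      if (fields.length > 6) {  // crude common-log-format parse
        out.collect(new Text(fields[6]), new IntWritable(1));
      }
    }
  }

  // Reduce: sum the hit counts per URL.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text url, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter r)
        throws IOException {
      int sum = 0;
      while (counts.hasNext()) sum += counts.next().get();
      out.collect(url, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(LogCount.class);
    conf.setJobName("access-log-count");
    conf.setMapperClass(LineMapper.class);
    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(conf, new Path("/logs/access"));
    FileOutputFormat.setOutputPath(conf, new Path("/logs/counts"));
    JobClient.runJob(conf);
  }
}

Here the locality comes entirely from HDFS block placement; what I'm asking about is whether the same thing can happen when the rows live in HBase instead.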

Thanks for your time!

Regards,

Leon Mergen
