Re: realtime hadoop

Fernando Padilla Mon, 23 Jun 2008 22:26:00 -0700

One use case I have a question about is using Hadoop to power a web search or other query, where the full job should be done in under a second, from start to finish.

You know, you have a huge datastore, and you have to run a query against that, implemented as an MR job. Is there a way to optimize that use case, where the code doesn't change, but maybe the input parameters of the job do? So an MR job could reuse the Java code, and even the same JVM, to avoid all of the startup costs.


<digression>

I bet Hadoop isn't built for that yet (and there are enough reasons not to support it yet), but maybe it's a use case that shouldn't be totally ignored.

And if you think about it, this is similar to what HBase is doing, at least the query execution part: a dedicated MR daemon running on top of the Hadoop infrastructure, so you don't incur the cost of distributing and starting fresh MR/JVM processes across the cluster. Maybe someone would want to refactor this thought process a little bit.

</digression>


Matt Kent wrote:

We use Hadoop in a similar manner, to process batches of data in
real-time every few minutes. However, we do substantial amounts of
processing on that data, so we use Hadoop to distribute our computation.
Unless you have a significant amount of work to be done, I wouldn't
recommend using Hadoop because it's not worth the overhead of launching
the jobs and moving the data around.

Matt

On Tue, 2008-06-24 at 13:34 +1000, Ian Holsman (Lists) wrote:

Interesting.

We are planning on using Hadoop to provide 'near' real-time log analysis. We plan on having files close every 5 minutes (1 per log machine, so 80 files every 5 minutes) and then have an M/R job merge them into a single file that will get processed by other jobs later on.


Do you think the namespace will explode?

I wasn't thinking of clouddb. It might be an interesting alternative once it is a bit more stable.


regards
Ian

Stefan Groschupf wrote:

Hadoop might be the wrong technology for you.
MapReduce is a batch processing mechanism. HDFS might also be a
problem, since you need to close a file before you can access its
data. That means you might end up with many small files, a situation
where HDFS is not very strong (the namespace is held in memory).
HBase might be an interesting tool for you, also ZooKeeper if you want
to do something home grown...
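
To put rough numbers on the small-files concern for the plan of 80 files
every 5 minutes, here is a back-of-the-envelope sketch in Python. The
150 bytes per namespace object is only a commonly quoted rule of thumb and
is an assumption here, as is one block per file; the function names are
made up for illustration. Treat the result as an order-of-magnitude guess,
not a measurement.

```python
# Rough estimate of how many files the plan creates and how much
# NameNode heap the namespace might need. All constants except the
# 80 files / 5 minutes figure are assumptions for illustration.

MINUTES_PER_DAY = 24 * 60
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value


def files_per_day(files_per_interval, interval_minutes):
    """Number of files closed per day at a fixed interval."""
    intervals_per_day = MINUTES_PER_DAY // interval_minutes
    return files_per_interval * intervals_per_day


def namespace_bytes(num_files, blocks_per_file=1,
                    bytes_per_object=BYTES_PER_OBJECT):
    """Approximate heap used: one object per file plus one per block."""
    return num_files * (1 + blocks_per_file) * bytes_per_object


raw_per_day = files_per_day(80, 5)     # unmerged: 80 files every 5 minutes
merged_per_day = files_per_day(1, 5)   # merged: 1 file every 5 minutes

for label, per_day in (("unmerged", raw_per_day), ("merged", merged_per_day)):
    per_year = per_day * 365
    mb = namespace_bytes(per_year) / 1e6
    print(f"{label}: {per_day} files/day, {per_year} files/year, "
          f"~{mb:.0f} MB of namespace after one year")
```

Under these assumptions, keeping the unmerged files adds about 23,040
files per day, while the merged output adds 288 per day, so whether the
namespace grows uncomfortably depends mostly on deleting the small input
files after each merge.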



On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:

Hi!

I am considering using Hadoop for (almost) real-time data processing. I
have data coming in every second, and I would like to use a Hadoop
cluster to process it as fast as possible. I need to be able to maintain
some guaranteed max. processing time, for example under 3 minutes.

Does anybody have experience with using Hadoop in such a manner? I would
appreciate it if you could share your experience or give me pointers
to some articles or pages on the subject.

Vadim

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com
