Re: realtime hadoop

2008-06-25 Thread Daniel
2008/6/24 Konstantin Shvachko [EMAIL PROTECTED]: "Also HDFS might be critical, since to access your data you need to close the file." "Not anymore. Since 0.16 files are readable while being written to." Does this mean I can open some file as map input and the reduce output? So I can update the...

Re: realtime hadoop

2008-06-24 Thread Ian Holsman (Lists)
Matt Kent wrote: "We use Hadoop in a similar manner, to process batches of data in real-time every few minutes. However, we do substantial amounts of processing on that data, so we use Hadoop to distribute our computation. Unless you have a significant amount of work to be done, I wouldn't..."

Re: realtime hadoop

2008-06-24 Thread Ian Holsman (Lists)
Fernando Padilla wrote: "One use case I have a question about is using Hadoop to power a web search or other query. So the full job should be done in under a second, from start to finish." I don't think you should be using Hadoop to answer the results of a user's search query. You should be...

Re: realtime hadoop

2008-06-24 Thread Vadim Zaliva
Matt, How do you manage your tasks? Do you launch them periodically or keep them somehow running and feed them data? Vadim On Mon, Jun 23, 2008 at 21:54, Matt Kent [EMAIL PROTECTED] wrote: "We use Hadoop in a similar manner, to process batches of data in real-time every few minutes. However..."

Re: realtime hadoop

2008-06-24 Thread Chris K Wensel
On Jun 23, 2008, at 9:54 PM, Matt Kent wrote: "Unless you have a significant amount of work to be done, I wouldn't recommend using Hadoop because it's not worth the overhead of launching the jobs and moving the data around." I think part of the tradeoff is having a system that is resilient...

Re: realtime hadoop

2008-06-24 Thread Matt Kent
We wrote some custom tools that poll for new data and launch jobs periodically. Matt On Tue, 2008-06-24 at 09:27 -0700, Vadim Zaliva wrote: "Matt, How do you manage your tasks? Do you launch them periodically or keep them somehow running and feed them data? Vadim On Mon, Jun 23, 2008..."
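(Matt's custom tools aren't shown in the thread. As a rough sketch of the poll-and-launch pattern he describes, something like the following would work against the old org.apache.hadoop.mapred API that was current around 0.16/0.17. The /incoming and /processed paths, the one-minute poll interval, and the MyMapper/MyReducer classes are all made-up placeholders for the real processing.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class BatchPoller {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path incoming = new Path("/incoming");    // upstream writers drop files here (assumed layout)
    Path processed = new Path("/processed");  // consumed inputs are parked here

    while (true) {
      FileStatus[] ready = fs.listStatus(incoming);
      if (ready != null && ready.length > 0) {
        long batchId = System.currentTimeMillis();
        JobConf job = new JobConf(conf, BatchPoller.class);
        job.setJobName("batch-" + batchId);
        job.setMapperClass(MyMapper.class);    // placeholder: the real per-batch processing
        job.setReducerClass(MyReducer.class);  // placeholder: the real aggregation
        // (output key/value classes would also be set here to match the real reducer)
        FileInputFormat.setInputPaths(job, incoming);
        FileOutputFormat.setOutputPath(job, new Path("/out/batch-" + batchId));

        JobClient.runJob(job);                 // blocks until the job finishes

        // Move the consumed inputs aside so the next poll only sees new data.
        for (FileStatus f : ready) {
          fs.rename(f.getPath(), new Path(processed, f.getPath().getName()));
        }
      }
      Thread.sleep(60 * 1000);                 // poll once a minute
    }
  }
}

Because JobClient.runJob blocks until the job completes, at most one batch is in flight at a time, and renaming the consumed inputs keeps the next poll from reprocessing them.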

realtime hadoop

2008-06-23 Thread Vadim Zaliva
Hi! I am considering using Hadoop for (almost) realtime data processing. I have data coming in every second and I would like to use a Hadoop cluster to process it as fast as possible. I need to be able to maintain some guaranteed max. processing time, for example under 3 minutes. Does anybody have...

Re: realtime hadoop

2008-06-23 Thread Chris Anderson
Vadim, Depending on the nature of your data, CouchDB (http://couchdb.org) might be worth looking into. It speaks JSON natively, and has real-time map/reduce support. The 0.8.0 release is imminent (don't bother with 0.7.2), and the community is active. We're using it for something similar to what...

Re: realtime hadoop

2008-06-23 Thread Konstantin Shvachko
"Also HDFS might be critical, since to access your data you need to close the file." Not anymore. Since 0.16 files are readable while being written to. "...it as fast as possible. I need to be able to maintain some guaranteed max. processing time, for example under 3 minutes." It looks like you do...
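(As an illustration of what "readable while being written" enables, here is a minimal tail-style reader sketch: it periodically asks HDFS for the file's currently visible length and consumes any new bytes. How much of an in-progress file is visible to readers depends on the HDFS version and the writer's flush behaviour; the path argument and five-second poll interval are assumptions.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TailHdfsFile {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path(args[0]);   // a file another process is still writing
    long offset = 0;                 // how far we have read so far
    byte[] buf = new byte[64 * 1024];

    while (true) {
      long visible = fs.getFileStatus(file).getLen();   // length currently visible to readers
      if (visible > offset) {
        FSDataInputStream in = fs.open(file);
        in.seek(offset);
        while (offset < visible) {
          int n = in.read(buf, 0, (int) Math.min(buf.length, visible - offset));
          if (n < 0) break;
          System.out.write(buf, 0, n);   // hand the newly visible bytes to downstream processing
          offset += n;
        }
        System.out.flush();
        in.close();
      }
      Thread.sleep(5 * 1000);            // check for new data every few seconds
    }
  }
}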

Re: realtime hadoop

2008-06-23 Thread Ian Holsman (Lists)
Interesting. We are planning on using Hadoop to provide 'near' real-time log analysis. We plan on having files close every 5 minutes (1 per log machine, so 80 files every 5 minutes) and then have an m/r job to merge them into a single file that will get processed by other jobs later on. Do you...
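(The merge step Ian describes can be written as a near-identity map/reduce with a single reducer, which forces all ~80 per-machine files into one output file. A minimal sketch with the old mapred API follows; the input and output paths come from the command line, and note that the single reduce also sorts the lines, which matches time order only if each log line starts with a timestamp.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class LogMerge {
  // Emit each input line with a null value so the merged output contains just the raw lines.
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      out.collect(line, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(new Configuration(), LogMerge.class);
    job.setJobName("merge-5min-logs");

    job.setMapperClass(LineMapper.class);
    job.setReducerClass(IdentityReducer.class);  // values pass through untouched
    job.setNumReduceTasks(1);                    // single reducer => single merged output file

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));   // e.g. the ~80 per-machine files
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // merged output directory

    JobClient.runJob(job);
  }
}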

Re: realtime hadoop

2008-06-23 Thread Matt Kent
We use Hadoop in a similar manner, to process batches of data in real-time every few minutes. However, we do substantial amounts of processing on that data, so we use Hadoop to distribute our computation. Unless you have a significant amount of work to be done, I wouldn't recommend using Hadoop...

Re: realtime hadoop

2008-06-23 Thread Fernando Padilla
One use case I have a question about is using Hadoop to power a web search or other query. So the full job should be done in under a second, from start to finish. You know, you have a huge datastore, and you have to run a query against that, implemented as an MR query. Is there a way to...