Re: realtime hadoop
Fernando Padilla wrote:

One use case I have a question about is using Hadoop to power a web search or other query, so the full job should be done in under a second, from start to finish.

I don't think you should be using Hadoop to answer a user's search query directly. You should be looking at things like SOLR (with the distributed search patch) or CloudDB/MySQL clusters. Some good research has also been done on this; see CRUSH by Sage Weil (www.ssrc.ucsc.edu/Papers/weil-sc06.pdf) or the work on Chord# for Wikipedia, called 'onscale' (http://onscale.de/onscaledb.html). Both would be better suited for OLTP-type operations, I would think.

Fernando also wrote:

You know, you have a huge datastore, and you have to run a query against it, implemented as an MR job. Is there a way to optimize that use case, where the code doesn't change but the input parameters of the job do? An MR job could then reuse the Java code, and even the same JVM, to avoid all of the startup costs. I bet Hadoop isn't built for that yet (and there are reasons enough not to support it yet), but maybe it's a use case that shouldn't be totally ignored. And if you think about it, this is similar to what HBase is doing, at least for the query-execution part: a dedicated MR daemon running on top of the Hadoop infrastructure, so you don't incur the cost of distributing and starting fresh MR/JVM processes across the cluster. Maybe someone would want to refactor this thought process a little bit.

Matt Kent wrote:

We use Hadoop in a similar manner, to process batches of data in real time every few minutes. However, we do substantial amounts of processing on that data, so we use Hadoop to distribute our computation. Unless you have a significant amount of work to be done, I wouldn't recommend using Hadoop, because it's not worth the overhead of launching the jobs and moving the data around.

Matt

On Tue, 2008-06-24 at 13:34 +1000, Ian Holsman (Lists) wrote:

Interesting. We are planning on using Hadoop to provide 'near' real-time log analysis. We plan on having files close every 5 minutes (one per log machine, so 80 files every 5 minutes) and then having a map/reduce job merge them into a single file that will get processed by other jobs later on. Do you think the namespace will explode? I hadn't been thinking of CloudDB; it might be an interesting alternative once it is a bit more stable. Regards, Ian

Stefan Groschupf wrote:

Hadoop might be the wrong technology for you. MapReduce is a batch-processing mechanism. HDFS might also be critical, since to access your data you first need to close the file; that means you may end up with many small files, a situation where HDFS is not very strong (the namespace is held in memory). HBase might be an interesting tool for you, and also ZooKeeper if you want to do something home-grown...

On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:

Hi! I am considering using Hadoop for (almost) realtime data processing. I have data coming in every second, and I would like to use a Hadoop cluster to process it as fast as possible. I need to be able to maintain some guaranteed maximum processing time, for example under 3 minutes. Does anybody have experience with using Hadoop in such a manner? I would appreciate it if you could share your experience or point me to some articles or pages on the subject.

Vadim
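On the parameter-reuse part of Fernando's question: the stock API does let a job keep its code fixed and vary only the configuration between runs, even though each run still pays the per-job JVM and scheduling overhead he mentions. Here is a minimal sketch against the org.apache.hadoop.mapred API; the class names and the property name "query.pattern" are made up for illustration, not anything Hadoop defines.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Sketch of a "reusable" query job: the mapper's behaviour is driven
    // entirely by job configuration, so repeated queries reuse the same
    // code and only the parameters change between submissions.
    public class GrepJob {

        public static class GrepMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, LongWritable, Text> {
            private String pattern;

            public void configure(JobConf conf) {
                // read the per-run parameter instead of hard-coding it
                pattern = conf.get("query.pattern", "");
            }

            public void map(LongWritable offset, Text line,
                            OutputCollector<LongWritable, Text> out,
                            Reporter reporter) throws IOException {
                if (line.toString().contains(pattern)) {
                    out.collect(offset, line);   // emit matching lines, grep -n style
                }
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(GrepJob.class);
            conf.setJobName("parameterised grep");
            conf.setMapperClass(GrepMapper.class);
            conf.setNumReduceTasks(0);             // map-only filter, no shuffle
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);
            conf.set("query.pattern", args[2]);    // the only thing that changes per run
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }

What this doesn't solve is exactly Fernando's dedicated-daemon idea: avoiding the fresh task JVMs per run would need support in the framework itself.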
Re: realtime hadoop
Matt Kent wrote:

We use Hadoop in a similar manner, to process batches of data in real time every few minutes. However, we do substantial amounts of processing on that data, so we use Hadoop to distribute our computation. Unless you have a significant amount of work to be done, I wouldn't recommend using Hadoop, because it's not worth the overhead of launching the jobs and moving the data around.

Matt

Thanks Matt. We are boiling the ocean with the data, so to speak, so that's cool. We are also looking at supplementing the m/r jobs with data coming in from Spread to get the 'instant' analysis parts of our feedback systems.

Regards,
Ian

On Tue, 2008-06-24 at 13:34 +1000, Ian Holsman (Lists) wrote:

Interesting. We are planning on using Hadoop to provide 'near' real-time log analysis. We plan on having files close every 5 minutes (one per log machine, so 80 files every 5 minutes) and then having a map/reduce job merge them into a single file that will get processed by other jobs later on. Do you think the namespace will explode? I hadn't been thinking of CloudDB; it might be an interesting alternative once it is a bit more stable. Regards, Ian

Stefan Groschupf wrote:

Hadoop might be the wrong technology for you. MapReduce is a batch-processing mechanism. HDFS might also be critical, since to access your data you first need to close the file; that means you may end up with many small files, a situation where HDFS is not very strong (the namespace is held in memory). HBase might be an interesting tool for you, and also ZooKeeper if you want to do something home-grown...

On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:

Hi! I am considering using Hadoop for (almost) realtime data processing. I have data coming in every second, and I would like to use a Hadoop cluster to process it as fast as possible. I need to be able to maintain some guaranteed maximum processing time, for example under 3 minutes. Does anybody have experience with using Hadoop in such a manner? I would appreciate it if you could share your experience or point me to some articles or pages on the subject.

Vadim
Re: realtime hadoop
Interesting. We are planning on using Hadoop to provide 'near' real-time log analysis. We plan on having files close every 5 minutes (one per log machine, so 80 files every 5 minutes) and then having a map/reduce job merge them into a single file that will get processed by other jobs later on. Do you think the namespace will explode?

I hadn't been thinking of CloudDB; it might be an interesting alternative once it is a bit more stable.

Regards,
Ian

Stefan Groschupf wrote:

Hadoop might be the wrong technology for you. MapReduce is a batch-processing mechanism. HDFS might also be critical, since to access your data you first need to close the file; that means you may end up with many small files, a situation where HDFS is not very strong (the namespace is held in memory). HBase might be an interesting tool for you, and also ZooKeeper if you want to do something home-grown...

On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:

Hi! I am considering using Hadoop for (almost) realtime data processing. I have data coming in every second, and I would like to use a Hadoop cluster to process it as fast as possible. I need to be able to maintain some guaranteed maximum processing time, for example under 3 minutes. Does anybody have experience with using Hadoop in such a manner? I would appreciate it if you could share your experience or point me to some articles or pages on the subject.

Vadim

~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com
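For what it's worth, the 5-minute merge Ian describes can be written as a small job with a single reducer, which also answers Stefan's small-files concern by collapsing the ~80 fragments into one file per window. A minimal sketch, with hypothetical paths; note that the shuffle sorts lines, so timestamp-prefixed log lines come out roughly in time order:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    // Merge the ~80 five-minute log fragments into one file so the
    // NameNode namespace doesn't fill up with small files.
    public class MergeLogs {

        public static class LineMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, NullWritable> {
            public void map(LongWritable offset, Text line,
                            OutputCollector<Text, NullWritable> out,
                            Reporter reporter) throws IOException {
                // key on the whole line; duplicate lines survive because
                // the identity reducer emits one record per value
                out.collect(line, NullWritable.get());
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(MergeLogs.class);
            conf.setJobName("merge 5-minute logs");
            conf.setMapperClass(LineMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setNumReduceTasks(1);   // one reducer yields one output file
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(NullWritable.class);  // TextOutputFormat then writes bare lines
            FileInputFormat.setInputPaths(conf, new Path("/logs/incoming/2008-06-24-1335"));
            FileOutputFormat.setOutputPath(conf, new Path("/logs/merged/2008-06-24-1335"));
            JobClient.runJob(conf);
        }
    }

The single reducer is the serialization point, but since this is just a copy (no real computation) it should keep up with one window's worth of logs every 5 minutes.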
data locality in HDFS
Hi. I want to run a distributed cluster where I have, say, 20 machines/slaves in 3 separate data centers that all belong to the same cluster. Ideally I would like the other machines in each data center to be able to upload files (Apache log files, in this case) onto the local slaves, and then have map/reduce tasks do their magic without having to move data until the reduce phase, where the amount of data will be smaller.

Does Hadoop have this functionality? How do people handle multi-datacenter logging with Hadoop in this case? Do you just copy the data into a central location?

Regards,
Ian
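Two partial answers, hedged from memory of the block-placement rules. Within one data center you get the locality almost for free: if the HDFS writer runs on a machine that is itself a DataNode, the first replica is placed on that node's local disks, and the JobTracker then tries to schedule map tasks where the blocks live; so copy the log onto the slave first and run the upload there. Across data centers, as far as I know Hadoop assumes the whole cluster is well-connected and doesn't handle WAN splits well; the approach I've usually seen discussed is one cluster per data center, with distcp (or plain copying) into a central cluster for the global jobs. A minimal uploader sketch to run on each slave; the paths and date-stamped directory name are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Run this on the slave itself (e.g. after the log machine scp's the
    // file over): because the writer is a DataNode, HDFS puts the first
    // replica on local disk, so map tasks over it can be data-local.
    public class PushLog {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up hadoop-site.xml from the classpath
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path(args[0]),                   // e.g. /var/spool/logs/access.log.0
                                 new Path("/logs/incoming/" + args[1]));  // e.g. web07-2008-06-24-1335
        }
    }

The later replicas still go to other nodes (and racks, if you configure a topology script), so only the first copy is guaranteed local; that's the scheduling hint, not a pinning mechanism.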