I think you should actually use the Java-based MapReduce here. As has been noted, these will be network-bound calls. And if you're trying to make a lot of them, my experience is that individual calls are slow. 10,000 GET requests could each take a second or two, especially if they involve DNS lookups. But they can be overlapped.
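To make the "overlapped" point concrete, here's a rough, self-contained sketch of what a small thread pool buys you. Nothing in it is a Hadoop API: OverlapSketch, fakeFetch and fetchAll are made-up names, and fakeFetch just sleeps as a stand-in for a slow blocking GET, so you can run it without touching a real server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OverlapSketch {
    // Hypothetical stand-in for a blocking HTTP GET: sleeps to mimic network latency.
    static String fakeFetch(String url, long latencyMillis) throws InterruptedException {
        Thread.sleep(latencyMillis);
        return "body of " + url;
    }

    // Submits every request to a fixed-size pool so the waits overlap,
    // then collects the results in the original input order.
    static List<String> fetchAll(List<String> urls, int threads, long latencyMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fakeFetch(url, latencyMillis);
                pending.add(pool.submit(task));
            }
            List<String> bodies = new ArrayList<>();
            for (Future<String> f : pending) {
                bodies.add(f.get()); // blocks until that particular request is done
            }
            return bodies;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            urls.add("http://example.com/item/" + i); // illustrative URLs only
        }
        long start = System.nanoTime();
        List<String> bodies = fetchAll(urls, 10, 500);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(bodies.size() + " responses in about " + elapsedMs + " ms");
    }
}
```

With 10 threads and 20 simulated requests, the run should take roughly two rounds of waiting rather than twenty back-to-back waits. The multithreaded map runner applies the same idea inside each map task.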
If you're using the old API, consider using the multithreaded map runner for this (org.apache.hadoop.mapred.lib.MultithreadedMapRunner):

    JobConf job = new JobConf();
    job.setMapRunnerClass(MultithreadedMapRunner.class);

If you're using the new API, there's an analogous org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper: set it as the job's mapper class and point it at your own Mapper with MultithreadedMapper.setMapperClass(). Either way, this will allow you to overlap all those requests and get much faster throughput. (Each map task starts a thread pool of a few threads, which are handed individual map inputs in an overlapped fashion. With MultithreadedMapRunner, the same instance of your Mapper class is used across all threads, so make sure to protect any instance variables; with MultithreadedMapper, keep any shared or static state thread-safe as well.)

For maximum efficiency, sort all your different URLs by hostname first, so that each split of the input contains all the requests to the same server. This will make your DNS caching much more efficient (rather than having all your mappers try to look up the same set of hosts).

Of course, you want to be careful with something like this. A big Hadoop cluster can easily bring a web server to its knees if you're running too many map tasks in parallel against the same target :) You may want to actually do some rate-limiting of requests to the same node... but how to do that easily is a separate discussion.

- Aaron

On Sun, Mar 7, 2010 at 9:46 AM, Erez Katz <erez_k...@yahoo.com> wrote:

> It should be very easy. If you just have, say, a list of URLs as input...
> It is not even a map-reduce task... just a map task (with no reduce; I don't
> see where you would reduce on a key in this scenario).
> Look for map-only tasks in the streaming documentation.
>
> Just pick your favorite scripting language, one that keeps reading URLs from
> the standard input stream line by line and outputs the result to the
> standard output.
>
> a la Python:
>
> import urllib, sys
>
> for line in sys.stdin:
>     url = line.strip()
>     x = urllib.urlopen(url)
>     print x.read()
>     x.close()
>
> That's all folks.
>
> No real reason to use Java/C++ here, most of the time will be spent on
> network IO.
>
> Cheers,
>
> Erez Katz
>
> --- On Sat, 3/6/10, Phil McCarthy <philmccar...@gmail.com> wrote:
>
> > From: Phil McCarthy <philmccar...@gmail.com>
> > Subject: Parallelizing HTTP calls with MapReduce
> > To: mapreduce-user@hadoop.apache.org
> > Date: Saturday, March 6, 2010, 9:29 AM
> >
> > Hi,
> >
> > I'm new to Hadoop, and I'm trying to figure out the best way to use it
> > with EC2 to make a large number of calls to a web API, and then process
> > and store the results. I'm completely new to Hadoop, so I'm wondering
> > what's the best high-level approach, in terms of using MapReduce to
> > parallelize the process. The calls will be regular HTTP requests, and
> > the URLs follow a known format, so can be generated easily.
> >
> > This seems like it'd be a pretty common type of task, so apologies if
> > I've missed something obvious in the docs etc.
> >
> > Cheers,
> > Phil McCarthy
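
P.S. On the "sort your URLs by hostname" suggestion in my reply: if the URL list is small enough to sort in memory before you upload it, something like the sketch below would do. HostSortSketch, hostOf and sortByHost are names I made up for illustration, and the example.com URLs are placeholders. It only groups requests by host; it doesn't do any rate-limiting.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class HostSortSketch {
    // Returns the lower-cased host of a URL, or "" if the line has no parseable host.
    static String hostOf(String url) {
        try {
            String host = URI.create(url.trim()).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return ""; // malformed lines sort to the front instead of failing the run
        }
    }

    // Returns a copy of the list ordered by host, then by full URL,
    // so requests to the same server end up next to each other.
    static List<String> sortByHost(List<String> urls) {
        Comparator<String> byHost = Comparator.comparing(HostSortSketch::hostOf);
        Comparator<String> byHostThenUrl = byHost.thenComparing(Comparator.<String>naturalOrder());
        List<String> sorted = new ArrayList<>(urls);
        sorted.sort(byHostThenUrl);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://b.example.com/2",
                "http://a.example.com/1",
                "http://b.example.com/1",
                "http://a.example.com/0");
        for (String url : sortByHost(urls)) {
            System.out.println(url);
        }
    }
}
```

Write the sorted list out one URL per line and use that as the job input. Consecutive lines then mostly hit the same server, which is what lets each task reuse its cached DNS lookups.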