Phil, what you are describing is close to what Nutch already does. It's worth taking a look at it: all this coding is non-trivial, and you could save yourself a lot of work and debugging.
Mark

On Sun, Mar 7, 2010 at 8:30 AM, Zak Stone <zst...@gmail.com> wrote:
> Hi Phil,
>
> If you treat each HTTP request as a Hadoop task and the individual
> HTTP responses are small, you may find that the latency of the web
> service leaves most of your Hadoop processes idle most of the time.
>
> To avoid this problem, you can let each mapper make many HTTP requests
> in parallel, either using asynchronous programming or using threads.
> For example, each mapper could load batches of URLs from Hadoop into
> an internal work queue, and 100 threads per mapper could pull URLs off
> the work queue and push the HTTP responses onto another in-memory
> output queue. A separate thread could then steadily take items from
> the output queue and stream them back to Hadoop as key-value pairs.
>
> Hope that helps,
> Zak
>
>
> On Sun, Mar 7, 2010 at 7:54 AM, Phil McCarthy <philmccar...@gmail.com> wrote:
> > Hi,
> >
> > I'm new to Hadoop, and I'm trying to figure out the best way to use it
> > to parallelize a large number of calls to a web API, and then process
> > and store the results.
> >
> > The calls will be regular HTTP requests, and the URLs follow a known
> > format, so can be generated easily. I'd like to understand how to
> > apply the MapReduce pattern to this task – should I have one mapper
> > generating URLs, and another making the HTTP calls and mapping request
> > URLs to their response documents, for example?
> >
> > Any links to sample code, examples etc. would be great.
> >
> > Cheers,
> > Phil
> >
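If you do end up rolling your own, a rough (untested) sketch of the pattern Zak describes might look like the mapper below. It assumes each input line is one URL, picks an arbitrary pool size of 100 fetcher threads, and simplifies the two-queue design by draining a CompletionService from the single mapper thread in cleanup(), since Context.write() is not thread-safe:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class UrlFetchMapper extends Mapper<LongWritable, Text, Text, Text> {

        private static final int NUM_THREADS = 100;  // fetcher threads per mapper
        private final List<String> urls = new ArrayList<String>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // Each input line is assumed to be one URL.
            urls.add(value.toString().trim());
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(NUM_THREADS);
            CompletionService<String[]> completion =
                new ExecutorCompletionService<String[]>(pool);

            // Submit one fetch task per URL; the pool bounds the parallelism.
            for (final String url : urls) {
                completion.submit(new Callable<String[]>() {
                    public String[] call() throws Exception {
                        return new String[] { url, fetch(url) };
                    }
                });
            }

            // Drain results as they complete and emit (url, response) pairs
            // from this single thread, since Context.write() isn't thread-safe.
            for (int i = 0; i < urls.size(); i++) {
                try {
                    String[] result = completion.take().get();
                    context.write(new Text(result[0]), new Text(result[1]));
                } catch (ExecutionException e) {
                    context.getCounter("fetch", "failures").increment(1);
                }
            }
            pool.shutdown();
        }

        // Naive HTTP GET; in practice you'd want timeouts and error handling.
        private String fetch(String url) throws IOException {
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()));
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            in.close();
            return body.toString();
        }
    }

The input to such a job would just be text files of generated URLs, so a separate URL-generating mapper isn't really needed. But again, Nutch already handles politeness, retries, and fetch scheduling for you.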