Phil, what you are describing is close to what Nutch already does. It's worth taking a look at it: all this coding is non-trivial, and you could save yourself a lot of work and debugging.
Mark

On Sun, Mar 7, 2010 at 8:30 AM, Zak Stone <zst...@gmail.com> wrote:
> Hi Phil,
>
> If you treat each HTTP request as a Hadoop task and the individual
> HTTP responses are small, you may find that the latency of the web
> service leaves most of your Hadoop processes idle most of the time.
>
> To avoid this problem, you can let each mapper make many HTTP requests
> in parallel, either using asynchronous programming or using threads.
> For example, each mapper could load batches of URLs from Hadoop into
> an internal work queue, and 100 threads per mapper could pull URLs off
> the work queue and push the HTTP responses onto another in-memory
> output queue. A separate thread could then steadily take items from
> the output queue and stream them back to Hadoop as key-value pairs.
>
> Hope that helps,
> Zak
>
>
> On Sun, Mar 7, 2010 at 7:54 AM, Phil McCarthy <philmccar...@gmail.com> wrote:
> > Hi,
> >
> > I'm new to Hadoop, and I'm trying to figure out the best way to use it
> > to parallelize a large number of calls to a web API, and then process
> > and store the results.
> >
> > The calls will be regular HTTP requests, and the URLs follow a known
> > format, so can be generated easily. I'd like to understand how to
> > apply the MapReduce pattern to this task – should I have one mapper
> > generating URLs, and another making the HTTP calls and mapping request
> > URLs to their response documents, for example?
> >
> > Any links to sample code, examples etc. would be great.
> >
> > Cheers,
> > Phil
> >
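If you do end up rolling your own, a rough (untested) sketch of the pattern Zak describes might look like the mapper below. It assumes each input line is one URL, picks an arbitrary pool size of 100 fetcher threads, and simplifies the two-queue design by draining a CompletionService from the single mapper thread in cleanup(), since Context.write() is not thread-safe:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class UrlFetchMapper extends Mapper<LongWritable, Text, Text, Text> {

        private static final int NUM_THREADS = 100;  // fetcher threads per mapper
        private final List<String> urls = new ArrayList<String>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // Each input line is assumed to be one URL.
            urls.add(value.toString().trim());
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(NUM_THREADS);
            CompletionService<String[]> completion =
                new ExecutorCompletionService<String[]>(pool);

            // Submit one fetch task per URL; the pool bounds the parallelism.
            for (final String url : urls) {
                completion.submit(new Callable<String[]>() {
                    public String[] call() throws Exception {
                        return new String[] { url, fetch(url) };
                    }
                });
            }

            // Drain results as they complete and emit (url, response) pairs
            // from this single thread, since Context.write() isn't thread-safe.
            for (int i = 0; i < urls.size(); i++) {
                try {
                    String[] result = completion.take().get();
                    context.write(new Text(result[0]), new Text(result[1]));
                } catch (ExecutionException e) {
                    context.getCounter("fetch", "failures").increment(1);
                }
            }
            pool.shutdown();
        }

        // Naive HTTP GET; in practice you'd want timeouts and error handling.
        private String fetch(String url) throws IOException {
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()));
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            in.close();
            return body.toString();
        }
    }

The input to such a job would just be text files of generated URLs, so a separate URL-generating mapper isn't really needed. But again, Nutch already handles politeness, retries, and fetch scheduling for you.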