For a set of jobs to run, I need to download about 100 GB of data from the internet (~1000 files of varying sizes from ~10 different domains).
Currently I do this with a simple Linux shell script, since it's easy to script FTP, curl, and the like. But maintaining a separate server for that process is a mess. I'd rather it run in MapReduce: just give it a bill of materials and let it go about downloading, retrying as necessary to deal with iffy network conditions. I wrote one such job to crawl images we need to acquire, and it was the royalest of royal pains.

I wonder if there are any good approaches to this kind of data-acquisition task in Hadoop. It would certainly be nicer to schedule a data-acquisition job ahead of the processing jobs in Oozie than to try to keep the download processes and the downstream jobs in sync. For concreteness, the sketch below is roughly the shape of thing I have in mind. Ideas?
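(A rough sketch, not working code: it assumes the bill of materials is a plain text file with one URL per line, split across mappers with `NLineInputFormat`; the `/data/incoming` target directory and the retry/backoff numbers are made up.)

```java
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: input is the bill of materials, one URL per line.
// NLineInputFormat hands each mapper a slice of URLs so downloads run in parallel.
public class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int MAX_RETRIES = 3;  // tune for your network

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        String url = value.toString().trim();
        if (url.isEmpty()) return;

        FileSystem fs = FileSystem.get(context.getConfiguration());
        // Hypothetical layout: name each file after the last path segment of its URL.
        Path dest = new Path("/data/incoming", url.substring(url.lastIndexOf('/') + 1));

        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try (InputStream in = new URL(url).openStream()) {
                // Stream straight into HDFS; copyBytes closes both streams when done.
                IOUtils.copyBytes(in, fs.create(dest, true), 64 * 1024, true);
                context.write(new Text(url), new Text("OK"));
                return;
            } catch (java.io.IOException e) {
                if (attempt == MAX_RETRIES) {
                    // Record the failure so a follow-up pass can retry just the misses.
                    context.write(new Text(url), new Text("FAILED: " + e.getMessage()));
                } else {
                    Thread.sleep(1000L * attempt);  // crude backoff for flaky connections
                }
            }
        }
    }
}
```

The output (URL, OK/FAILED pairs) would double as a manifest, so an Oozie workflow could loop the failures back in as the next run's bill of materials.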