For a set of jobs to run, I need to download about 100 GB of data from the internet (~1000 files of varying sizes from ~10 different domains).
Currently I do this with a simple Linux shell script, since it's easy to script FTP, curl, and the like. But maintaining a separate server for that process is a mess. I'd rather it run in MapReduce: just give it a bill of materials and let it go about downloading, retrying as necessary to deal with iffy network conditions. I wrote one such job to crawl images we need to acquire, and it was the royalest of royal pains.

I wonder if there are any good approaches to this kind of data-acquisition task in Hadoop. It would certainly be nicer to schedule a data-acquisition job ahead of the processing jobs in Oozie than to try to keep the download processes and the downstream jobs in sync. For concreteness, the sketch below is roughly the shape of thing I have in mind. Ideas?
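(A rough sketch, not working code: it assumes the bill of materials is a plain text file with one URL per line, split across mappers with `NLineInputFormat`; the `/data/incoming` target directory and the retry/backoff numbers are made up.)

```java
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: input is the bill of materials, one URL per line.
// NLineInputFormat hands each mapper a slice of URLs so downloads run in parallel.
public class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int MAX_RETRIES = 3;  // tune for your network

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        String url = value.toString().trim();
        if (url.isEmpty()) return;

        FileSystem fs = FileSystem.get(context.getConfiguration());
        // Hypothetical layout: name each file after the last path segment of its URL.
        Path dest = new Path("/data/incoming", url.substring(url.lastIndexOf('/') + 1));

        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try (InputStream in = new URL(url).openStream()) {
                // Stream straight into HDFS; copyBytes closes both streams when done.
                IOUtils.copyBytes(in, fs.create(dest, true), 64 * 1024, true);
                context.write(new Text(url), new Text("OK"));
                return;
            } catch (java.io.IOException e) {
                if (attempt == MAX_RETRIES) {
                    // Record the failure so a follow-up pass can retry just the misses.
                    context.write(new Text(url), new Text("FAILED: " + e.getMessage()));
                } else {
                    Thread.sleep(1000L * attempt);  // crude backoff for flaky connections
                }
            }
        }
    }
}
```

The output (URL, OK/FAILED pairs) would double as a manifest, so an Oozie workflow could loop the failures back in as the next run's bill of materials.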