You can find it here: http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
2013/4/17 Peyman Mohajerian <mohaj...@gmail.com>

> Apache Flume may help you with this use case. I read an article on
> Cloudera's site about using Flume to pull tweets, and the same idea may
> apply here.
>
> On Tue, Apr 16, 2013 at 9:26 PM, David Parks <davidpark...@yahoo.com> wrote:
>
>> For a set of jobs to run, I need to download about 100GB of data from the
>> internet (~1,000 files of varying sizes from ~10 different domains).
>>
>> Currently I do this with a simple Linux script, since it's easy to script
>> FTP, curl, and the like. But it's a mess to maintain a separate server for
>> that process. I'd rather it run in MapReduce: just give it a bill of
>> materials and let it go about downloading, retrying as necessary to deal
>> with iffy network conditions.
>>
>> I wrote one such job to crawl images we need to acquire, and it was the
>> royalest of royal pains. I wonder whether there are any good approaches to
>> this kind of data-acquisition task in Hadoop. It would certainly be nicer
>> to schedule a data-acquisition job ahead of the processing jobs in Oozie
>> than to try to maintain synchronization between the download processes and
>> the jobs.
>>
>> Ideas?

--
Marcos Ortiz Valmaseda, *Data-Driven Product Manager* at PDVSA
*Blog*: http://dataddict.wordpress.com/
*LinkedIn*: http://www.linkedin.com/in/marcosluis2186
*Twitter*: @marcosluis2186
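The "download with retries" step David describes could be sketched roughly as below. This is a minimal, hypothetical illustration of the retry wrapper only (the class and method names `RetryingFetch` and `fetchWithRetry` are made up, not from any Hadoop API); in an actual job, this logic would sit inside a `map()` call, with the bill-of-materials file as input, e.g. one URL per line via `NLineInputFormat` so downloads spread across mappers.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: retry wrapper for one download, the kind of logic
// a mapper would apply to each URL from the bill-of-materials input.
public class RetryingFetch {

    // Retries the supplied fetch up to maxAttempts times, sleeping a little
    // longer after each failure to ride out iffy network conditions.
    public static <T> T fetchWithRetry(Callable<T> fetch, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(100L * attempt); // simple linear backoff
            }
        }
        throw last; // all attempts exhausted; let the framework retry the task
    }

    public static void main(String[] args) throws Exception {
        // Simulated flaky download: fails twice, then succeeds on attempt 3.
        final int[] calls = {0};
        String result = fetchWithRetry(() -> {
            if (++calls[0] < 3) throw new java.io.IOException("iffy network");
            return "payload";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

One design note: letting each map task fail and be rescheduled is another retry layer for free, but in-mapper retries like the above avoid re-downloading every file assigned to a task when only one URL is flaky.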