I am a beginner, but this seems similar to what I intend. The data source will be external FTP or S3 storage.
"Spark Streaming can read data from HDFS <http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html> ,Flume <http://flume.apache.org/>, Kafka <http://kafka.apache.org/>, Twitter <https://dev.twitter.com/> and ZeroMQ <http://zeromq.org/>. You can also define your own custom data sources." Thanks, Mohan On Wed, Jul 9, 2014 at 2:09 PM, Stanley Shi <s...@gopivotal.com> wrote: > There's a DistCP utility for this kind of purpose; > Also there's "Spring XD" there, but I am not sure if you want to use it. > > Regards, > *Stanley Shi,* > > > > On Mon, Jul 7, 2014 at 10:02 PM, Mohan Radhakrishnan < > radhakrishnan.mo...@gmail.com> wrote: > >> Hi, >> We used a commercial FT and scheduler tool in clustered mode. >> This was a traditional active-active cluster that supported multiple >> protocols like FTPS etc. >> >> Now I am interested in evaluating a Distributed way of crawling FTP >> sites and downloading files using Hadoop. I thought since we have to >> process thousands of files Hadoop jobs can do it. >> >> Are Hadoop jobs used for this type of file transfers ? >> >> Moreover there is a requirement for a scheduler also. What is the >> recommendation of the forum ? >> >> >> Thanks, >> Mohan >> > >