with the right ftp client JAR on your classpath (I forget which), you can use ftp:// a a source for a hadoop FS operation. you may even be able to use it as an input for some spark (non streaming job directly.
On 14 Aug 2015, at 14:11, Varadhan, Jawahar <[email protected]<mailto:[email protected]>> wrote: Thanks Marcelo. But our problem is little complicated. We have 10+ ftp sites that we will be transferring data from. The ftp server info, filename, credentials are all coming via Kafka message. So, I want to read those kafka message and dynamically connect to the ftp site and download those fat files and store it in HDFS. And hence, I was planning to use Spark Streaming with Kafka or Flume with Kafka. But flume runs on a JVM and may not be the best option as the huge file will create memory issues. Please suggest someway to run it inside the cluster. ________________________________ From: Marcelo Vanzin <[email protected]<mailto:[email protected]>> To: "Varadhan, Jawahar" <[email protected]<mailto:[email protected]>> Cc: "[email protected]<mailto:[email protected]>" <[email protected]<mailto:[email protected]>> Sent: Friday, August 14, 2015 3:23 PM Subject: Re: Setting up Spark/flume/? to Ingest 10TB from FTP Why do you need to use Spark or Flume for this? You can just use curl and hdfs: curl ftp://blah<ftp://blah/> | hdfs dfs -put - /blah On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar <[email protected]<mailto:[email protected]>> wrote: What is the best way to bring such a huge file from a FTP server into Hadoop to persist in HDFS? Since a single jvm process might run out of memory, I was wondering if I can use Spark or Flume to do this. Any help on this matter is appreciated. I prefer a application/process running inside Hadoop which is doing this transfer Thanks. -- Marcelo
