Thanks, Marcelo. But our problem is a little more complicated.
We have 10+ FTP sites that we will be transferring data from. The FTP server
info, filename, and credentials all arrive via Kafka messages. So, I want to
read those Kafka messages, dynamically connect to the corresponding FTP site,
download those huge files, and store them in HDFS.
Hence, I was planning to use Spark Streaming with Kafka, or Flume with Kafka.
But Flume runs in a single JVM and may not be the best option, as the huge
files will create memory issues. Please suggest some way to run this inside
the cluster.
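Roughly, this is the kind of job I have in mind (just a sketch; the topic name
"ftp-jobs", the broker address, and the comma-separated message layout are
placeholders, and it assumes the Spark 1.x direct Kafka API):

import java.net.URL
import kafka.serializer.StringDecoder
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object FtpToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("ftp-to-hdfs"), Seconds(30))

    // Placeholder broker list and topic; real values would come from config.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val messages = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("ftp-jobs"))

    messages.foreachRDD { rdd =>
      // Each message is handled on an executor, so the transfers run
      // inside the cluster rather than in one driver JVM.
      rdd.foreach { case (_, msg) =>
        // Assumed message layout: host,user,password,remotePath,hdfsPath
        // (remotePath starting with "/").
        val Array(host, user, pass, remotePath, hdfsPath) = msg.split(",")
        val in = new URL(s"ftp://$user:$pass@$host$remotePath").openStream()
        val out = FileSystem.get(new Configuration()).create(new Path(hdfsPath))
        // copyBytes streams through a fixed 64K buffer and closes both
        // ends, so the file never has to fit in memory.
        IOUtils.copyBytes(in, out, 64 * 1024, true)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Since this only ever buffers 64K at a time, even a 10TB file shouldn't blow up
an executor. Does something along these lines look reasonable, or is there a
better way?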
From: Marcelo Vanzin <[email protected]>
To: "Varadhan, Jawahar" <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Friday, August 14, 2015 3:23 PM
Subject: Re: Setting up Spark/flume/? to Ingest 10TB from FTP
Why do you need to use Spark or Flume for this?
You can just use curl and hdfs:
curl ftp://blah | hdfs dfs -put - /blah
On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar <[email protected]>
wrote:
What is the best way to bring such a huge file from an FTP server into Hadoop
to persist in HDFS? Since a single JVM process might run out of memory, I was
wondering if I can use Spark or Flume to do this. Any help on this matter is
appreciated.
I would prefer an application/process running inside Hadoop to do this
transfer.
Thanks.
--
Marcelo