SIGTERM 15 Issue : Spark Streaming for ingesting huge text files using custom Receiver

2015-09-11 Thread Varadhan, Jawahar
Hi all, I have coded a custom receiver which receives Kafka messages. These 
Kafka messages contain FTP server credentials. The receiver opens each message 
and uses the FTP credentials in it to connect to the FTP server. It then 
streams a huge text file (3.3 GB), reads the stream line by line with a 
BufferedReader, and pushes each line to Spark Streaming via the receiver's 
"store" method. The Spark Streaming process receives all these lines and 
stores them in HDFS.
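
Roughly, the receiver does something like this (a minimal sketch, not the 
actual code; the class name, the FTP URL format, and reading the FTP stream 
through java.net.URL are assumptions):

  import java.io.{BufferedReader, InputStreamReader}
  import java.net.URL
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class FtpLineReceiver(ftpUrl: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    def onStart(): Unit = {
      // Read on a background thread so onStart() returns immediately.
      new Thread("FTP Line Receiver") {
        override def run(): Unit = receive()
      }.start()
    }

    def onStop(): Unit = { /* reading thread exits once the stream is drained */ }

    private def receive(): Unit = {
      // e.g. ftp://user:password@host/path/huge-file.txt
      val reader = new BufferedReader(
        new InputStreamReader(new URL(ftpUrl).openStream()))
      try {
        var line = reader.readLine()
        while (!isStopped() && line != null) {
          store(line)                 // hand each line to Spark Streaming
          line = reader.readLine()
        }
      } finally {
        reader.close()
      }
    }
  }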
With this process I can ingest small files (50 MB), but not this 3.3 GB file: 
the Spark Streaming process gets killed by YARN with SIGTERM 15. I also tried 
reading that 3.3 GB file directly (without the custom receiver) in Spark 
Streaming using ssc.textFileStream, and everything works fine; the file ends 
up in HDFS.
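
For reference, the working textFileStream path is essentially the following 
(directory names and batch interval are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("TextFileStreamIngest")
  val ssc = new StreamingContext(conf, Seconds(30))

  // Watch a directory for new files and read each one as lines.
  val lines = ssc.textFileStream("hdfs:///landing/incoming")

  // Write each batch back out to HDFS as text.
  lines.foreachRDD { (rdd, time) =>
    rdd.saveAsTextFile(s"hdfs:///data/ingested/batch-${time.milliseconds}")
  }

  ssc.start()
  ssc.awaitTermination()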
Please let me know what I need to do to get this working with the receiver. I 
know there are better ways to ingest the file, but we need to use Spark 
Streaming in our case.
Thanks.

Spark (1.2.0) submit fails with exception saying log directory already exists

2015-08-25 Thread Varadhan, Jawahar
Here is the error:
yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User 
class threw exception: Log directory 
hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302 
already exists!)
I am using Cloudera 5.3.2 with Spark 1.2.0.
Any help is appreciated.
Thanks,
Jay



Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Varadhan, Jawahar
Thanks Marcelo. But our problem is a little more complicated.

We have 10+ FTP sites that we will be transferring data from. The FTP server 
info, filename, and credentials all arrive in a Kafka message. So I want to 
read those Kafka messages, dynamically connect to each FTP site, download 
those fat files, and store them in HDFS.
Hence I was planning to use Spark Streaming with Kafka, or Flume with Kafka. 
But Flume runs in a single JVM and may not be the best option, as the huge 
file will create memory issues. Please suggest some way to run this inside the 
cluster.
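
Something along these lines is what I had in mind (only a rough sketch; the 
Kafka topic, the "ftpUrl|hdfsPath" message format, hosts, and paths are all 
made up for illustration):

  import java.net.URL
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.hadoop.io.IOUtils
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("FtpToHdfsViaKafka")
  val ssc = new StreamingContext(conf, Seconds(60))

  // One Kafka message per file, e.g. "ftp://user:pass@host/file.txt|/data/raw/file.txt"
  val messages = KafkaUtils.createStream(
    ssc, "zk-host:2181", "ftp-ingest", Map("ftp-files" -> 1)).map(_._2)

  messages.foreachRDD { rdd =>
    rdd.foreach { msg =>
      val Array(ftpUrl, target) = msg.split('|')
      val fs  = FileSystem.get(new Configuration())
      val in  = new URL(ftpUrl).openStream()
      val out = fs.create(new Path(target))
      // Copy in small buffers so the file never has to fit in memory.
      IOUtils.copyBytes(in, out, 64 * 1024, true)
    }
  }

  ssc.start()
  ssc.awaitTermination()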

 

 From: Marcelo Vanzin van...@cloudera.com
 To: Varadhan, Jawahar varad...@yahoo.com 
Cc: d...@spark.apache.org d...@spark.apache.org 
 Sent: Friday, August 14, 2015 3:23 PM
 Subject: Re: Setting up Spark/flume/? to Ingest 10TB from FTP
   
Why do you need to use Spark or Flume for this?
You can just use curl and hdfs:
  curl ftp://blah | hdfs dfs -put - /blah



On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar varad...@yahoo.com.invalid 
wrote:

What is the best way to bring such a huge file from an FTP server into Hadoop 
and persist it in HDFS? Since a single JVM process might run out of memory, I 
was wondering if I could use Spark or Flume to do this. Any help on this 
matter is appreciated.
I would prefer an application/process running inside Hadoop to do this 
transfer.
Thanks.



-- 
Marcelo