SIGTERM 15 issue: Spark Streaming for ingesting huge text files using a custom Receiver
Hi all,

I have coded a custom receiver which receives Kafka messages. These Kafka messages contain FTP server credentials. The receiver opens each message and uses the FTP credentials in it to connect to the FTP server, then streams a huge text file (3.3 GB). The stream is read line by line using a BufferedReader and pushed into Spark Streaming via the receiver's "store" method. The Spark Streaming process receives all these lines and stores them in HDFS.

With this process I could ingest small files (50 MB), but I can't ingest the 3.3 GB file; I get a YARN SIGTERM 15 exception in the Spark Streaming process. I also tried reading that 3.3 GB file directly (without the custom receiver) in Spark Streaming using ssc.textFileStream, and everything works fine: the file ends up in HDFS.

Please let me know what I might have to do to get this working with the receiver. I know there are better ways to ingest the file, but we need to use Spark Streaming in our case. Thanks.
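For reference, here is a minimal sketch of the receiver pattern described above, assuming Apache Commons Net for the FTP connection; the host, user, pass, and path parameters are hypothetical stand-ins for the values parsed out of the Kafka message. Storing lines in batches via store(ArrayBuffer) rather than one store() call per line, and using a serialized storage level that can spill to disk, may reduce the receiver-side memory pressure that typically provokes the YARN kill (SIGTERM 15):

```scala
import java.io.{BufferedReader, InputStreamReader}
import org.apache.commons.net.ftp.{FTP, FTPClient}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import scala.collection.mutable.ArrayBuffer

// host/user/pass/path are hypothetical placeholders for the values
// carried in the Kafka message.
class FtpLineReceiver(host: String, user: String, pass: String, path: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  def onStart(): Unit = {
    new Thread("FTP Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* the reading thread exits once isStopped() is true */ }

  private def receive(): Unit = {
    val ftp = new FTPClient()
    try {
      ftp.connect(host)
      ftp.login(user, pass)
      ftp.enterLocalPassiveMode()
      ftp.setFileType(FTP.BINARY_FILE_TYPE)
      val reader = new BufferedReader(
        new InputStreamReader(ftp.retrieveFileStream(path)))
      var batch = new ArrayBuffer[String](1000)
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        batch += line
        if (batch.size >= 1000) {   // push in blocks, not line by line
          store(batch)
          batch = new ArrayBuffer[String](1000)
        }
        line = reader.readLine()
      }
      if (batch.nonEmpty) store(batch)
      reader.close()
      ftp.completePendingCommand()
      ftp.logout()
    } catch {
      case t: Throwable => restart("Error receiving FTP file", t)
    } finally {
      if (ftp.isConnected) ftp.disconnect()
    }
  }
}
```

This is only a sketch, not the original poster's code; the key design point is that each store(ArrayBuffer) hands a bounded block to Spark's block manager, so the receiver never holds more than one batch of lines in its own heap.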
Spark (1.2.0) submit fails with exception saying log directory already exists
Here is the error:

yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Log directory hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302 already exists!)

I am using Cloudera 5.3.2 with Spark 1.2.0. Any help is appreciated.

Thanks,
Jay
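One possible workaround, assuming the collision comes from the event log of an earlier run of the same application: Spark's spark.eventLog.overwrite setting lets the event-log writer replace an existing log directory instead of failing. A minimal sketch (the app name is a hypothetical placeholder):

```scala
import org.apache.spark.SparkConf

// Allow Spark to overwrite a stale event-log directory left behind
// by a previous run, instead of throwing "already exists".
val conf = new SparkConf()
  .setAppName("my-app") // hypothetical application name
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.overwrite", "true")
```

Alternatively, removing the stale application directory under hdfs://Sandbox/user/spark/applicationHistory/ before resubmitting should clear the error for a single run.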
Re: Setting up Spark/flume/? to Ingest 10TB from FTP
Thanks Marcelo. But our problem is a little more complicated. We have 10+ FTP sites that we will be transferring data from. The FTP server info, filename, and credentials all arrive via a Kafka message. So I want to read those Kafka messages, dynamically connect to the FTP site, download those fat files, and store them in HDFS. Hence, I was planning to use Spark Streaming with Kafka, or Flume with Kafka. But Flume runs on a JVM and may not be the best option, as the huge file will create memory issues. Please suggest some way to run this inside the cluster.

From: Marcelo Vanzin van...@cloudera.com
To: Varadhan, Jawahar varad...@yahoo.com
Cc: d...@spark.apache.org d...@spark.apache.org
Sent: Friday, August 14, 2015 3:23 PM
Subject: Re: Setting up Spark/flume/? to Ingest 10TB from FTP

Why do you need to use Spark or Flume for this? You can just use curl and hdfs:

curl ftp://blah | hdfs dfs -put - /blah

On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar varad...@yahoo.com.invalid wrote:
What is the best way to bring such a huge file from an FTP server into Hadoop to persist in HDFS? Since a single JVM process might run out of memory, I was wondering if I can use Spark or Flume to do this. Any help on this matter is appreciated. I would prefer an application/process running inside Hadoop which is doing this transfer.
Thanks.

-- 
Marcelo
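Worth noting: a single JVM does not have to hold the file in memory if it copies the FTP stream straight into HDFS in fixed-size chunks. Below is a minimal sketch of Marcelo's curl-pipe idea in JVM form, assuming Apache Commons Net and the Hadoop FileSystem API; host, user, pass, remotePath, and hdfsPath are hypothetical placeholders for the values parsed from the Kafka message:

```scala
import org.apache.commons.net.ftp.{FTP, FTPClient}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object FtpToHdfs {
  // host/user/pass/remotePath/hdfsPath are hypothetical placeholders
  // for the values carried in the Kafka message.
  def copy(host: String, user: String, pass: String,
           remotePath: String, hdfsPath: String): Unit = {
    val ftp = new FTPClient()
    ftp.connect(host)
    ftp.login(user, pass)
    ftp.enterLocalPassiveMode()
    ftp.setFileType(FTP.BINARY_FILE_TYPE)

    val fs  = FileSystem.get(new Configuration())
    val in  = ftp.retrieveFileStream(remotePath)
    val out = fs.create(new Path(hdfsPath))

    // Stream in 64 KB chunks; memory use stays constant regardless
    // of file size. The final 'true' closes both streams when done.
    IOUtils.copyBytes(in, out, 64 * 1024, true)

    ftp.completePendingCommand()
    ftp.logout()
    ftp.disconnect()
  }
}
```

Such a copier could run inside the cluster as the body of a Kafka-driven loop (e.g., invoked for each credentials message), which would satisfy the "process running inside Hadoop" requirement without routing the file contents through a receiver at all.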