Hi Shailesh,

Spark just leverages Hadoop's FileOutputFormat to write out the RDD you are
saving.

This is really a Hadoop OutputFormat restriction: the directory being written
into must not already exist. The idea is that a Hadoop job should not be able
to overwrite the results of a previous job, so the check that the directory
does not exist is enforced before any output is written.

The easiest way to get around this may be to write the results from each
Spark app to a newly named directory, then run a simple script on an interval
to merge the data from multiple HDFS directories into one.
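
For the unique naming, the batch time that foreachRDD hands to your function
makes a convenient suffix. A minimal sketch against your code (outputFolderPath
is from your snippet; the rest is the standard Spark Streaming Java API):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Time;

suspectedStream.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
        @Override
        public Void call(JavaRDD<String> rdd, Time time) throws Exception {
                // A fresh directory per micro-batch, so the OutputFormat
                // "directory must not exist" check always passes.
                rdd.saveAsTextFile(outputFolderPath + "-" + time.milliseconds());
                return null;
        }
});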

These HDFS commands will do something like a directory merge: -cat
concatenates the matching files, and -copyFromLocal with "-" as the source
reads that stream from stdin and writes it to a single HDFS file (adjust the
glob so it matches the part files, e.g. /folderpath/folder*/part-*):

hdfs dfs -cat /folderpath/folder* | hdfs dfs -copyFromLocal - /newfolderpath/file
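
If you'd rather do the merge from Java, here is a rough sketch using the
Hadoop FileSystem API (the /data/output-* glob and /data/merged target are
made-up names; adjust them to your layout). It moves the part files with
renames instead of copying bytes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergeBatchDirs {
        public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                Path target = new Path("/data/merged");   // hypothetical target dir
                fs.mkdirs(target);

                // Every per-batch directory the streaming job wrote.
                FileStatus[] dirs = fs.globStatus(new Path("/data/output-*"));
                if (dirs == null) return;

                for (FileStatus dir : dirs) {
                        for (FileStatus part : fs.listStatus(dir.getPath())) {
                                // Skip _SUCCESS markers; only move real part files.
                                if (!part.getPath().getName().startsWith("part-")) continue;
                                // Prefix with the batch dir name so part-00000 names don't collide.
                                fs.rename(part.getPath(), new Path(target,
                                        dir.getPath().getName() + "-" + part.getPath().getName()));
                        }
                        fs.delete(dir.getPath(), true);   // drop the emptied batch dir
                }
        }
}

Since an HDFS rename is just a namenode metadata operation, this avoids
rewriting the data the way the -cat | -copyFromLocal pipeline does.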

See this StackOverflow discussion for ways to do the merge with Pig or Bash
scripting as well:
https://stackoverflow.com/questions/19979896/combine-map-reduce-output-from-different-folders-into-single-folder


Sameer F.
Client Services @ Databricks

On Tue, Oct 21, 2014 at 3:51 PM, Shailesh Birari <sbir...@wynyardgroup.com>
wrote:

> Hello,
>
> Spark 1.1.0, Hadoop 2.4.1
>
> I have written a Spark Streaming application, and I am getting a
> FileAlreadyExistsException from rdd.saveAsTextFile(outputFolderPath).
> Here, briefly, is what I am trying to do.
> My application creates a text file stream using the Java streaming context.
> The input file is on HDFS.
>
>         JavaDStream<String> textStream = ssc.textFileStream(InputFile);
>
> Then it compares each line of the input stream with some data and filters
> it. The filtered data is stored in a JavaDStream<String>.
>
> JavaDStream<String> suspectedStream = textStream.flatMap(
>         new FlatMapFunction<String, String>() {
>                 @Override
>                 public Iterable<String> call(String line) throws Exception {
>                         List<String> filteredList = new ArrayList<String>();
>                         // doing filter job
>                         return filteredList;
>                 }
>         });
>
> And I am storing this filteredList in HDFS as:
>
> suspectedStream.foreach(new Function<JavaRDD<String>, Void>() {
>         @Override
>         public Void call(JavaRDD<String> rdd) throws Exception {
>                 rdd.saveAsTextFile(outputFolderPath);
>                 return null;
>         }
> });
>
>
> But with this I am receiving
> org.apache.hadoop.mapred.FileAlreadyExistsException.
>
> I tried appending a random number to outputFolderPath and it works, but my
> requirement is to collect all output in one directory.
>
> Can you please suggest a way to get rid of this exception?
>
> Thanks,
>   Shailesh
>
