Hi Shailesh,

Spark just uses Hadoop's FileOutputFormat to write out the RDD you are saving.
This is really a Hadoop OutputFormat limitation: it requires that the directory it is writing into does not already exist. The idea is that a Hadoop job should not be able to overwrite the results of a previous job, so it enforces that the output directory must not exist.

The easiest way around this may be to write the results of each Spark app to a newly named directory, then on an interval run a simple script to merge the data from multiple HDFS directories into one. This HDFS command will do such a directory merge:

hdfs dfs -cat /folderpath/folder* | hdfs dfs -copyFromLocal - /newfolderpath/file

See this StackOverflow discussion for a way to do it with Pig and Bash scripting as well:
https://stackoverflow.com/questions/19979896/combine-map-reduce-output-from-different-folders-into-single-folder

Sameer F.
Client Services @ Databricks

On Tue, Oct 21, 2014 at 3:51 PM, Shailesh Birari <sbir...@wynyardgroup.com> wrote:

> Hello,
>
> Spark 1.1.0, Hadoop 2.4.1
>
> I have written a Spark Streaming application, and I am getting a
> FileAlreadyExistsException from rdd.saveAsTextFile(outputFolderPath).
> Here, briefly, is what I am trying to do.
>
> My application creates a text file stream using the Java streaming
> context. The input file is on HDFS:
>
> JavaDStream<String> textStream = ssc.textFileStream(InputFile);
>
> It then compares each line of the input stream with some data and filters
> it. The filtered data I am storing in a JavaDStream<String>.
>
> JavaDStream<String> suspectedStream =
>     textStream.flatMap(new FlatMapFunction<String, String>() {
>         @Override
>         public Iterable<String> call(String line) throws Exception {
>             List<String> filteredList = new ArrayList<String>();
>             // doing filter job
>             return filteredList;
>         }
>     });
>
> And this filteredList I am storing in HDFS as:
>
> suspectedStream.foreach(new Function<JavaRDD<String>, Void>() {
>     @Override
>     public Void call(JavaRDD<String> rdd) throws Exception {
>         rdd.saveAsTextFile(outputFolderPath);
>         return null;
>     }
> });
>
> But with this I am receiving
> org.apache.hadoop.mapred.FileAlreadyExistsException.
>
> I tried appending a random number to outputFolderPath, and that works,
> but my requirement is to collect all output in one directory.
>
> Can you please suggest if there is any way to get rid of this exception?
>
> Thanks,
> Shailesh
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
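The "newly named directory per batch" workaround from the reply can be sketched as below. This is a minimal sketch, not code from the thread: the class name, the `batchOutputPath` helper, and the `/suspected` base path are all hypothetical. The idea is to suffix the output path with the batch time, so every micro-batch writes to a directory that does not yet exist; in Spark's Java API the two-argument form of the DStream output operation hands you that batch `Time`.

```java
// Minimal sketch: derive a unique HDFS output directory per streaming batch.
// The class, helper, and base path are hypothetical, not from the thread.
public class BatchOutput {

    // Build a path like /suspected/batch-1413930660000 from the batch time,
    // so each micro-batch writes to a directory that does not yet exist and
    // FileAlreadyExistsException is avoided.
    public static String batchOutputPath(String base, long batchTimeMs) {
        return base + "/batch-" + batchTimeMs;
    }

    public static void main(String[] args) {
        // Inside the streaming job this would be used roughly as (assumption,
        // requires Spark on the classpath; uses the batch-time variant of the
        // DStream output operation):
        //
        //   suspectedStream.foreachRDD((rdd, time) ->
        //       rdd.saveAsTextFile(
        //           batchOutputPath("/suspected", time.milliseconds())));
        //
        // Two successive batch times yield two distinct directories:
        System.out.println(batchOutputPath("/suspected", 1413930660000L));
        System.out.println(batchOutputPath("/suspected", 1413930661000L));
    }
}
```

Each batch directory then contains its own `part-*` files, and a periodic job can fold them into the single directory Shailesh wants.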
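The periodic merge step can be illustrated in plain Java as below. This is a local-filesystem sketch of the same idea as the `hdfs dfs -cat ... | hdfs dfs -copyFromLocal -` pipeline above, not HDFS code (the class name and layout are assumptions): it concatenates every `part-` file found under a set of batch directories into one output file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.stream.Stream;

// Local-filesystem sketch of the interval merge suggested in the reply:
// concatenate all part files from several batch output directories into a
// single file. Against HDFS, the hdfs dfs -cat | -copyFromLocal pipeline
// shown in the reply does the equivalent.
public class MergeParts {

    public static void merge(Path outputFile, Path... batchDirs) throws IOException {
        try (OutputStream out = Files.newOutputStream(outputFile,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path dir : batchDirs) {
                try (Stream<Path> entries = Files.list(dir)) {
                    // saveAsTextFile names its files part-00000, part-00001, ...
                    Path[] parts = entries
                            .filter(p -> p.getFileName().toString().startsWith("part-"))
                            .sorted()
                            .toArray(Path[]::new);
                    for (Path part : parts) {
                        Files.copy(part, out); // append this part's bytes
                    }
                }
            }
        }
    }
}
```

Run on an interval (e.g. from cron), this collapses the per-batch directories into the one directory the application requires, without ever asking Spark to overwrite an existing path.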