Of course, Java or Scala can do that:

1) Create a FileWriter with an append or roll-over option.
2) For each RDD, build a StringBuilder after applying your filters.
3) Write this StringBuilder to the file whenever you want to write (the roll-over duration can be expressed as a condition); see the sketch below.
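A minimal sketch of that approach in Scala (the save logic is equally doable in Java). It assumes the filtered data per batch is small enough to collect to the driver; the output path and the keep() predicate are placeholders to replace with your own:

import java.io.{BufferedWriter, FileWriter}

dstream.foreachRDD { rdd =>
  val filtered = rdd.filter(record => keep(record)) // apply your filters here
  if (!filtered.isEmpty()) {
    val sb = new StringBuilder
    filtered.collect().foreach(r => sb.append(r).append('\n'))
    // FileWriter opened with append = true, so every batch lands in the
    // same file; swap the file name on your roll-over condition instead
    // of tying it to the batch interval.
    val writer = new BufferedWriter(new FileWriter("/tmp/stream-output.log", true))
    try writer.write(sb.toString()) finally writer.close()
  }
}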
On Tue, Aug 18, 2015 at 11:05 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

> Is there a way to store all the results in one file and keep the file
> roll-over separate from the Spark Streaming batch interval?
>
> On Mon, Aug 17, 2015 at 2:39 AM, UMESH CHAUDHARY <umesh9...@gmail.com> wrote:
>
>> In Spark Streaming you can simply check whether your RDD contains any
>> records, and if it does, save them using a FileOutputStream:
>>
>> dstream.foreachRDD { t =>
>>   val count = t.count()
>>   if (count > 0) { /* SAVE YOUR STUFF */ }
>> }
>>
>> This will not create unnecessary files of 0 bytes.
>>
>> On Mon, Aug 17, 2015 at 2:51 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>
>>> Currently, Spark Streaming creates a new directory for every batch and
>>> stores the data in it (whether it has anything or not). There is no
>>> direct append call as of now, but you can achieve this either with
>>> FileUtil.copyMerge
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167>
>>> or have a separate program do the clean-up for you.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>>>
>>>> Spark Streaming seems to be creating 0-byte files even when there is
>>>> no data. Also, I have two concerns here:
>>>>
>>>> 1) Extra unnecessary files are being created in the output.
>>>> 2) Hadoop doesn't work well with too many files, and I see that it is
>>>>    creating a directory with a timestamp every 1 second. Is there a
>>>>    better way of writing a file, maybe using some kind of append
>>>>    mechanism where one doesn't have to change the batch interval?
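PS: for reference, the FileUtil.copyMerge route Akhil mentions would look
roughly like the sketch below. The paths are placeholders, and it assumes
the Hadoop 2.x API, where copyMerge is still available:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
// Merge every part file under the streaming output directory into a
// single HDFS file; "true" deletes the source directory afterwards.
FileUtil.copyMerge(fs, new Path("/streaming/output"),
                   fs, new Path("/streaming/merged/output.txt"),
                   true, conf, null)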