My question was how to do this in Hadoop? Could somebody point me to some examples?
On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY <umesh9...@gmail.com> wrote:

> Of course, Java or Scala can do that:
> 1) Create a FileWriter with an append or roll-over option
> 2) For each RDD, create a StringBuilder after applying your filters
> 3) Write this StringBuilder to the file when you want to write (the
> duration can be defined as a condition)
>
> On Tue, Aug 18, 2015 at 11:05 PM, Mohit Anchlia <mohitanch...@gmail.com>
> wrote:
>
>> Is there a way to store all the results in one file and keep the file
>> roll-over separate from the Spark Streaming batch interval?
>>
>> On Mon, Aug 17, 2015 at 2:39 AM, UMESH CHAUDHARY <umesh9...@gmail.com>
>> wrote:
>>
>>> In Spark Streaming you can simply check whether your RDD contains any
>>> records, and if it does, save them using FileOutputStream:
>>>
>>> dstream.foreachRDD(t => {
>>>   val count = t.count()
>>>   if (count > 0) { /* SAVE YOUR STUFF */ }
>>> })
>>>
>>> This will not create unnecessary files of 0 bytes.
>>>
>>> On Mon, Aug 17, 2015 at 2:51 PM, Akhil Das <ak...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> Currently, Spark Streaming creates a new directory for every batch
>>>> and stores the data in it (whether it has anything or not). There is no
>>>> direct append call as of now, but you can achieve this either with
>>>> FileUtil.copyMerge
>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167>
>>>> or by having a separate program do the cleanup for you.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia <mohitanch...@gmail.com>
>>>> wrote:
>>>>
>>>>> Spark Streaming seems to be creating 0-byte files even when there is
>>>>> no data. I have 2 concerns here:
>>>>>
>>>>> 1) Extra unnecessary files are being created in the output
>>>>> 2) Hadoop doesn't work really well with too many files, and I see that
>>>>> it is creating a directory with a timestamp every 1 second. Is there a
>>>>> better way of writing a file, maybe using some kind of append mechanism
>>>>> where one doesn't have to change the batch interval?
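
For reference, the three steps suggested in the quoted reply (append-mode FileWriter, a StringBuilder per batch, roll-over on a condition) could look roughly like the plain-Java sketch below. This is only an illustration under assumptions: the class name RollingAppendWriter, the file naming scheme, and the size-based roll-over condition are all made up here, and it writes to the local filesystem rather than HDFS. On HDFS the same shape should apply if the FileWriter is swapped for the stream returned by Hadoop's FileSystem.append(path) (or FileSystem.create(path) for a new file), assuming append is enabled on the cluster; that part is not shown.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

/** Hypothetical helper: appends batches to one file and rolls over by size. */
class RollingAppendWriter {
    private final File dir;
    private final String prefix;
    private final long maxBytes; // assumed roll-over condition: file size
    private int index = 0;

    RollingAppendWriter(File dir, String prefix, long maxBytes) {
        this.dir = dir;
        this.prefix = prefix;
        this.maxBytes = maxBytes;
    }

    /** The file currently being appended to. */
    File currentFile() {
        return new File(dir, prefix + "-" + index + ".txt");
    }

    /** Appends one batch; an empty batch writes nothing and creates no file. */
    void writeBatch(List<String> records) {
        if (records.isEmpty()) {
            return;
        }
        // Roll over before writing if the current file has reached the limit.
        // File.length() is 0 for a file that does not exist yet.
        while (currentFile().length() >= maxBytes) {
            index++;
        }
        // Build the whole batch in memory, one record per line.
        StringBuilder sb = new StringBuilder();
        for (String r : records) {
            sb.append(r).append('\n');
        }
        // The second constructor argument 'true' opens the file in append mode.
        try (FileWriter out = new FileWriter(currentFile(), true)) {
            out.write(sb.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because roll-over depends only on the file size, it is independent of the streaming batch interval: each batch would call writeBatch once, and a new file starts only after the current one has grown past maxBytes. A time-based condition could be substituted in the same place.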