Any help would be appreciated.

On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
My question was how to do this in Hadoop? Could somebody point me to some examples?

On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY <umesh9...@gmail.com> wrote:

Of course, Java or Scala can do that:
1) Create a FileWriter with an append or roll-over option
2) For each RDD, build a StringBuilder after applying your filters
3) Write this StringBuilder to the file when you want to write (the duration can be defined as a condition)

On Tue, Aug 18, 2015 at 11:05 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

Is there a way to store all the results in one file, and keep the file roll-over separate from the Spark Streaming batch interval?

On Mon, Aug 17, 2015 at 2:39 AM, UMESH CHAUDHARY <umesh9...@gmail.com> wrote:

In Spark Streaming you can simply check whether your RDD contains any records, and if records are there you can save them using FileOutputStream:

    DStream.foreachRDD(t => {
      val count = t.count()
      if (count > 0) {
        // SAVE YOUR STUFF
      }
    })

This will not create unnecessary files of 0 bytes.

On Mon, Aug 17, 2015 at 2:51 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

Currently, Spark Streaming creates a new directory for every batch and stores the data in it (whether it has anything or not). There is no direct append call as of now, but you can achieve this either with FileUtil.copyMerge
<http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167>
or have a separate program that does the clean-up for you.

Thanks
Best Regards

On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

Spark Streaming seems to be creating 0-byte files even when there is no data.
Also, I have 2 concerns here:

1) Extra unnecessary files are being created from the output
2) Hadoop doesn't work really well with too many files, and I see that it is creating a directory with a timestamp every 1 second. Is there a better way of writing a file, maybe using some kind of append mechanism where one doesn't have to change the batch interval?
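Umesh's three steps earlier in the thread (a FileWriter with an append/roll-over option, building a batch of text, and writing it when a condition is met) can be sketched roughly as below. This is a minimal, non-Spark sketch in Java (the thread says either Java or Scala works): the RollingWriter name, the size-based roll-over threshold, and the file-naming scheme are all illustrative assumptions, not anything the thread specifies. In a real job you would call append() from inside foreachRDD.

```java
import java.io.FileWriter;
import java.io.IOException;

// Minimal rolling appender: step 1 (FileWriter in append mode) plus
// step 3 (roll over when a condition -- here, accumulated size -- is met).
// Class name, threshold, and file naming are illustrative assumptions.
class RollingWriter {
    private final String baseName;
    private final long maxBytes;
    private long written = 0;
    private int fileIndex = 0;

    RollingWriter(String baseName, long maxBytes) {
        this.baseName = baseName;
        this.maxBytes = maxBytes;
    }

    private String currentFile() {
        return baseName + "." + fileIndex;
    }

    // Step 2 in the thread builds a StringBuilder per RDD; here we take
    // the already-built batch of text and append it to the current file.
    synchronized void append(String batch) throws IOException {
        if (written > 0 && written + batch.length() > maxBytes) {
            fileIndex++;      // condition met: roll over to a new file
            written = 0;
        }
        // true = append mode, so repeated batches land in one file
        try (FileWriter fw = new FileWriter(currentFile(), true)) {
            fw.write(batch);
        }
        written += batch.length();
    }

    public static void main(String[] args) throws IOException {
        // Start clean so reruns are deterministic.
        java.nio.file.Files.deleteIfExists(java.nio.file.Paths.get("out.log.0"));
        java.nio.file.Files.deleteIfExists(java.nio.file.Paths.get("out.log.1"));

        RollingWriter w = new RollingWriter("out.log", 20);
        w.append("first batch\n");   // 12 chars -> stays in out.log.0
        w.append("second batch\n");  // would exceed 20 -> rolls to out.log.1
        System.out.println(java.nio.file.Files.readAllLines(
                java.nio.file.Paths.get("out.log.1")).get(0)); // prints "second batch"
    }
}
```

The roll-over condition here is byte-based purely for illustration; a time-based condition (e.g. roll every N minutes) would decouple the output file boundaries from the streaming batch interval, which is what Mohit asked about.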