Currently, Spark Streaming creates a new directory for every batch and writes the data there (whether the batch contains anything or not). There is no direct append call as of now, but you can achieve this either with FileUtil.copyMerge <http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167> or with a separate program that does the cleanup for you.
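As a rough sketch of the copyMerge approach (the paths below are placeholders, and this assumes your Hadoop configuration is on the classpath), something like this would collapse one batch's output directory into a single HDFS file:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs   = FileSystem.get(conf)

// Placeholder paths: the directory Spark Streaming wrote for one batch,
// and the single file you want the part files merged into.
val srcDir  = new Path("/output/streaming-batch-dir")
val dstFile = new Path("/output/merged/result.txt")

// deleteSource = true removes the batch directory after a successful merge;
// the final argument is an optional string appended after each file (none here).
FileUtil.copyMerge(fs, srcDir, fs, dstFile, true, conf, null)
```

You could run this from a small cleanup job on a schedule, merging (and deleting) the per-batch directories so the number of files on HDFS stays bounded.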
Thanks
Best Regards

On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

> Spark stream seems to be creating 0 bytes files even when there is no
> data. Also, I have 2 concerns here:
>
> 1) Extra unnecessary files is being created from the output
> 2) Hadoop doesn't work really well with too many files and I see that it
> is creating a directory with a timestamp every 1 second. Is there a better
> way of writing a file, may be use some kind of append mechanism where one
> doesn't have to change the batch interval.