Currently, Spark Streaming creates a new directory for every batch and writes the data there (whether the batch contains anything or not). There is no direct append call as of now, but you can achieve this either with FileUtil.copyMerge <http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167> or with a separate program that does the cleanup for you.
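As a rough sketch of the copyMerge approach (the paths below are placeholders, and this assumes your Hadoop configuration is on the classpath), something like this would collapse one batch's output directory into a single HDFS file:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs   = FileSystem.get(conf)

// Placeholder paths: the directory Spark Streaming wrote for one batch,
// and the single file you want the part files merged into.
val srcDir  = new Path("/output/streaming-batch-dir")
val dstFile = new Path("/output/merged/result.txt")

// deleteSource = true removes the batch directory after a successful merge;
// the final argument is an optional string appended after each file (none here).
FileUtil.copyMerge(fs, srcDir, fs, dstFile, true, conf, null)
```

You could run this from a small cleanup job on a schedule, merging (and deleting) the per-batch directories so the number of files on HDFS stays bounded.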
Thanks
Best Regards

On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

> Spark stream seems to be creating 0 bytes files even when there is no
> data. Also, I have 2 concerns here:
>
> 1) Extra unnecessary files is being created from the output
> 2) Hadoop doesn't work really well with too many files and I see that it
> is creating a directory with a timestamp every 1 second. Is there a better
> way of writing a file, may be use some kind of append mechanism where one
> doesn't have to change the batch interval.