Of course, Java or Scala can do that:

1) Create a FileWriter with an append or roll-over option.
2) For each RDD, build a StringBuilder after applying your filters.
3) Write this StringBuilder to the file whenever you want to write (the roll-over duration can be expressed as a condition); see the sketch below.
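A minimal sketch of that approach in Scala (the save logic is equally doable in Java). It assumes the filtered data per batch is small enough to collect to the driver; the output path and the keep() predicate are placeholders to replace with your own:

import java.io.{BufferedWriter, FileWriter}

dstream.foreachRDD { rdd =>
  val filtered = rdd.filter(record => keep(record)) // apply your filters here
  if (!filtered.isEmpty()) {
    val sb = new StringBuilder
    filtered.collect().foreach(r => sb.append(r).append('\n'))
    // FileWriter opened with append = true, so every batch lands in the
    // same file; swap the file name on your roll-over condition instead
    // of tying it to the batch interval.
    val writer = new BufferedWriter(new FileWriter("/tmp/stream-output.log", true))
    try writer.write(sb.toString()) finally writer.close()
  }
}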
On Tue, Aug 18, 2015 at 11:05 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

> Is there a way to store all the results in one file and keep the file
> roll-over separate from the Spark Streaming batch interval?
>
> On Mon, Aug 17, 2015 at 2:39 AM, UMESH CHAUDHARY <umesh9...@gmail.com> wrote:
>
>> In Spark Streaming you can simply check whether your RDD contains any
>> records, and if it does, save them using a FileOutputStream:
>>
>> dstream.foreachRDD { t =>
>>   val count = t.count()
>>   if (count > 0) { /* SAVE YOUR STUFF */ }
>> }
>>
>> This will not create unnecessary files of 0 bytes.
>>
>> On Mon, Aug 17, 2015 at 2:51 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>
>>> Currently, Spark Streaming creates a new directory for every batch and
>>> stores the data in it (whether it has anything or not). There is no
>>> direct append call as of now, but you can achieve this either with
>>> FileUtil.copyMerge
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167>
>>> or have a separate program do the clean-up for you.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>>>
>>>> Spark Streaming seems to be creating 0-byte files even when there is
>>>> no data. Also, I have two concerns here:
>>>>
>>>> 1) Extra unnecessary files are being created in the output.
>>>> 2) Hadoop doesn't work well with too many files, and I see that it is
>>>>    creating a directory with a timestamp every 1 second. Is there a
>>>>    better way of writing a file, maybe using some kind of append
>>>>    mechanism where one doesn't have to change the batch interval?
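PS: for reference, the FileUtil.copyMerge route Akhil mentions would look
roughly like the sketch below. The paths are placeholders, and it assumes
the Hadoop 2.x API, where copyMerge is still available:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
// Merge every part file under the streaming output directory into a
// single HDFS file; "true" deletes the source directory afterwards.
FileUtil.copyMerge(fs, new Path("/streaming/output"),
                   fs, new Path("/streaming/merged/output.txt"),
                   true, conf, null)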