Re: Too many files/dirs in hdfs

2015-08-25 Thread Mohit Anchlia
Based on what I've read, it appears that when using Spark Streaming there is no good way of optimizing the files on HDFS. Spark Streaming writes many small files, which is not scalable in Apache Hadoop. The only other way seems to be to read the files after they have been written and merge them into a bigger
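[Editor's note: one common way to reduce the small-files problem at the source, not named in this preview, is to coalesce each non-empty batch before saving so every batch produces a single part file. A minimal sketch, assuming a socket source on localhost:9999 and an illustrative output path:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CoalesceBeforeWrite {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("coalesce-before-write")
        // A longer batch interval also means fewer output directories.
        val ssc = new StreamingContext(conf, Seconds(60))
        val lines = ssc.socketTextStream("localhost", 9999) // illustrative source

        lines.foreachRDD { (rdd, time) =>
          if (!rdd.isEmpty()) {
            // One partition => one part file per non-empty batch.
            rdd.coalesce(1).saveAsTextFile(s"hdfs:///data/out/batch-${time.milliseconds}")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

This trades write parallelism for fewer, larger files: coalesce(1) funnels the whole batch through one task, which is fine for small batches but can bottleneck large ones.]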

Re: Too many files/dirs in hdfs

2015-08-24 Thread Mohit Anchlia
Any help would be appreciated. On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia mohitanch...@gmail.com wrote: My question was how to do this in Hadoop? Could somebody point me to some examples? On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: Of course, Java or

Re: Too many files/dirs in hdfs

2015-08-19 Thread Mohit Anchlia
My question was how to do this in Hadoop? Could somebody point me to some examples? On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: Of course, Java or Scala can do that: 1) Create a FileWriter with an append or roll-over option. 2) For each RDD, create a StringBuilder

Re: Too many files/dirs in hdfs

2015-08-18 Thread Mohit Anchlia
Is there a way to store all the results in one file and keep the file roll-over separate from the Spark Streaming batch interval? On Mon, Aug 17, 2015 at 2:39 AM, UMESH CHAUDHARY umesh9...@gmail.com wrote: In Spark Streaming you can simply check whether your RDD contains any records or not and

Re: Too many files/dirs in hdfs

2015-08-18 Thread UMESH CHAUDHARY
Of course, Java or Scala can do that: 1) Create a FileWriter with an append or roll-over option. 2) For each RDD, create a StringBuilder after applying your filters. 3) Write this StringBuilder to the file when you want to write (the duration can be defined as a condition). On Tue, Aug 18, 2015 at 11:05 PM,
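[Editor's note: a minimal sketch of steps 1-3 against HDFS rather than a local FileWriter, assuming dfs.support.append is enabled; the file path and the helper name appendBatch are illustrative:

    import java.nio.charset.StandardCharsets
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Step 1: open the file in append mode (or create it on first use).
    // Step 2: build the batch payload in a StringBuilder.
    // Step 3: write it out; rolling over is just a matter of picking a new path.
    def appendBatch(records: Iterator[String],
                    file: String = "hdfs:///data/rolling/current.log"): Unit = {
      val path = new Path(file)
      val fs   = path.getFileSystem(new Configuration())
      val out  = if (fs.exists(path)) fs.append(path) else fs.create(path)
      try {
        val sb = new StringBuilder
        records.foreach(r => sb.append(r).append('\n'))
        out.write(sb.toString.getBytes(StandardCharsets.UTF_8))
      } finally {
        out.close()
      }
    }

From a stream this could be driven on the driver with, e.g., dstream.foreachRDD(rdd => appendBatch(rdd.collect().iterator)), which is only sensible for small batches since collect() pulls everything to the driver.]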

Re: Too many files/dirs in hdfs

2015-08-17 Thread UMESH CHAUDHARY
In Spark Streaming you can simply check whether your RDD contains any records or not, and if records are there you can save them using FileOutputStream: DStream.foreachRDD(t => { val count = t.count(); if (count > 0) { // SAVE YOUR STUFF } }). This will not create unnecessary files of 0 bytes. On Mon,
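[Editor's note: the same guard, wrapped as a reusable sketch; the stream's element type and the output directory are illustrative, and rdd.isEmpty() replaces count() because it inspects at most one element rather than scanning the whole RDD:

    import org.apache.spark.streaming.dstream.DStream

    // Save only non-empty batches; empty batches write nothing,
    // so no 0-byte files or empty directories are created.
    def saveNonEmpty(stream: DStream[String], dir: String): Unit =
      stream.foreachRDD { (rdd, time) =>
        // isEmpty() looks at the first element only; count() scans everything.
        if (!rdd.isEmpty()) {
          rdd.saveAsTextFile(s"$dir/batch-${time.milliseconds}")
        }
      }
]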

Re: Too many files/dirs in hdfs

2015-08-17 Thread Akhil Das
Currently, Spark Streaming creates a new directory for every batch and stores the data in it (whether it has anything or not). There is no direct append call as of now, but you can achieve this either with FileUtil.copyMerge
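[Editor's note: a sketch of the FileUtil.copyMerge route, merging one batch directory's part files into a single file; paths are illustrative, and note that copyMerge exists in Hadoop 2.x but was removed in Hadoop 3.0:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val conf = new Configuration()
    val fs   = FileSystem.get(conf)
    FileUtil.copyMerge(
      fs, new Path("hdfs:///data/out/batch-1439569200000"),     // source dir of part files
      fs, new Path("hdfs:///data/merged/out-1439569200000.txt"), // single destination file
      false,                                                     // keep the source files
      conf,
      null)                                                      // no separator between merged files
]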

Too many files/dirs in hdfs

2015-08-14 Thread Mohit Anchlia
Spark Streaming seems to be creating 0-byte files even when there is no data. Also, I have 2 concerns here: 1) Extra unnecessary files are being created in the output. 2) Hadoop doesn't work really well with too many files, and I see that it is creating a directory with a timestamp every 1 second.
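[Editor's note: for reference, a minimal sketch of the pattern that produces this behavior, assuming a socket source and an illustrative output prefix. DStream.saveAsTextFiles writes one directory named <prefix>-<batchTimeMs> per batch interval, part files included even when the batch is empty:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf  = new SparkConf().setAppName("per-batch-dirs")
    val ssc   = new StreamingContext(conf, Seconds(1))   // 1-second batches
    val lines = ssc.socketTextStream("localhost", 9999)  // illustrative source
    // Emits hdfs:///data/out-<timeMs>/ once per second, empty batches included.
    lines.saveAsTextFiles("hdfs:///data/out")
    ssc.start()
    ssc.awaitTermination()
]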