Based on what I've read, it appears that when using Spark Streaming there is no good way of optimizing the files on HDFS. Spark Streaming writes many small files, which is not scalable in Apache Hadoop. The only other way seems to be to read the files after they have been written and merge them into a bigger file.
Any help would be appreciated.
On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
My question was how to do this in Hadoop. Could somebody point me to some examples?
On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote:
Of course, Java or Scala can do that:
1) Create a FileWriter with append or roll-over option
2) For each RDD create a StringBuilder after applying your filters
3) Write this StringBuilder to file when you want to write (the duration can be defined as a condition) - see the sketch below
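For illustration, here is a minimal Scala sketch of that recipe, assuming dstream is a DStream[String] whose batches are small enough to collect to the driver; the /tmp output path and the 64 MB roll-over threshold are made-up placeholders, not anything from this thread:

    import java.io.FileWriter

    // Driver-side rolling writer; a sketch, not production code.
    object RollingWriter {
      private val maxBytes = 64L * 1024 * 1024          // assumed roll-over threshold
      private var bytesWritten = 0L
      private var fileIndex = 0
      private var writer = new FileWriter(f"/tmp/stream-out-$fileIndex%05d.txt", true) // append mode

      def write(batch: StringBuilder): Unit = synchronized {
        writer.write(batch.toString)
        writer.flush()
        bytesWritten += batch.length
        if (bytesWritten >= maxBytes) {                 // roll over to a new file
          writer.close()
          fileIndex += 1
          bytesWritten = 0L
          writer = new FileWriter(f"/tmp/stream-out-$fileIndex%05d.txt", true)
        }
      }
    }

    dstream.foreachRDD { rdd =>
      val batch = new StringBuilder
      // apply your filters before collect(); collect() pulls the records to the driver
      rdd.collect().foreach(rec => batch.append(rec).append('\n'))
      if (batch.nonEmpty) RollingWriter.write(batch)
    }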
On Tue, Aug 18, 2015 at 11:05 PM,
Is there a way to store all the results in one file and keep the file roll-over separate from the Spark Streaming batch interval?
On Mon, Aug 17, 2015 at 2:39 AM, UMESH CHAUDHARY umesh9...@gmail.com wrote:
In Spark Streaming you can simply check whether your RDD contains any records or not, and if records are there you can save them using a FileOutputStream:

    dstream.foreachRDD { rdd =>
      val count = rdd.count()
      if (count > 0) {
        // SAVE YOUR STUFF
      }
    }
This will not create unnecessary files of 0 bytes.
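(As an aside, here is a sketch of an equivalent guard that avoids materializing a full count: rdd.isEmpty(), available since Spark 1.3, inspects only as much of the RDD as it needs. The HDFS output path below is a made-up placeholder.)

    dstream.foreachRDD { (rdd, time) =>
      // isEmpty is cheaper than count() when all you need is a non-empty check
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"hdfs:///user/me/stream-out/batch-${time.milliseconds}")
      }
    }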
On Mon,
Currently, Spark Streaming creates a new directory for every batch and stores the data in it (whether it has anything or not). There is no direct append call as of now, but you can achieve this with FileUtil.copyMerge, for example.
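For example, a sketch using the Hadoop 2.x copyMerge signature; both paths are made-up placeholders:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    // Merge all the small part-files under one batch directory into a single file
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    FileUtil.copyMerge(
      fs, new Path("/streaming/out/batch-001"),        // source dir of small files
      fs, new Path("/streaming/merged/batch-001.txt"), // single destination file
      true,                                            // delete source files after merging
      conf,
      null)                                            // no separator string between files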
Spark Streaming seems to be creating 0-byte files even when there is no data. Also, I have 2 concerns here:
1) Extra unnecessary files are being created in the output
2) Hadoop doesn't work really well with too many files, and I see that it is creating a directory with a timestamp every 1 second.