On Saturday 05 March 2016 02:39 AM, Jelez Raditchkov wrote:
My streaming job is creating files on S3.
The problem is that those files end up very small if I just write them to S3 directly.
This is why I use coalesce() to reduce the number of files and make them larger.


RDD.coalesce, right? It accepts a shuffle flag as an argument. If you are reducing the number of partitions and pass shuffle = false, it should not cause a shuffle.

dstream.foreachRDD { rdd =>
  // halve the number of partitions; shuffle = false avoids a shuffle
  val numParts = rdd.partitions.length
  rdd.coalesce(numParts / 2, shuffle = false)....
}

Another option is to combine multiple RDDs: set an appropriate remember duration (StreamingContext.remember), store the incoming RDDs in a fixed-size list/array, and then periodically, when the list is full, process all the cached RDDs in one go (combining them with RDD.zipPartitions). You may have to keep the remember duration somewhat larger than the duration corresponding to the list size to account for processing time. A rough sketch of that approach is below.
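Just a sketch of the idea, assuming dstream is a DStream[String], ssc is the StreamingContext, and every batch RDD has the same number of partitions (zipPartitions requires that); batchesToCombine, numOutputParts, outputPath and batchIntervalSeconds are illustrative values, not an existing API:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds

val batchIntervalSeconds = 10L          // assumed batch interval
val batchesToCombine = 4                // size of the fixed list
val numOutputParts = 8                  // target number of output files per write
val outputPath = "s3n://bucket/output"  // hypothetical output location

// keep remembered RDDs around longer than batchInterval * list size,
// to leave room for processing time
ssc.remember(Seconds(batchIntervalSeconds * batchesToCombine * 2))

val cached = ArrayBuffer.empty[RDD[String]]

dstream.foreachRDD { rdd =>
  cached += rdd
  if (cached.size == batchesToCombine) {
    // merge the cached RDDs pairwise with zipPartitions
    // (equal partition counts assumed)
    val combined = cached.reduce { (a, b) =>
      a.zipPartitions(b)((left, right) => left ++ right)
    }
    // write fewer, larger files without a shuffle
    combined.coalesce(numOutputParts, shuffle = false)
      .saveAsTextFile(outputPath + "/" + System.currentTimeMillis())
    cached.clear()
  }
}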

However, coalesce shuffles data and my job processing time ends up higher than sparkBatchIntervalMilliseconds.

I have observed that if I coalesce to a number of partitions equal to the number of cores in the cluster I get less shuffling - but that is unsubstantiated.
Is there any dependency/rule between the number of executors, the number of cores, etc. that I can use to minimize shuffling and at the same time achieve the minimum number of output files per batch?
What is the best practice?


I think most DStreams (Kafka streams can be exceptions) will create the same number of partitions as the total number of executor cores (spark.default.parallelism). Perhaps that is why you are seeing the above behaviour. It looks like the shuffle should be avoidable in your case, but if you use coalesce you will likely not use the full processing power of the cluster, since fewer partitions means fewer tasks running in parallel.
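To confirm whether that is what is happening, a quick diagnostic (not from the original thread) could print the per-batch partition count next to the default parallelism:

dstream.foreachRDD { rdd =>
  // compare the batch's partition count with the cluster-wide default parallelism
  println(s"batch partitions = ${rdd.partitions.length}, " +
    s"spark.default.parallelism = ${rdd.sparkContext.defaultParallelism}")
}

If the two numbers match for every batch, coalescing below that value trades parallelism for fewer output files.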


thanks
-- 
Sumedh Wale
SnappyData (http://www.snappydata.io)

