On Saturday 05 March 2016 02:39 AM, Jelez Raditchkov wrote:
RDD.coalesce, right? It accepts whether or not to shuffle as an argument. If you are reducing the number of partitions, it should not cause a shuffle.

    dstream.foreachRDD { rdd =>
      // halve the number of partitions without a shuffle
      val numParts = rdd.partitions.length
      rdd.coalesce(numParts / 2, shuffle = false)....
    }

Another option is to combine multiple RDDs: set an appropriate remember duration (StreamingContext.remember), store the RDDs in a fixed-size list/array, and then periodically process all the cached RDDs in one go when the list is full (combining them with RDD.zipPartitions). You may have to keep the remember duration somewhat larger than the duration corresponding to the list size to account for processing time.
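A rough, self-contained sketch of that buffering approach (the batch interval, buffer size, input stream, and output path here are all illustrative, not from your job; note also that zipPartitions requires every buffered RDD to have the same number of partitions):

    import scala.collection.mutable.ArrayBuffer

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BatchedMerge {
      def main(args: Array[String]): Unit = {
        val batchInterval = 10 // seconds; illustrative
        val bufferSize = 4     // number of batches to combine
        val conf = new SparkConf().setAppName("BatchedMerge")
        val ssc = new StreamingContext(conf, Seconds(batchInterval))

        // Illustrative source; substitute your actual input stream.
        val dstream = ssc.socketTextStream("localhost", 9999)

        // Remember RDDs beyond bufferSize batches so they are still
        // available when the buffer fills (extra slack for processing).
        ssc.remember(Seconds(batchInterval * (bufferSize + 2)))

        val buffer = new ArrayBuffer[RDD[String]]()
        dstream.foreachRDD { rdd =>
          buffer += rdd
          if (buffer.length == bufferSize) {
            // zipPartitions concatenates corresponding partitions
            // pairwise without shuffling.
            val combined = buffer.reduce { (a, b) =>
              a.zipPartitions(b)((itA, itB) => itA ++ itB)
            }
            // Write one merged file set instead of many small
            // per-batch files.
            combined.coalesce(1, shuffle = false)
              .saveAsTextFile(s"/tmp/merged-${System.currentTimeMillis()}")
            buffer.clear()
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }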
I think most DStreams (Kafka streams can be exceptions) will create the same number of partitions as the total number of executor cores (spark.default.parallelism). Perhaps that is why you are seeing the above behaviour. It looks like the shuffle should be avoidable in your case, but if you use coalesce you will likely not use the full processing power.

thanks

--
Sumedh Wale
SnappyData (http://www.snappydata.io)
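P.S. A quick way to verify the partitioning behaviour (just a sketch; "dstream" stands for whatever input stream you are using):

    dstream.foreachRDD { rdd =>
      // Compare each batch's partition count against the default
      // parallelism (total executor cores unless overridden).
      println(s"batch partitions = ${rdd.partitions.length}, " +
        s"default parallelism = ${rdd.sparkContext.defaultParallelism}")
    }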