Re: Job spilling to disk and memory in Spark Streaming

2015-10-21 Thread Adrian Tanase
+1 – you can definitely make it work by making sure you are using the same partitioner (including the same number of partitions). For most operations like reduceByKey and updateStateByKey, simply specifying it is enough. There are some gotchas for other operations: * mapValues and …
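
A minimal sketch of the "same partitioner, same number of partitions" idea, assuming a hypothetical DStream[(String, Int)] named pairs and a StreamingContext named ssc (names are illustrative, not from the thread):

    import org.apache.spark.HashPartitioner

    // One shared partitioner instance; reusing it keeps the stages co-partitioned.
    val partitioner = new HashPartitioner(ssc.sparkContext.defaultParallelism)

    // reduceByKey partitions the data once with the shared partitioner.
    val counts = pairs.reduceByKey(_ + _, partitioner)

    // updateStateByKey with the same partitioner instance avoids
    // re-shuffling the already-partitioned counts.
    val state = counts.updateStateByKey(
      (values: Seq[Int], current: Option[Int]) =>
        Some(values.sum + current.getOrElse(0)),
      partitioner)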

Re: Job spilling to disk and memory in Spark Streaming

2015-10-21 Thread Tathagata Das
Well, reduceByKey needs to shuffle if your intermediate data is not already partitioned the same way as reduceByKey's partitioning. reduceByKey() has other signatures that take a partitioner, or simply a number of partitions, so you can set the same partitioner as in your previous stage.
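
For reference, a hedged sketch of the two overload styles mentioned, again assuming pairs is a DStream[(String, Int)] (the name is illustrative):

    import org.apache.spark.HashPartitioner

    // Overload taking a number of partitions
    // (Spark builds a HashPartitioner internally).
    val a = pairs.reduceByKey(_ + _, 32)

    // Overload taking an explicit partitioner; pass the same
    // instance used in the upstream stage to stay co-partitioned.
    val b = pairs.reduceByKey(_ + _, new HashPartitioner(32))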

Job spilling to disk and memory in Spark Streaming

2015-10-20 Thread swetha
Hi, currently I have a job that spills to disk and memory due to the use of reduceByKey and a lot of intermediate data in reduceByKey that gets shuffled. How can I use a custom partitioner in Spark Streaming for an intermediate stage so that the next stage that uses reduceByKey does not have to shuffle the data again?
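
One possible shape of what is being asked, following the replies above: shuffle once in the intermediate stage and reuse the same partitioner downstream. A sketch, assuming a hypothetical DStream named events whose records carry key and value fields:

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(64)  // assumed partition count

    // Pay the shuffle once in the intermediate stage...
    val prePartitioned = events
      .map(e => (e.key, e.value))              // hypothetical fields
      .transform(rdd => rdd.partitionBy(partitioner))

    // ...so reduceByKey with the same partitioner instance sees
    // co-partitioned data and does not shuffle again.
    val reduced = prePartitioned.reduceByKey(_ + _, partitioner)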