Hi,

With a window as large as 24 hours, the memory growth you are seeing is expected: a window operation caches every RDD that falls within the window in memory. So for your requirement, the cluster needs enough memory to hold a full 24 hours of data.
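For illustration, here is a minimal sketch of this kind of windowed job; the input source, app name, and batch interval are placeholders, not taken from your actual setup:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits (needed before Spark 1.3)

val conf = new SparkConf().setAppName("UniqueUsersByDay")
val ssc  = new StreamingContext(conf, Seconds(30)) // placeholder batch interval

// Placeholder source: a stream of user IDs, one per line.
val users = ssc.socketTextStream("localhost", 9999).map(id => (id, 1L))

// Every batch RDD that falls inside the 24-hour window stays cached,
// which is why memory grows with the window length.
val countsByWindow = users.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b, // merge counts as batches enter the window
  Minutes(24 * 60),            // window duration: 24 hours
  Minutes(5)                   // slide duration: 5 minutes
)

// Number of distinct users currently in the window.
val uniqueUsers = countsByWindow.count()
uniqueUsers.print() // in your job this would be the MongoDB write

ssc.start()
ssc.awaitTermination()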
I don't think checkpointing in Spark Streaming can alleviate this problem, because checkpoints are mainly for fault tolerance.

Thanks,
Jerry

From: balu.naren [mailto:balu.na...@gmail.com]
Sent: Tuesday, January 20, 2015 7:17 PM
To: user@spark.apache.org
Subject: spark streaming with checkpoint

I am a beginner to Spark Streaming, so I have a basic doubt regarding checkpoints. My use case is to calculate the number of unique users by day. I am using reduceByKeyAndWindow for this, with a window duration of 24 hours and a slide duration of 5 minutes, and I write the processed result to MongoDB, currently replacing the existing record each time. But I see that memory slowly increases over time and the process is killed after about an hour and a half (on an AWS small instance), and the DB write after the restart clears all the old data. So I understand that checkpointing is the solution for this. My doubts are:

* What should my checkpoint duration be? The documentation recommends 5-10 times the slide duration, but I need the data of the entire day. Is it OK to keep it at 24 hours?
* Where ideally should the checkpoint be placed: when I first receive the stream, just before the window operation, or after the reduction has taken place?

Appreciate your help. Thank you.
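For completeness, a minimal sketch of how checkpointing is typically wired up in Spark Streaming; the checkpoint directory, app name, and batch interval below are hypothetical. Note that the checkpoint directory is set on the StreamingContext itself, before any DStream operations are defined, rather than at a particular point in the pipeline:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

// Hypothetical checkpoint directory; a real job would point at HDFS or S3.
val checkpointDir = "hdfs:///tmp/unique-users-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("UniqueUsersByDay")
  val ssc  = new StreamingContext(conf, Seconds(30))
  // Enable checkpointing on the context before defining any DStreams;
  // stateful and windowed operators then write their data under this
  // directory automatically.
  ssc.checkpoint(checkpointDir)
  // ... define the sources and the windowed computation here ...
  // A per-stream checkpoint interval can also be tuned with
  // dstream.checkpoint(Minutes(...)).
  ssc
}

// On restart, recover the context (and its window state) from the
// checkpoint if one exists; otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()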