I am a beginner to spark streaming. So have a basic doubt regarding checkpoints. My use case is to calculate the no of unique users by day. I am using reduce by key and window for this. Where my window duration is 24 hours and slide duration is 5 mins. I am updating the processed record to mongodb. Currently I am replace the existing record each time. But I see the memory is slowly increasing over time and kills the process after 1 and 1/2 hours(in aws small instance). The DB write after the restart clears all the old data. So I understand checkpoint is the solution for this. But my doubt is What should my check point duration be..? As per documentation it says 5-10 times of slide duration. But I need the data of entire day. So it is ok to keep 24 hrs. Where ideally should the checkpoint be..? Initially when I receive the stream or just before the window operation or after the data reduction has taken place.
Appreciate your help. Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html Sent from the Apache Spark User List mailing list archive at Nabble.com.