I am a beginner to Spark Streaming, so I have a basic doubt regarding
checkpoints. My use case is to calculate the number of unique users per
day. I am using reduceByKeyAndWindow for this, with a window duration of
24 hours and a slide duration of 5 minutes. I update the processed result
in MongoDB, currently replacing the existing record each time. But I see
that memory slowly increases over time and the process gets killed after
about an hour and a half (on an AWS small instance), and the first DB
write after the restart clears all the old data. So I understand that
checkpointing is the solution for this.
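
For context, my job is essentially the sketch below (socketTextStream and
the checkpoint path are just stand-ins for my real source and directory):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object UniqueUsersPerDay {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("unique-users-per-day")
    val ssc = new StreamingContext(conf, Seconds(10))

    // A checkpoint directory is mandatory for the windowed reduce with an
    // inverse function used below.
    ssc.checkpoint("hdfs:///checkpoints/unique-users")

    // Stand-in source: one user id per line.
    val users = ssc.socketTextStream("localhost", 9999)
      .map(userId => (userId, 1))

    // 24-hour window sliding every 5 minutes. The inverse function lets
    // Spark update the window incrementally instead of re-reducing 24
    // hours of data on every slide.
    val countsPerUser = users.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, // values entering the window
      (a: Int, b: Int) => a - b, // values leaving the window
      Minutes(24 * 60),          // window duration
      Minutes(5)                 // slide duration
    )

    // The DStream checkpoint interval can also be set explicitly, e.g.:
    // countsPerUser.checkpoint(Minutes(50))

    countsPerUser.foreachRDD { rdd =>
      // The inverse function leaves zero-count keys behind, so filter
      // them out before counting distinct users.
      val uniqueUsers = rdd.filter(_._2 > 0).count()
      // ... replace the day's record in MongoDB with uniqueUsers ...
    }

    ssc.start()
    ssc.awaitTermination()
  }
}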
  
My doubts are:

1. What should my checkpoint duration be? The documentation recommends
5-10 times the slide duration, but I need the data for the entire day.
Is it OK to keep it at 24 hours?

2. Where ideally should the checkpoint be? When I initially receive the
stream, just before the window operation, or after the reduction has
taken place? (I have pasted my current reading of the recovery setup
below.)
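
For the second question: from the documentation, my understanding is that
the checkpoint directory is set on the StreamingContext itself, and that
recovery after a crash goes through StreamingContext.getOrCreate with the
whole pipeline built inside the factory function. This is just how I read
the docs, so please correct me if the sketch is wrong (the path is again
a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UniqueUsersWithRecovery {
  val checkpointDir = "hdfs:///checkpoints/unique-users" // placeholder

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("unique-users-per-day")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // ... define the whole DStream pipeline here (map,
    //     reduceByKeyAndWindow, foreachRDD), as in the first sketch ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this calls createContext(); after a crash it
    // rebuilds the context (and the window state) from the checkpoint
    // directory instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}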

Appreciate your help.
Thank you


