Thank you, Jerry. Does the window operation create new RDDs for each slide duration? I am asking because I see a constant increase in memory even when no logs are received. If checkpointing is not the answer, is there an alternative that you would suggest?
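For reference, the retention behaviour described in the quoted reply below can be sketched in plain Python. This is not Spark code, only a rough model of a sliding window that keeps every batch still inside the window. It assumes one batch per slide interval (the thread does not state the batch interval), and the name `retained_batches` is made up for illustration.

```python
from collections import deque

WINDOW_SECONDS = 24 * 60 * 60  # window duration from the thread: 24 hours
SLIDE_SECONDS = 5 * 60         # slide duration from the thread: 5 minutes
# Assumption: one batch arrives per slide interval; the thread does not say.
BATCH_SECONDS = SLIDE_SECONDS


def retained_batches(elapsed_seconds, window=WINDOW_SECONDS, batch=BATCH_SECONDS):
    """Simulate a sliding window and return how many batches it still holds."""
    held = deque()
    for t in range(0, elapsed_seconds, batch):
        # A new batch enters the window; the model counts it whether or not
        # it carries any records.
        held.append(t)
        # Drop batches that have fallen out of the window (t - window, t].
        while held and held[0] <= t - window:
            held.popleft()
    return len(held)


if __name__ == "__main__":
    for hours in (1.5, 24, 48):
        count = retained_batches(int(hours * 3600))
        print(f"after {hours} h: {count} batches held")
```

Under these assumptions the model holds 18 batches after 1.5 hours and levels off at 288 batches (24 h / 5 min) once a full day has elapsed, so steady growth during the first 24 hours would be consistent with the window simply filling up.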
On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai <saisai.s...@intel.com> wrote:

> Hi,
>
> It seems you have a very large window (24 hours), so the increase in memory is expected: the window operation caches the RDDs within the window in memory. For your requirement, memory must be large enough to hold 24 hours of data.
>
> I don't think checkpointing in Spark Streaming can alleviate this problem, because checkpoints are mainly for fault tolerance.
>
> Thanks
> Jerry
>
> *From:* balu.naren [mailto:balu.na...@gmail.com]
> *Sent:* Tuesday, January 20, 2015 7:17 PM
> *To:* user@spark.apache.org
> *Subject:* spark streaming with checkpoint
>
> I am a beginner with Spark Streaming, so I have a basic question about checkpoints. My use case is to calculate the number of unique users per day. I am using reduce by key and window for this, with a window duration of 24 hours and a slide duration of 5 minutes. I write the processed records to MongoDB, currently replacing the existing record each time. But I see memory slowly increasing over time, and the process is killed after about 1.5 hours (on an AWS small instance). The DB write after the restart clears all the old data. So I understand that checkpointing is the solution for this. My questions are:
>
> - What should my checkpoint duration be? The documentation says 5-10 times the slide duration, but I need the data for the entire day. So is it OK to keep 24 hours?
> - Where ideally should the checkpoint be? When I initially receive the stream, just before the window operation, or after the data reduction has taken place?
>
> Appreciate your help.
> Thank you