Maybe you are using the wrong approach. Try something like HyperLogLog or bitmap structures, as found for instance in Redis; they are much smaller.
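For illustration, a minimal sketch of that approach with a Redis HyperLogLog (this is an assumption on my part, not something from the thread: it assumes the Jedis client on the classpath and a Redis server on localhost; the key name and user IDs are placeholders):

import redis.clients.jedis.Jedis

object UniqueUsersHll {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("localhost", 6379)

    // One HyperLogLog key per day; PFADD is idempotent, so replayed
    // events do not inflate the count.
    val key = "unique_users:2015-01-22"
    jedis.pfadd(key, "user-1", "user-2", "user-1")

    // PFCOUNT returns an approximate cardinality (~0.81% standard error)
    // while the key itself stays around 12 KB regardless of user count.
    println(s"Approximate unique users today: ${jedis.pfcount(key)}")

    // Drop the key once the day's count is no longer needed.
    jedis.expire(key, 48 * 3600)
    jedis.close()
  }
}

Each day's key could be filled from a foreachRDD on the stream and the daily count written to MongoDB exactly as before, so no 24-hour window has to be kept in Spark memory.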
On 22 Jan 2015 17:19, "Balakrishnan Narendran" <balu.na...@gmail.com> wrote:

> Thank you Jerry,
> Does the window operation create new RDDs for each slide duration? I am
> asking this because I see a constant increase in memory even when no logs
> are received. If not checkpoint, is there any alternative that you would
> suggest?
>
> On Tue, Jan 20, 2015 at 7:08 PM, Shao, Saisai <saisai.s...@intel.com> wrote:
>
>> Hi,
>>
>> Since you have such a large window (24 hours), the memory increase is
>> expected, because the window operation will cache the RDDs within this
>> window in memory. So for your requirement, memory should be enough to
>> hold the data of 24 hours.
>>
>> I don't think checkpointing in Spark Streaming can alleviate this
>> problem, because checkpoints are mainly for fault tolerance.
>>
>> Thanks
>> Jerry
>>
>> From: balu.naren [mailto:balu.na...@gmail.com]
>> Sent: Tuesday, January 20, 2015 7:17 PM
>> To: user@spark.apache.org
>> Subject: spark streaming with checkpoint
>>
>> I am a beginner to Spark Streaming, so I have a basic doubt regarding
>> checkpoints. My use case is to calculate the number of unique users by
>> day. I am using reduce by key and window for this, where my window
>> duration is 24 hours and the slide duration is 5 minutes. I write the
>> processed record to MongoDB, currently replacing the existing record
>> each time. But I see that memory slowly increases over time and kills
>> the process after about 1.5 hours (on an AWS small instance). The DB
>> write after the restart clears all the old data. So I understand
>> checkpointing is the solution for this. But my doubts are:
>>
>> - What should my checkpoint duration be? The documentation suggests
>> 5-10 times the slide duration, but I need the data of the entire day,
>> so is it OK to keep it at 24 hours?
>> - Where ideally should the checkpoint be: when I first receive the
>> stream, just before the window operation, or after the data reduction
>> has taken place?
>>
>> Appreciate your help.
>> Thank you
>> ------------------------------
>> View this message in context: spark streaming with checkpoint
>> <http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
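For reference, a minimal sketch of the kind of windowed pipeline described in the question above (assumptions, not from the thread: a socket text stream with one user ID per line; the checkpoint path, host, and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object DailyUniqueUsers {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("daily-unique-users")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Needed for stateful operators and driver recovery; it does not
    // free the memory held by the window itself.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

    // One user ID per line.
    val userIds = ssc.socketTextStream("localhost", 9999)

    // 24-hour window sliding every 5 minutes: Spark keeps every batch
    // inside the window, which is why heap usage grows with window size.
    val uniquePerWindow = userIds
      .map(id => (id, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(24 * 60), Minutes(5))
      .count() // one element per distinct user remains after the reduce

    uniquePerWindow.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

As Jerry points out above, the checkpoint directory here is for recovery and stateful operators; it does not shrink the memory needed to hold the 24-hour window itself.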