Re: Spark Streaming data checkpoint performance

2015-11-07 Thread trung kien
…It took me 5 seconds to finish the same-size micro-batch, so why is the checkpoint so expensive? What kind of job runs during a checkpoint? Why does it keep increasing? 2/ When I change the data checkpoint interval

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Thúy Hằng Lê
…it seems that the default interval works more stable.

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
…why is it so high when increasing the checkpoint interval? It seems that the default interval works more stable.

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
stats.checkpoint(Durations.seconds(100)); // changed to 100; the default is 10
The checkpoint time keeps increasing significantly: the first checkpoint takes 10s, the second 30s, the third 70s ... and it keeps increasing :) Why

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Thúy Hằng Lê
…the second is 30s, the third 70s ... and it keeps increasing :) Why is it so high when increasing the checkpoint interval? It seems that the default interval works more stable.

Re: Spark Streaming data checkpoint performance

2015-11-04 Thread Adrian Tanase
Nice! Thanks for sharing, I wasn’t aware of the new API. Left some comments on the JIRA and design doc. -adrian From: Shixiong Zhu Date: Tuesday, November 3, 2015 at 3:32 AM To: Thúy Hằng Lê Cc: Adrian Tanase, "user@spark.apache.org" Subject: Re:

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Adrian Tanase
You are correct, the default checkpointing interval is 10 seconds or your batch size, whichever is bigger. You can change it by calling .checkpoint(x) on your resulting DStream. For the rest, you are probably keeping an “all time” word count that grows unbounded if you never remove words from
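
A minimal Scala sketch of what Adrian describes (the thread's own snippets use the Java API, but the idea is identical; the source, checkpoint path, and stream names here are hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("wc"), Seconds(5))
    ssc.checkpoint("hdfs:///tmp/checkpoints") // hypothetical checkpoint directory

    // stateful "all time" word count; state grows unbounded unless keys are pruned
    val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))
    val stats = pairs.updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0)))

    // override the default data checkpoint interval on the resulting DStream
    stats.checkpoint(Seconds(50))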

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Thúy Hằng Lê
Hi Adrian, thanks for the information. However, your two suggestions couldn't really work for me. Accuracy is the most important aspect of my application, so keeping only 15-minute window stats or pruning out some of the keys is impossible for my application. I can change the checkpoint interval

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Shixiong Zhu
"trackStateByKey" is about to be added in 1.6 to resolve the performance issue of "updateStateByKey". You can take a look at https://issues.apache.org/jira/browse/SPARK-2629 and https://github.com/apache/spark/pull/9256

Re: spark streaming with checkpoint

2015-01-25 Thread Balakrishnan Narendran
Yeah, use streaming to gather the incoming logs and write them to a log file, then run a Spark job every 5 minutes to process the counts. Got it. Thanks a lot.
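
The split agreed on here can be sketched in a few lines of Scala (the paths and the batch job are assumptions, not from the thread): the stream only lands raw data, and an independent job aggregates it every 5 minutes.

    // streaming side: persist each micro-batch of raw logs, one directory per interval
    logs.saveAsTextFiles("hdfs:///incoming/logs")

    // batch side, run every 5 minutes by an external scheduler (e.g. cron), roughly:
    // sc.textFile("hdfs:///incoming/logs-*").map(parse).reduceByKey(_ + _)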

Re: spark streaming with checkpoint

2015-01-25 Thread Tobias Pfeiffer
Hi, On Tue, Jan 20, 2015 at 8:16 PM, balu.naren balu.na...@gmail.com wrote: I am a beginner to Spark Streaming, so I have a basic doubt regarding checkpoints. My use case is to calculate the number of unique users by day. I am using reduceByKey and window for this, where my window duration is 24 hours
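
A guess at the setup being described, in Scala (the input stream and the slide interval are assumptions):

    import org.apache.spark.streaming.Minutes

    // 24-hour window sliding every 5 minutes, one key per user id
    val daily = userIds.map(u => (u, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(24 * 60), Minutes(5))
    val uniqueUsersPerDay = daily.count() // number of distinct keys in the window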

Re: spark streaming with checkpoint

2015-01-22 Thread Balakrishnan Narendran
Thank you Jerry. Does the window operation create new RDDs for each slide duration? I am asking because I see a constant increase in memory even when no logs are received. If not checkpointing, is there any alternative you would suggest?

RE: spark streaming with checkpoint

2015-01-22 Thread Shao, Saisai
…for you? I think it's better and easier for you to change your implementation rather than rely on Spark to handle this. Thanks, Jerry From: Balakrishnan Narendran Sent: Friday, January 23, 2015 12:19 AM To: Shao, Saisai Cc: user@spark.apache.org Subject: Re: spark

Re: spark streaming with checkpoint

2015-01-22 Thread Jörn Franke
Maybe you are using the wrong approach - try something like HyperLogLog or bitmap structures, as you can find them, for instance, in Redis. They are much smaller.
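
If staying inside Spark is preferred over Redis, RDDs already ship a HyperLogLog++-backed counter, so a sketch along Jörn's lines could be (stream name assumed):

    import org.apache.spark.streaming.Minutes

    // countApproxDistinct keeps a small fixed-size sketch instead of one entry
    // per user; 0.01 requests roughly 1% relative error
    userIds.window(Minutes(24 * 60), Minutes(5)).foreachRDD { rdd =>
      println(s"approx unique users, last 24h: ${rdd.countApproxDistinct(0.01)}")
    }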

RE: spark streaming with checkpoint

2015-01-20 Thread Shao, Saisai
Hi, it seems you have a very large window (24 hours), so the memory increase is expected, because window operations cache the RDDs within the window in memory. So for your requirement, memory should be large enough to hold 24 hours of data. I don't think checkpoint in Spark
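
One knob that follows from Jerry's explanation, offered as a sketch rather than something from the thread: windowed DStreams persist their RDDs for the full window length, and the storage level can be relaxed so the window spills to disk instead of having to fit 24 hours in RAM (stream name assumed).

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Minutes

    // let the 24-hour window spill to disk rather than living purely in memory
    val daily = events.window(Minutes(24 * 60), Minutes(5))
    daily.persist(StorageLevel.MEMORY_AND_DISK_SER)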