Re: Spark Streaming data checkpoint performance

2015-11-07 Thread trung kien
> … It took me 5 seconds to finish the same-size micro-batch, so why is the checkpoint so high? What kind of job runs during a checkpoint, and why does it keep increasing?
>
> 2/ When I change the data checkpoint interval …

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Thúy Hằng Lê
… interval works more stable.

On Nov 4, 2015 9:08 PM, "Adrian Tanase" <atan...@adobe.com> wrote:
> Nice! Thanks for sharing, I wasn’t aware of the new API.
>
> Left some comments on the JIRA and design doc.
>
> -adrian
>
> From: S…

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
> … checkpoint interval?
>
> It seems that the default interval works more stable.
>
> On Nov 4, 2015 9:08 PM, "Adrian Tanase" <atan...@adobe.com> wrote:
>> Nice! Thanks for sharing, I wasn’t aware of the new API.
>>
>> Left some comments on the JIRA and design doc. …

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
> stats.checkpoint(Durations.seconds(100)); // changed to 100; the default is 10
>
> The checkpoint time keeps increasing significantly: the first checkpoint is 10s, the second is 30s, the third is 70s ... and it keeps increasing :)
> Why …

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Thúy Hằng Lê
> … the second is 30s, the third is 70s ... and it keeps increasing :)
> Why is it so high when increasing the checkpoint interval?
>
> It seems that the default interval works more stable.
>
> On Nov 4, 2015 9:08 PM, "Adrian Tanase" <atan...@adobe.com> wrote: …

Re: Spark Streaming data checkpoint performance

2015-11-05 Thread Thúy Hằng Lê
> Cc: Adrian Tanase, "user@spark.apache.org"
> Subject: Re: Spark Streaming data checkpoint performance
>
> "trackStateByKey" is about to be added in 1.6 to resolve the performance issue of "updateStateByKey". You can take a look at https://issues.apache.org/jira/browse/SPARK-2629 and https://github.com/apache/spark/pull/9256

Re: Spark Streaming data checkpoint performance

2015-11-04 Thread Adrian Tanase
Nice! Thanks for sharing, I wasn’t aware of the new API.

Left some comments on the JIRA and design doc.

-adrian

From: Shixiong Zhu
Date: Tuesday, November 3, 2015 at 3:32 AM
To: Thúy Hằng Lê
Cc: Adrian Tanase, "user@spark.apache.org"
Subject: Re: …

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Adrian Tanase
You are correct: the default checkpointing interval is 10 seconds or your batch size, whichever is bigger. You can change it by calling .checkpoint(x) on your resulting DStream. For the rest, you are probably keeping an “all time” word count that grows unbounded if you never remove words from …
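Adrian’s rule for the default interval can be expressed as a tiny helper. This is a sketch for illustration only; `default_checkpoint_interval` is a made-up name, not a Spark API:

```python
def default_checkpoint_interval(batch_interval_s, minimum_s=10):
    """The default DStream checkpoint interval per the rule above:
    10 seconds or the batch interval, whichever is bigger."""
    return max(batch_interval_s, minimum_s)

# A 2-second batch still checkpoints every 10s; a 30-second batch, every 30s.
print(default_checkpoint_interval(2))
print(default_checkpoint_interval(30))
```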

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Thúy Hằng Lê
Hi Adrian,

Thanks for the information. However, your two suggestions don't really work for me. Accuracy is the most important aspect of my application, so keeping only 15 minutes of window stats or pruning some of the keys is impossible for my application. I can change the checkpoint interval …
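For readers following along, the key-pruning suggestion being discussed can be sketched as a pure updateStateByKey-style update function (plain Python, runnable without Spark; the state shape and the 3-batch idle threshold are assumptions for illustration). Returning None for a key removes it from the state, which keeps the checkpointed state from growing without bound:

```python
MAX_IDLE_BATCHES = 3  # illustrative threshold, not from the thread

def update_word_count(new_values, state):
    """Update function in the (new values, previous state) -> new state
    shape used by updateStateByKey. State here is (count, idle_batches);
    returning None drops the key from the state."""
    count, idle = state if state is not None else (0, 0)
    if not new_values:
        idle += 1
        # Prune keys that have received no data for several batches.
        return None if idle >= MAX_IDLE_BATCHES else (count, idle)
    return (count + sum(new_values), 0)
```

In PySpark this would be passed as `pairs.updateStateByKey(update_word_count)`; keys for which the function returns None no longer appear in later batches or checkpoints.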

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Shixiong Zhu
"trackStateByKey" is about to be added in 1.6 to resolve the performance issue of "updateStateByKey". You can take a look at https://issues.apache.org/jira/browse/SPARK-2629 and https://github.com/apache/spark/pull/9256
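As background on why updateStateByKey slows down as state grows, and what the trackStateByKey design avoids: updateStateByKey invokes the update function for every key in the state on every batch, while trackStateByKey only touches keys that received new data. A small simulation of the two access patterns (plain Python, not Spark code; function names are illustrative):

```python
def full_scan_update(state, batch):
    """updateStateByKey-style: every known key is visited each batch,
    so per-batch work grows with total state size."""
    return {key: state.get(key, 0) + sum(batch.get(key, []))
            for key in set(state) | set(batch)}

def delta_update(state, batch):
    """trackStateByKey-style: only keys with new data are visited,
    so per-batch work grows with batch size instead."""
    new_state = dict(state)
    for key, values in batch.items():
        new_state[key] = new_state.get(key, 0) + sum(values)
    return new_state
```

Both produce the same counts; the difference is how much state must be read and rewritten per batch, which is the performance issue SPARK-2629 targets.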