Thanks all, it would be great to have this feature soon. Do you know the release plan for 1.6?
In addition to this, I still have a checkpoint performance problem. My code is as simple as this:

    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
    jssc.checkpoint("spark-data/checkpoint");
    JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(...);
    JavaPairDStream<String, List<Double>> stats = messages.mapToPair(parseJson)
        .reduceByKey(REDUCE_STATS)
        .updateStateByKey(RUNNING_STATS);
    stats.print();

I need to maintain about 800k keys, and the stats here are only occurrence counts per key. While running, the cache dir stays very small (about 50M). My questions are:

1/ A regular micro-batch takes about 800ms to finish, but every 10 seconds, when the data checkpoint runs, a micro-batch of the same size takes 5 seconds. Why is it so slow? What kind of work does the checkpoint do, and why does it keep increasing?

2/ When I change the data checkpoint interval, e.g. with:

    stats.checkpoint(Durations.seconds(100)); // changed to 100, default is 10

the checkpoint time keeps increasing significantly: the first checkpoint takes 10s, the second 30s, the third 70s, and so on :) Why does it grow so much when the checkpoint interval is increased? The default interval seems to work more stably.

(A minimal, self-contained version of the job is at the bottom of this mail.)

On Nov 4, 2015 9:08 PM, "Adrian Tanase" <atan...@adobe.com> wrote:

> Nice! Thanks for sharing, I wasn’t aware of the new API.
>
> Left some comments on the JIRA and design doc.
>
> -adrian
>
> From: Shixiong Zhu
> Date: Tuesday, November 3, 2015 at 3:32 AM
> To: Thúy Hằng Lê
> Cc: Adrian Tanase, "user@spark.apache.org"
> Subject: Re: Spark Streaming data checkpoint performance
>
> "trackStateByKey" is about to be added in 1.6 to resolve the performance
> issue of "updateStateByKey". You can take a look at
> https://issues.apache.org/jira/browse/SPARK-2629 and
> https://github.com/apache/spark/pull/9256
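---

For reference, here is a minimal, self-contained sketch of the job above (Spark 1.5 Java API). The Kafka settings, topic name, and parsing are placeholders; my real job parses JSON into List<Double> stats, but here the state is simplified to a plain Long occurrence count so the snippet compiles on its own:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    import com.google.common.base.Optional;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import scala.Tuple2;

    public class StateCountSketch {
      public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("state-count-sketch");
        JavaStreamingContext jssc =
            new JavaStreamingContext(sparkConf, Durations.seconds(2));
        // Metadata + state checkpoints go here; updateStateByKey requires it.
        jssc.checkpoint("spark-data/checkpoint");

        // Placeholder Kafka settings.
        Map<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", "localhost:9092");
        Set<String> topics = new HashSet<String>(Arrays.asList("events"));

        JavaPairInputDStream<String, String> messages =
            KafkaUtils.createDirectStream(jssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class, kafkaParams, topics);

        // Stand-in for the real JSON parsing: emit (key, 1) per message.
        JavaPairDStream<String, Long> counts = messages
            .mapToPair(new PairFunction<Tuple2<String, String>, String, Long>() {
              public Tuple2<String, Long> call(Tuple2<String, String> msg) {
                return new Tuple2<String, Long>(msg._1(), 1L);
              }
            })
            // Pre-aggregate within the batch before touching the state.
            .reduceByKey(new Function2<Long, Long, Long>() {
              public Long call(Long a, Long b) { return a + b; }
            })
            // Running occurrence count per key, carried across batches.
            .updateStateByKey(
                new Function2<List<Long>, Optional<Long>, Optional<Long>>() {
                  public Optional<Long> call(List<Long> newValues,
                                             Optional<Long> state) {
                    long sum = state.or(0L);
                    for (Long v : newValues) {
                      sum += v;
                    }
                    return Optional.of(sum);
                  }
                });

        // The data checkpoint interval I've been experimenting with
        // (10s default in my setup vs. the 100s variant from question 2).
        counts.checkpoint(Durations.seconds(10));

        counts.print();
        jssc.start();
        jssc.awaitTermination();
      }
    }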