Thanks all, it would be great to have this feature soon. Do you know the release plan for 1.6?
In addition to this, I still have a checkpoint performance problem. My code is as simple as this:

    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
    jssc.checkpoint("spark-data/checkpoint");
    JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(...);
    JavaPairDStream<String, List<Double>> stats = messages.mapToPair(parseJson)
        .reduceByKey(REDUCE_STATS)
        .updateStateByKey(RUNNING_STATS);
    stats.print();

I need to maintain about 800k keys, and the stats here are only occurrence counts per key. While running, the cache dir stays very small (about 50M). My questions are:

1/ A regular micro-batch takes about 800ms to finish, but every 10 seconds, when the data checkpoint runs, a micro-batch of the same size takes 5 seconds. Why is it so slow? What kind of work does the checkpoint do, and why does it keep increasing?

2/ When I change the data checkpoint interval, e.g. with:

    stats.checkpoint(Durations.seconds(100)); // changed to 100, default is 10

the checkpoint time keeps increasing significantly: the first checkpoint takes 10s, the second 30s, the third 70s, and so on :) Why does it grow so much when the checkpoint interval is increased? The default interval seems to work more stably.

(A minimal, self-contained version of the job is at the bottom of this mail.)

On Nov 4, 2015 9:08 PM, "Adrian Tanase" <atan...@adobe.com> wrote:

> Nice! Thanks for sharing, I wasn’t aware of the new API.
>
> Left some comments on the JIRA and design doc.
>
> -adrian
>
> From: Shixiong Zhu
> Date: Tuesday, November 3, 2015 at 3:32 AM
> To: Thúy Hằng Lê
> Cc: Adrian Tanase, "user@spark.apache.org"
> Subject: Re: Spark Streaming data checkpoint performance
>
> "trackStateByKey" is about to be added in 1.6 to resolve the performance
> issue of "updateStateByKey". You can take a look at
> https://issues.apache.org/jira/browse/SPARK-2629 and
> https://github.com/apache/spark/pull/9256
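---

For reference, here is a minimal, self-contained sketch of the job above (Spark 1.5 Java API). The Kafka settings, topic name, and parsing are placeholders; my real job parses JSON into List<Double> stats, but here the state is simplified to a plain Long occurrence count so the snippet compiles on its own:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    import com.google.common.base.Optional;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import scala.Tuple2;

    public class StateCountSketch {
      public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("state-count-sketch");
        JavaStreamingContext jssc =
            new JavaStreamingContext(sparkConf, Durations.seconds(2));
        // Metadata + state checkpoints go here; updateStateByKey requires it.
        jssc.checkpoint("spark-data/checkpoint");

        // Placeholder Kafka settings.
        Map<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", "localhost:9092");
        Set<String> topics = new HashSet<String>(Arrays.asList("events"));

        JavaPairInputDStream<String, String> messages =
            KafkaUtils.createDirectStream(jssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class, kafkaParams, topics);

        // Stand-in for the real JSON parsing: emit (key, 1) per message.
        JavaPairDStream<String, Long> counts = messages
            .mapToPair(new PairFunction<Tuple2<String, String>, String, Long>() {
              public Tuple2<String, Long> call(Tuple2<String, String> msg) {
                return new Tuple2<String, Long>(msg._1(), 1L);
              }
            })
            // Pre-aggregate within the batch before touching the state.
            .reduceByKey(new Function2<Long, Long, Long>() {
              public Long call(Long a, Long b) { return a + b; }
            })
            // Running occurrence count per key, carried across batches.
            .updateStateByKey(
                new Function2<List<Long>, Optional<Long>, Optional<Long>>() {
                  public Optional<Long> call(List<Long> newValues,
                                             Optional<Long> state) {
                    long sum = state.or(0L);
                    for (Long v : newValues) {
                      sum += v;
                    }
                    return Optional.of(sum);
                  }
                });

        // The data checkpoint interval I've been experimenting with
        // (10s default in my setup vs. the 100s variant from question 2).
        counts.checkpoint(Durations.seconds(10));

        counts.print();
        jssc.start();
        jssc.awaitTermination();
      }
    }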