You are correct. The earlier Kinesis receiver (as of Spark 1.4) was not
saving checkpoints correctly and was in general not reliable (even with the
WAL enabled). We have improved this in Spark 1.5 with an updated Kinesis
receiver that keeps track of the Kinesis sequence numbers as part of Spark
Streaming's DStream checkpointing; the KCL checkpoint is updated only after
the sequence numbers have been written to the DStream checkpoints. This
allows a recovered streaming program (that is, one restarted from a
checkpoint) to recover the sequence numbers from the checkpoint information
and reprocess the corresponding records (those which had not been
successfully processed). This gives better guarantees.
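
A rough sketch of what the driver side looks like (not the receiver
internals; the app name, stream name, region, checkpoint directory, batch
interval and the ElasticSearch write are placeholders you would substitute):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils

    object KinesisToEs {
      val checkpointDir = "s3://my-bucket/spark-checkpoints"  // placeholder

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("kinesis-to-es")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)

        // With the Spark 1.5 receiver, the Kinesis sequence numbers are stored
        // as part of the DStream checkpoint, and the KCL checkpoint is advanced
        // only after they have been written.
        val stream = KinesisUtils.createStream(
          ssc, "my-kcl-app", "my-stream",
          "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
          InitialPositionInStream.TRIM_HORIZON, Seconds(10),
          StorageLevel.MEMORY_AND_DISK_2)

        stream.map(bytes => new String(bytes, "UTF-8")).foreachRDD { rdd =>
          // This is where records would be written to ElasticSearch;
          // a count stands in for that side effect here.
          println(s"processed ${rdd.count()} records")
        }
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart, getOrCreate rebuilds the context (and the stored sequence
        // numbers) from the checkpoint instead of calling createContext again.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }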

If you are interested in learning more, see the JIRA:
https://issues.apache.org/jira/browse/SPARK-9215

Related to this, for your scenario you should set a receiver rate limit
(spark.streaming.receiver.maxRate) to prevent Spark from receiving data
faster than it can process it.
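
For example (a sketch; 400 records/second per receiver is only an
illustrative value and should be set to whatever ElasticSearch can sustain):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kinesis-to-es")
      // Cap each receiver at roughly the rate the downstream sink can absorb.
      .set("spark.streaming.receiver.maxRate", "400")
      // Spark 1.5 can also adjust the ingestion rate automatically based on
      // scheduling delay and processing times:
      // .set("spark.streaming.backpressure.enabled", "true")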

On Mon, Aug 10, 2015 at 4:40 PM, Phil Kallos <phil.kal...@gmail.com> wrote:

> Hi! Sorry if this is a repost.
>
> I'm using Spark + Kinesis ASL to process and persist stream data to
> ElasticSearch. For the most part it works nicely.
>
> There is a subtle issue I'm running into about how failures are handled.
>
> For example's sake, let's say I am processing a Kinesis stream that
> produces 400 records per second, continuously.
>
> Kinesis provides a 24hr buffer of data, and I'm setting my Kinesis DStream
> consumer to use "TRIM_HORIZON", to mean "go as far back as possible and
> start processing the stream data as quickly as possible, until you catch up
> to the tip of the stream".
>
> This means that for some period of time, Spark will suck in data from
> Kinesis as quickly as it can, let's say at 5000 records per second.
>
> In my specific case, ElasticSearch can gracefully handle 400 writes per
> second, but is NOT happy to process 5000 writes per second. Let's say it
> only handles 2000 wps. This means that the processing time will exceed the
> batch time, scheduling delay in the stream will rise consistently and
> batches of data will get "backlogged" for some period of time.
>
> In normal circumstances, this is fine. When the Spark consumers catch up
> to "real-time", the data input rate slows to 400rps and the backlogged
> batches eventually get flushed to ES. The system stabilizes.
>
> However! It appears to me that the Kinesis consumer actively submits
> checkpoints, even though the records may not have been processed yet (since
> they are backlogged). If for some reason there is processing delay and the
> Spark process crashes, the checkpoint will have advanced too far. If I
> resume the Spark Streaming process, there is essentially a gap in my
> ElasticSearch data.
>
> In principle I understand the reason for this, but is there a way to
> adjust this behavior? Or is there another way to handle this specific
> problem? Ideally I would be able to configure the process to submit
> Kinesis checkpoints only after my data is successfully written to ES.
>
> Thanks,
> Phil
>
>
