Hi,

I actually ran into the same problem, although our endpoint is not ElasticSearch. When the Spark job dies, we lose some data because the Kinesis checkpoint has already advanced beyond the last point that Spark has processed.

Currently, our workaround is to use Spark's checkpointing mechanism together with the write-ahead log (WAL):

https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
https://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications
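
Roughly, the driver side looks like this (just a sketch; the checkpoint directory, app name and batch interval are placeholders for whatever you use):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder path; any fault-tolerant storage (HDFS, S3, ...) works
val checkpointDir = "s3://my-bucket/spark-checkpoints"

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("kinesis-to-es")
    // WAL: received Kinesis records are persisted to the checkpoint directory
    // before being acknowledged, so they survive a driver/receiver crash
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... create the Kinesis DStream and wire up the output here ...
  ssc
}

// On restart, recover the context (and any unprocessed blocks) from the checkpoint
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()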

Using checkpointing comes with some disadvantages, e.g. the application code cannot be upgraded without discarding the checkpoint, etc.

I believe there is some work in progress to fix this properly, along the lines of the Kafka direct API.
Not sure if this is it: https://issues.apache.org/jira/browse/SPARK-9215
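
For comparison, the Kafka direct API has the driver compute and track offsets itself (saved in the Spark checkpoint) instead of relying on a separately-submitted consumer checkpoint, which is why it avoids this kind of data loss. A minimal sketch, with placeholder broker and topic names:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Offset ranges for each batch are decided by the driver and recovered from the
// Spark checkpoint on restart, so processing resumes exactly where it left off
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map("metadata.broker.list" -> "broker1:9092"),  // placeholder broker
  Set("my-topic"))                                // placeholder topic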

Thanks,
Patanachai


On 08/06/2015 12:08 PM, phibit wrote:
Hi! I'm using Spark + Kinesis ASL to process and persist stream data to
ElasticSearch. For the most part it works nicely.

There is a subtle issue I'm running into about how failures are handled.

For example's sake, let's say I am processing a Kinesis stream that produces
400 records per second, continuously.

Kinesis provides a 24hr buffer of data, and I'm setting my Kinesis DStream
consumer to use "TRIM_HORIZON", which means "go as far back as possible and
start processing the stream data as quickly as possible, until you catch up
to the tip of the stream".
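
For concreteness, given a StreamingContext ssc, the consumer is created roughly like this (a sketch; the app name, stream name, region and 10-second interval are just placeholders):

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils

val kinesisStream = KinesisUtils.createStream(
  ssc,
  "my-kinesis-app",                          // KCL application name (backs the DynamoDB checkpoint table)
  "my-stream",                               // Kinesis stream name
  "https://kinesis.us-east-1.amazonaws.com", // endpoint URL
  "us-east-1",                               // region
  InitialPositionInStream.TRIM_HORIZON,      // start from the oldest data still in the retention window
  Seconds(10),                               // how often the KCL checkpoints to DynamoDB (time-based)
  StorageLevel.MEMORY_AND_DISK_2)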

This means that for some period of time, Spark will suck in data from
Kinesis as quickly as it can, let's say at 5000 records per second.

In my specific case, ElasticSearch can gracefully handle 400 writes per
second, but is NOT happy processing 5000 writes per second. Let's say it
only handles 2000 wps. This means the processing time will exceed the
batch interval, the scheduling delay in the stream will rise steadily, and
batches of data will get "backlogged" for some period of time.

In normal circumstances, this is fine. When the Spark consumers catch up to
"real-time", the data input rate slows to 400rps and the backlogged batches
eventually get flushed to ES. The system stabilizes.
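
(For what it's worth, the catch-up ingestion rate can be capped with Spark's receiver rate limit; the value below is only illustrative. That slows the backlog growth but doesn't change when checkpoints are submitted.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kinesis-to-es")
  // Max records per second per receiver; an illustrative cap near what ES can absorb
  .set("spark.streaming.receiver.maxRate", "2000")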

However! It appears to me that the Kinesis consumer actively submits
checkpoints, even though the corresponding records may not have been
processed yet (since they are backlogged). If there is a processing delay
and the Spark process crashes, the checkpoint will have advanced too far.
If I resume the Spark Streaming process, there is essentially a gap in my
ElasticSearch data.
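
In other words, the write path looks something like this (a sketch; I'm assuming the elasticsearch-hadoop connector, and parseRecord plus the index/type name are stand-ins), while the Kinesis receiver checkpoints its position to DynamoDB on its own timer, independent of whether this output has run:

import org.elasticsearch.spark.rdd.EsSpark

// Output operation: runs once per batch, possibly long after the records were received
kinesisStream.foreachRDD { rdd =>
  val docs = rdd.map(bytes => parseRecord(bytes))  // parseRecord is a hypothetical helper
  EsSpark.saveToEs(docs, "my-index/my-type")       // placeholder index/type
}
// Meanwhile the receiver advances the DynamoDB checkpoint every checkpointInterval,
// regardless of whether the batches above have completed.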

In principle I understand the reason for this, but is there a way to adjust
this behavior? Or is there another way to handle this specific problem?
Ideally I would be able to configure the process to submit Kinesis
checkpoints only after my data is successfully written to ES.

Thanks,
Phil








--
Patanachai


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
