You could perhaps set up a write-ahead log (WAL) that writes to S3. A new cluster should then pick up the records that weren't processed before the previous cluster terminated.
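A rough sketch of what enabling this might look like (assuming Spark 1.4's Scala API; the application name and S3 checkpoint path are placeholders, and the WAL property `spark.streaming.receiver.writeAheadLog.enable` must be set before the context is created):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("KinesisToS3") // placeholder name
  // Persist received blocks to the write-ahead log before they are processed,
  // so a restarted driver can replay blocks that never reached S3
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(900)) // 15-minute batches
// Checkpoint metadata (and the WAL) to a fault-tolerant store;
// "s3://my-bucket/checkpoints" is a hypothetical path
ssc.checkpoint("s3://my-bucket/checkpoints")
```

Note the recovery semantics are at-least-once: replayed blocks may be written to S3 twice, so the downstream Lambda/Redshift load would need to tolerate duplicates.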
Thanks,
Aniket

On Thu, Sep 17, 2015, 9:19 PM Alan Dipert <a...@dipert.org> wrote:
> Hello,
> We are using Spark Streaming 1.4.1 in AWS EMR to process records from
> Kinesis. Our Spark program saves RDDs to S3, after which the records are
> picked up by a Lambda function that loads them into Redshift. That no data
> is lost during processing is important to us.
>
> We have set our Kinesis checkpoint interval to 15 minutes, which is also
> our window size.
>
> Unfortunately, checkpointing happens after receiving data from Kinesis,
> not after we have successfully written to S3. If batches back up in Spark,
> and the cluster is terminated, whatever data was in-memory will be lost
> because it was checkpointed but not actually saved to S3.
>
> We are considering forking and modifying the kinesis-asl library with
> changes that would allow us to perform the checkpoint manually and at the
> right time. We'd rather not do this.
>
> Are we overlooking an easier way to deal with this problem? Thank you in
> advance for your insight!
>
> Alan
>