FYI re WAL on S3 http://search-hadoop.com/m/q3RTtFMpd41A7TnH/WAL+S3&subj=WAL+on+S3
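For anyone skimming the thread: the write-ahead log discussed in that link is the Spark Streaming receiver WAL, which in 1.4 is turned on via configuration roughly like the following. This is only a sketch; the checkpoint directory itself is set in application code (e.g. via StreamingContext.checkpoint), and the S3 reliability caveat raised below still applies.

```properties
# Enable the receiver write-ahead log so received blocks are persisted
# to the checkpoint directory before being acknowledged (Spark 1.4+)
spark.streaming.receiver.writeAheadLog.enable  true
```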
On 18 September 2015 at 13:32, Alan Dipert <a...@dipert.org> wrote:

> Hello,
>
> Thanks all for considering our problem. We are doing transformations in
> Spark Streaming. We have also since learned that WAL to S3 on 1.4 is "not
> reliable" [1].
>
> We are just going to wait for EMR to support 1.5, and hopefully this won't
> be a problem anymore [2].
>
> Alan
>
> 1. https://mail-archives.apache.org/mod_mbox/spark-user/201508.mbox/%3CCA+AHuKkH9r0BwQMgQjDG+j=qdcqzpow1rw1u4d0nrcgmq5x...@mail.gmail.com%3E
> 2. https://issues.apache.org/jira/browse/SPARK-9215
>
> On Fri, Sep 18, 2015 at 4:23 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>
>> Are you doing actual transformations / aggregation in Spark Streaming, or
>> just using it to bulk-write to S3?
>>
>> If the latter, then you could just use your AWS Lambda function to read
>> directly from the Kinesis stream. If the former, then perhaps either look
>> into the WAL option that Aniket mentioned, or perhaps you could write the
>> processed RDD back to Kinesis, and have the Lambda function read the
>> Kinesis stream and write to Redshift?
>>
>> On Thu, Sep 17, 2015 at 5:48 PM, Alan Dipert <a...@dipert.org> wrote:
>>
>>> Hello,
>>> We are using Spark Streaming 1.4.1 on AWS EMR to process records from
>>> Kinesis. Our Spark program saves RDDs to S3, after which the records are
>>> picked up by a Lambda function that loads them into Redshift. It is
>>> important to us that no data is lost during processing.
>>>
>>> We have set our Kinesis checkpoint interval to 15 minutes, which is also
>>> our window size.
>>>
>>> Unfortunately, checkpointing happens after receiving data from Kinesis,
>>> not after we have successfully written to S3. If batches back up in Spark
>>> and the cluster is terminated, whatever data was in memory will be lost,
>>> because it was checkpointed but not actually saved to S3.
>>>
>>> We are considering forking and modifying the kinesis-asl library with
>>> changes that would allow us to perform the checkpoint manually and at the
>>> right time. We'd rather not do this.
>>>
>>> Are we overlooking an easier way to deal with this problem? Thank you
>>> in advance for your insight!
>>>
>>> Alan
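On Nick's suggestion of having the Lambda function consume the Kinesis stream directly: the decode step inside such a handler might look like the sketch below. The event shape follows the standard Kinesis-to-Lambda record format; the payload field names (id, ts, value) and the pipe-delimited output are invented here for illustration, not taken from the thread.

```python
import base64
import json


def records_to_copy_lines(event):
    """Decode Kinesis records from a Lambda invocation event into
    pipe-delimited lines suitable for a Redshift COPY.

    The field names (id, ts, value) are hypothetical.
    """
    lines = []
    for rec in event["Records"]:
        # Kinesis payloads arrive base64-encoded in the Lambda event.
        payload = json.loads(base64.b64decode(rec["kinesis"]["data"]))
        lines.append("|".join(str(payload[k]) for k in ("id", "ts", "value")))
    return "\n".join(lines)
```

A handler would then write the returned string to S3 (or stream it straight into Redshift) instead of going through Spark, sidestepping the checkpoint-timing problem for the bulk-write-only case.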