Reliable SQS Receiver for Spark Streaming
I would like to have a Spark Streaming *SQS Receiver* which deletes SQS messages only *after* they have been successfully stored on S3. For this, a *Custom Receiver* can be implemented with the semantics of a Reliable Receiver: the store(multiple-records) call blocks until the given records have been stored and replicated inside Spark (https://spark.apache.org/docs/latest/streaming-custom-receivers.html#receiver-reliability). If *write-ahead logs* are enabled, all data received from a receiver is written into a write-ahead log in the configured checkpoint directory (https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing), and the checkpoint directory can be pointed at S3.

After the blocking store(multiple-records) call finishes, are the records already stored in the checkpoint directory (and thus safe to delete from SQS)?
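For concreteness, here is a minimal sketch of the receiver I have in mind. The AWS SDK usage and all names are illustrative and untested, not a working implementation:

    import scala.collection.JavaConverters._
    import scala.collection.mutable.ArrayBuffer

    import com.amazonaws.services.sqs.AmazonSQSClient
    import com.amazonaws.services.sqs.model.{DeleteMessageBatchRequestEntry, ReceiveMessageRequest}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Illustrative only: credentials, batching and error handling simplified.
    class SqsReceiver(queueUrl: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

      def onStart(): Unit = {
        new Thread("SQS Receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = { } // the receive loop exits via isStopped()

      private def receive(): Unit = {
        val sqs = new AmazonSQSClient() // default credentials chain
        try {
          while (!isStopped()) {
            val messages = sqs.receiveMessage(
              new ReceiveMessageRequest(queueUrl)
                .withMaxNumberOfMessages(10)
                .withWaitTimeSeconds(20)).getMessages.asScala
            if (messages.nonEmpty) {
              // Blocks until Spark has stored (and, with the WAL enabled,
              // logged) the records -- the reliable-receiver semantics.
              store(ArrayBuffer(messages.map(_.getBody): _*))
              // Only now would it be safe to delete the messages from SQS.
              val entries = messages.zipWithIndex.map { case (m, i) =>
                new DeleteMessageBatchRequestEntry(i.toString, m.getReceiptHandle)
              }
              sqs.deleteMessageBatch(queueUrl, entries.asJava)
            }
          }
        } catch {
          case e: Exception => restart("Error receiving from SQS", e)
        }
      }
    }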
Re: Reliable SQS Receiver for Spark Streaming
Yes, if you have enabled the WAL and checkpointing, then after the store you can simply delete the SQS messages from your receiver.
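For reference, enabling both looks roughly like this (the app name, batch interval and S3 path are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("sqs-receiver")
      // Write all received data to a write-ahead log before acknowledging.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // The WAL lives under the checkpoint directory, which can point at S3.
    ssc.checkpoint("s3n://my-bucket/spark-checkpoints") // placeholder path

Note that with the WAL enabled, the docs recommend disabling in-memory replication of received data (e.g. StorageLevel.MEMORY_AND_DISK_SER rather than MEMORY_AND_DISK_SER_2), since the log already provides fault tolerance.

Thanks
Best Regards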
Re: Reliable SQS Receiver for Spark Streaming
Thanks Akhil! I just looked it up in the code as well. The call chain is:

Receiver.store(ArrayBuffer[T], ...)
  -> ReceiverSupervisorImpl.pushArrayBuffer(ArrayBuffer[T], ...)
  -> ReceiverSupervisorImpl.pushAndReportBlock(...)
  -> WriteAheadLogBasedBlockHandler.storeBlock(...)

This implementation stores the block into the block manager as well as a write-ahead log. It does this in parallel, using Scala Futures, and returns only after the block has been stored in both places.

https://www.codatlas.com/github.com/apache/spark/master/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceivedBlockHandler.scala?keyword=WriteAheadLogBasedBlockHandler&line=160
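In other words, the handler follows this pattern. A simplified sketch (not the actual Spark source; names abbreviated and write bodies elided):

    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration._

    // Kick off both writes in parallel and block until BOTH finish.
    def storeBlock(block: Array[Byte])(implicit ec: ExecutionContext): Unit = {
      val inBlockManager  = Future { /* blockManager.putBytes(block, ...) */ }
      val inWriteAheadLog = Future { /* writeAheadLog.write(block, ...) */ }
      // zip combines the two futures; Await.result returns only after both
      // writes have completed (or throws if either fails or times out).
      Await.result(inBlockManager.zip(inWriteAheadLog), 30.seconds)
    }

So store() returning really does mean the records are in the write-ahead log, and the SQS messages can then be deleted.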