[ 
https://issues.apache.org/jira/browse/SPARK-4964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293129#comment-14293129
 ] 

Tathagata Das commented on SPARK-4964:
--------------------------------------

I am renaming this JIRA to "Native Kafka Support" because there are two 
problems that we are trying to solve, both of which are addressed by the associated PR.

> Exactly-once semantics for Kafka
> --------------------------------
>
>                 Key: SPARK-4964
>                 URL: https://issues.apache.org/jira/browse/SPARK-4964
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Cody Koeninger
>
> for background, see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html
>
> Requirements:
> - allow client code to implement exactly-once end-to-end semantics for Kafka 
> messages, in cases where their output storage is either idempotent or 
> transactional
> - allow client code access to Kafka offsets, rather than automatically 
> committing them
> - do not assume Zookeeper as a repository for offsets (for the transactional 
> case, offsets need to be stored in the same store as the data)
> - allow failure recovery without lost or duplicated messages, even in cases 
> where a checkpoint cannot be restored (for instance, because code must be 
> updated)
>
> Design:
> The basic idea is to make an RDD where each partition corresponds to a given 
> Kafka topic, partition, starting offset, and ending offset.  That allows for 
> deterministic replay of data from Kafka (as long as there is enough log 
> retention).
> Client code is responsible for committing offsets, either transactionally to 
> the same store that data is being written to, or in the case of idempotent 
> data, after data has been written.
> A PR with a sample implementation for both the batch and dstream cases is 
> forthcoming.
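
The quoted design could be sketched roughly as below. This is an illustrative stand-in, not the API from the forthcoming PR: `OffsetRange`, `Store`, the key format, and the in-memory "transaction" are all hypothetical names invented here to show how a partition-to-offset-range mapping plus client-committed offsets can make replay idempotent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExactlyOnceSketch {
    // Hypothetical stand-in for the per-partition unit of the design:
    // each RDD partition covers (topic, partition, fromOffset, untilOffset).
    record OffsetRange(String topic, int partition, long fromOffset, long untilOffset) {}

    // In-memory stand-in for a transactional store that holds both the
    // output data and the committed offsets, as the requirements describe.
    static class Store {
        final List<String> data = new ArrayList<>();
        final Map<String, Long> offsets = new HashMap<>();

        // Write the records and the ending offset in one "transaction", so a
        // deterministic replay after failure can detect already-committed
        // ranges and skip them instead of duplicating output.
        synchronized void commit(OffsetRange range, List<String> records) {
            String key = range.topic() + "-" + range.partition();
            long committed = offsets.getOrDefault(key, 0L);
            if (committed >= range.untilOffset()) {
                return; // range already committed: replay is a no-op
            }
            data.addAll(records);
            offsets.put(key, range.untilOffset());
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        OffsetRange range = new OffsetRange("events", 0, 0L, 3L);
        List<String> records = List.of("a", "b", "c");

        store.commit(range, records); // first delivery
        store.commit(range, records); // replay of the same range after a "failure"

        System.out.println(String.join(",", store.data));  // a,b,c (no duplicates)
        System.out.println(store.offsets.get("events-0")); // 3
    }
}
```

Because the offset range fully determines which records a partition covers, replaying the same range against the same store is safe, which is the property the issue's "allow failure recovery without lost or duplicated messages" requirement depends on.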



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
