[ https://issues.apache.org/jira/browse/SPARK-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697175#comment-14697175 ]
Dan Dutrow commented on SPARK-6249:
-----------------------------------

There are valid arguments in the presentation and blog for why the developer needs control of the offsets in order to guarantee exactly-once semantics. A transactional data store that commits offsets and program results at the same time is the only way to be confident in the reliability and consistency of the data. However, for an application that can stand to lose some, but not a lot of, data (like a block here or there), a default "remember" implementation with fewer guarantees would be nice to have.

> Get Kafka offsets from consumer group in ZK when using direct stream
> --------------------------------------------------------------------
>
>                 Key: SPARK-6249
>                 URL: https://issues.apache.org/jira/browse/SPARK-6249
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Tathagata Das
>
> This is the proposal.
> The simpler direct API (the one that does not take explicit offsets) can be
> modified to also pick up the initial offset from ZK if group.id is specified.
> This is very similar to how we find the latest or earliest offset in that
> API, except that instead of the latest/earliest offset of the topic we want to
> find the offset from the consumer group. The group offsets in ZK are not used
> at all for any further processing and restarting, so the exactly-once
> semantics are not broken.
> The use case where this is useful is simplified code upgrade. If the user
> wants to upgrade the code, he/she can stop the context gracefully, which will
> ensure the ZK consumer group offset is updated with the last offsets
> processed. Then the new code, when started (not restarted from checkpoint),
> can pick up the consumer group offset from ZK and continue where the previous
> code left off.
> Without the functionality of picking up consumer group offsets at startup
> (that is, currently), the only way to do this is for users to save the
> offsets somewhere (file, database, etc.) and manage the offsets themselves.
> I just want to simplify this process.
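
For illustration, the starting-offset choice described in the proposal could be sketched roughly as below. This is not Spark's implementation: `resolve_starting_offsets`, its parameters, and the plain dictionaries standing in for the ZK and broker lookups are all hypothetical names. The `"largest"`/`"smallest"` values mirror the old Kafka consumer's `auto.offset.reset` setting, and the range check on a committed offset is an added assumption, not something the proposal specifies.

```python
from typing import Dict, Iterable, Optional, Tuple

TopicPartition = Tuple[str, int]  # (topic, partition); hypothetical alias


def resolve_starting_offsets(
    partitions: Iterable[TopicPartition],
    earliest: Dict[TopicPartition, int],
    latest: Dict[TopicPartition, int],
    group_offsets: Optional[Dict[TopicPartition, int]] = None,
    reset: str = "largest",
) -> Dict[TopicPartition, int]:
    """Pick one starting offset per partition.

    `group_offsets` stands in for whatever a ZK lookup for the configured
    group.id returned; None means no group.id was given. Partitions without
    a usable committed offset fall back to earliest/latest, as the direct
    API already does today.
    """
    fallback = latest if reset == "largest" else earliest
    committed_by_tp = group_offsets or {}
    start: Dict[TopicPartition, int] = {}
    for tp in partitions:
        committed = committed_by_tp.get(tp)
        # Assumption: a committed offset outside the retained range is
        # treated as missing rather than as an error.
        if committed is not None and earliest[tp] <= committed <= latest[tp]:
            start[tp] = committed
        else:
            start[tp] = fallback[tp]
    return start
```

The offsets chosen here would only seed the first batch. As the proposal notes, the group offsets are not consulted again for processing or restarts, so an application that tracks its own offsets keeps the same guarantees.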