[ https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646033#comment-14646033 ]
Dmitry Goldenberg commented on SPARK-9434:
------------------------------------------

Ah, I think this is making sense then :) http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

> Need how-to for resuming direct Kafka streaming consumers where they had left
> off before getting terminated, OR actual support for that mode in the
> Streaming API
> -----------------------------------------------------------------------------
>
>                  Key: SPARK-9434
>                  URL: https://issues.apache.org/jira/browse/SPARK-9434
>              Project: Spark
>           Issue Type: Improvement
>           Components: Documentation, Examples, Streaming
>     Affects Versions: 1.4.1
>             Reporter: Dmitry Goldenberg
>
> We've been getting some mixed information regarding how to make our direct
> streaming consumers resume processing from where they left off in terms of
> the Kafka offsets.
> On the one hand, we're hearing: "If you are restarting the streaming app
> with direct Kafka from the checkpoint information (that is, restarting), then
> the last read offsets are automatically recovered, and the data will start
> processing from that offset. All the N records added in T will stay buffered
> in Kafka." (where T is the interval of time during which the consumer was
> down).
> On the other hand, there are tickets such as SPARK-6249 and SPARK-8833 which
> are marked as "won't fix" and yet seem to ask for the functionality we need,
> with comments like: "I don't want to add more config options with confusing
> semantics around what is being used for the system of record for offsets, I'd
> rather make it easy for people to explicitly do what they need."
> The use case is actually very clear and doesn't ask for confusing semantics.
> An API option to resume reading where you left off, in addition to the
> smallest or greatest auto.offset.reset, would be *very* useful, probably for
> quite a few folks.
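For what it's worth, the checkpoint-based recovery the guide describes hinges on creating the StreamingContext through StreamingContext.getOrCreate. A minimal sketch against the Spark 1.4.x direct-Kafka API (the checkpoint directory, broker address, and topic name are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ResumableDirectConsumer {
  // Hypothetical checkpoint location; must survive restarts (HDFS, S3, etc.)
  val checkpointDir = "hdfs:///tmp/my-app-checkpoints"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("resumable-direct-kafka")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("events")
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // ... process the batch ...
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this invokes createContext(); after a crash or restart
    // it rebuilds the context, including the last consumed Kafka offsets,
    // from the checkpoint data instead of calling createContext() again.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that if the driver is restarted with a fresh context (bypassing getOrCreate), the checkpointed offsets are ignored and auto.offset.reset takes over, which would match the behavior described below.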
> We're asking for this as an enhancement request. SPARK-8833 states: "I am
> waiting for getting enough usecase to float in before I take a final call."
> We're adding to that.
> In the meantime, can you clarify the confusion? Does direct streaming
> persist the progress information into "DStream checkpoints" or does it not?
> If it does, why is it that we're not seeing that happen? Our consumers start
> with auto.offset.reset=greatest, which causes them to read from the first
> offset of data written to Kafka *after* the consumer has been restarted,
> meaning we're missing data that came in while the consumer was down.
> If the progress is stored in "DStream checkpoints", we want to know a) how to
> make that work for us and b) where the said checkpointing data is stored
> physically.
> Conversely, if this is not accurate, is our only choice to manually persist
> the offsets into ZooKeeper? If that is the case, then a) we'd like a clear,
> more complete code sample to be published, since the one in the Kafka
> streaming guide is incomplete (it lacks the actual lines of code persisting
> the offsets), and b) we'd like to request that SPARK-8833 be revisited as a
> feature worth implementing in the API.
> Thanks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
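On the manual route: the piece the guide omits can be sketched with the Spark 1.4.x HasOffsetRanges API. Here saveOffsets and loadOffsets are hypothetical helpers for whatever store you choose (ZooKeeper, a database, etc.); everything else is the actual direct-stream API:

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Hypothetical: read back the offsets you persisted on the previous run.
def loadOffsets(): Map[TopicAndPartition, Long] = ???
// Hypothetical: write one partition's high-water mark to your store.
def saveOffsets(topic: String, partition: Int, untilOffset: Long): Unit = ???

def buildStream(ssc: StreamingContext, kafkaParams: Map[String, String]) = {
  val fromOffsets = loadOffsets()
  val messageHandler =
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

  // The five-type-parameter overload lets you start from explicit offsets
  // instead of auto.offset.reset.
  val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets, messageHandler)

  stream.foreachRDD { rdd =>
    // Capture the offset ranges before any transformation loses the KafkaRDD.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // ... process the batch ...

    // Persist only after processing succeeds, for at-least-once semantics.
    offsetRanges.foreach { o =>
      saveOffsets(o.topic, o.partition, o.untilOffset)
    }
  }
}
```

Persisting after processing gives at-least-once delivery; persisting before would give at-most-once. Either way the offsets live in a store the application controls, independent of DStream checkpoints.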