Dmitry Goldenberg created SPARK-9434:
----------------------------------------

             Summary: Need how-to for resuming direct Kafka streaming consumers 
where they had left off before getting terminated, OR actual support for that 
mode in the Streaming API
                 Key: SPARK-9434
                 URL: https://issues.apache.org/jira/browse/SPARK-9434
             Project: Spark
          Issue Type: Improvement
          Components: Documentation, Examples, Streaming
    Affects Versions: 1.4.1
            Reporter: Dmitry Goldenberg


We've been receiving mixed information about how to make our direct 
streaming consumers resume processing from the Kafka offsets where they left 
off.

On the one hand, we're hearing: "If you are restarting the streaming app 
with Direct kafka from the checkpoint information (that is, restarting), then 
the last read offsets are automatically recovered, and the data will start 
processing from that offset. All the N records added in T will stay buffered in 
Kafka." (where T is the interval of time during which the consumer was down).
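For context, the recovery pattern that quote describes can be sketched in 
plain Python. This is NOT Spark or Kafka API code; the checkpoint file and the 
helper functions are illustrative stand-ins for the DStream checkpoint 
mechanism:

```python
import json
import os
import tempfile

def load_offsets(checkpoint_path):
    """Return the last saved partition->offset map, or None on first run."""
    if not os.path.exists(checkpoint_path):
        return None
    with open(checkpoint_path) as f:
        return json.load(f)

def save_offsets(checkpoint_path, offsets):
    """Persist the partition->offset map after a batch completes."""
    with open(checkpoint_path, "w") as f:
        json.dump(offsets, f)

# Simulated run: process one batch, save offsets, "restart", and resume.
checkpoint = os.path.join(tempfile.mkdtemp(), "offsets.json")

offsets = load_offsets(checkpoint) or {"topic-0": 0}   # first start
offsets["topic-0"] += 5                                # a batch of 5 records
save_offsets(checkpoint, offsets)

resumed = load_offsets(checkpoint)                     # after the restart
print(resumed)  # {'topic-0': 5} -- resume from offset 5, no gap
```

If direct streaming really does this for us via checkpoints, the remaining 
question is only where that state lives and why we aren't seeing the recovery.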

On the other hand, there are tickets such as SPARK-6249 and SPARK-8833, both 
marked as "Won't Fix", which seem to ask for exactly the functionality we need, 
with comments like "I don't want to add more config options with confusing 
semantics around what is being used for the system of record for offsets, I'd 
rather make it easy for people to explicitly do what they need."

The use case is actually very clear and doesn't require confusing semantics. An 
API option to resume reading where you left off, alongside the existing 
smallest and largest values of auto.offset.reset, would be *very* useful, 
probably for quite a few folks.

We're filing this as an enhancement request. SPARK-8833 states "I am waiting 
for getting enough usecase to float in before I take a final call." We're 
adding our use case to that list.

In the meantime, can you clarify the confusion? Does direct streaming persist 
the progress information into "DStream checkpoints" or does it not? If it 
does, why are we not seeing that happen? Our consumers start with 
auto.offset.reset=largest, which causes them to read only from the first 
offset written to Kafka *after* the consumer has been restarted, meaning we 
miss any data that came in while the consumer was down.
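To make the gap concrete, here is a toy model in plain Python (not Kafka 
client code; the log and offsets are simulated) of what we observe with 
auto.offset.reset=largest versus resuming from a saved offset:

```python
log = list(range(15))        # offsets 0..14 written to one partition
last_processed = 4           # consumer was killed after processing offset 4
log_end_at_restart = 10      # offsets 5..9 arrived while it was down

# auto.offset.reset=largest: start from the log end seen at restart time.
seen_with_largest = log[log_end_at_restart:]

# Resuming from saved offsets: start right after the last processed one.
seen_with_resume = log[last_processed + 1:]

missed = sorted(set(seen_with_resume) - set(seen_with_largest))
print(missed)  # [5, 6, 7, 8, 9] -- the records lost with reset=largest
```

Those five offsets are exactly the data we are losing across every restart.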

If the progress is stored in "DStream checkpoints", we want to know a) how to 
make that work for us and b) where that checkpoint data is physically stored.

Conversely, if checkpoint recovery is not the intended mechanism, is our only 
choice to manually persist the offsets into ZooKeeper? If so, then a) we'd 
like a clearer, more complete code sample to be published, since the one in 
the Kafka streaming guide is incomplete (it lacks the actual lines of code 
that persist the offsets), and b) we'd like SPARK-8833 to be revisited as a 
feature worth implementing in the API.
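For the record, here is the shape of the manual pattern we believe the guide 
leaves out, sketched in plain Python. The dict stands in for ZooKeeper, the 
OffsetRange tuple for the per-batch offset metadata Spark exposes, and all 
names are illustrative, not Spark or ZooKeeper API:

```python
from collections import namedtuple

# Stand-in for the per-batch offset metadata of the direct stream.
OffsetRange = namedtuple("OffsetRange",
                         "topic partition from_offset until_offset")

zk_store = {}  # stand-in for one ZooKeeper znode per topic/partition

def commit_offsets(ranges):
    """Persist the end offset of each processed range -- in spirit, the
    lines missing from the streaming guide's example."""
    for r in ranges:
        zk_store[(r.topic, r.partition)] = r.until_offset

def read_start_offsets(topic, partitions):
    """On restart, read the committed offsets to rebuild the starting
    positions (defaulting to 0 for partitions never seen before)."""
    return {(topic, p): zk_store.get((topic, p), 0) for p in partitions}

# One micro-batch processed, then a simulated restart:
commit_offsets([OffsetRange("events", 0, 0, 42),
                OffsetRange("events", 1, 0, 17)])
print(read_start_offsets("events", [0, 1]))
# {('events', 0): 42, ('events', 1): 17}
```

If this is really what every direct-stream user has to write, we'd argue it 
belongs either in the guide verbatim or behind an API option.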

Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
