[ https://issues.apache.org/jira/browse/STORM-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346232#comment-16346232 ]

Jungtaek Lim commented on STORM-2914:
-------------------------------------

[~hmclouro]
https://github.com/apache/flink/blob/1440e4febd589e320f846a2725e98aec8ee43e7f/docs/dev/connectors/kafka.md#kafka-consumers-offset-committing-behaviour-configuration

Flink clearly documents how the "enable.auto.commit" option interacts with 
Flink's checkpointing mechanism. The key statement is below:

{quote}
"Note that the Flink Kafka Consumer does not rely on the committed offsets for 
fault tolerance guarantees. The committed offsets are only a means to expose 
the consumer's progress for monitoring purposes."
{quote}

Based on that statement, Flink has no concerns about committing to Kafka, 
which is very different from Storm, since storm-kafka-client relies entirely 
on the offsets committed to Kafka.

Here's the Spark doc for the streaming Kafka source:
https://github.com/apache/spark/blob/55b8cfe6e6a6759d65bf219ff570fd6154197ec4/docs/streaming-kafka-0-10-integration.md#storing-offsets

The relevant statement is below:

{quote}
Kafka has an offset commit API that stores offsets in a special Kafka topic. By 
default, the new consumer will periodically auto-commit offsets. This is almost 
certainly not what you want, because messages successfully polled by the 
consumer may not yet have resulted in a Spark output operation, resulting in 
undefined semantics. This is why the stream example above sets 
"enable.auto.commit" to false.
{quote}

The Spark community also avoids coupling with "enable.auto.commit". 
Checkpointing is done a different way, and even in the guide on storing 
offsets in Kafka itself, they don't recommend using that option.
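
For concreteness, the consumer configuration both communities recommend boils 
down to a fragment like this (a minimal sketch; the group id is a placeholder):

{code}
# Disable Kafka's periodic auto-commit so the framework decides when
# offsets are committed (and what, if anything, goes into the commit).
enable.auto.commit=false
group.id=my-spout-group
{code}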

cc. [~Srdo] [~avermeerbergen]

> Remove enable.auto.commit support from storm-kafka-client
> ---------------------------------------------------------
>
>                 Key: STORM-2914
>                 URL: https://issues.apache.org/jira/browse/STORM-2914
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-kafka-client
>    Affects Versions: 2.0.0, 1.2.0
>            Reporter: Stig Rohde Døssing
>            Assignee: Stig Rohde Døssing
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The enable.auto.commit option causes the KafkaConsumer to periodically commit 
> the latest offsets it has returned from poll(). It is convenient for use 
> cases where messages are polled from Kafka and processed synchronously, in a 
> loop. 
> Due to https://issues.apache.org/jira/browse/STORM-2913 we'd really like to 
> store some metadata in Kafka when the spout commits. This is not possible 
> with enable.auto.commit. I took a look at what that setting actually does, 
> and it just causes the KafkaConsumer to call commitAsync during poll (and 
> during a few other operations, e.g. close and assign) at some interval. 
> Ideally I'd like to get rid of ProcessingGuarantee.NONE, since I think 
> ProcessingGuarantee.AT_MOST_ONCE covers the same use cases, and is likely 
> almost as fast. The primary difference between them is that AT_MOST_ONCE 
> commits synchronously.
> If we really want to keep ProcessingGuarantee.NONE, I think we should make 
> our ProcessingGuarantee.NONE setting cause the spout to call commitAsync 
> after poll, and never use the enable.auto.commit option. This allows us to 
> include metadata in the commit.
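
For illustration, the alternative the description proposes (calling 
commitAsync after poll with metadata attached, instead of relying on 
enable.auto.commit) could be sketched roughly as below. This is a hedged 
sketch against the plain kafka-clients API, not storm-kafka-client's actual 
code; the bootstrap server, group id, topic name, and metadata string are all 
made up, and running it requires a Kafka broker:

{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ManualCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "spout-sketch");            // placeholder
        props.put("enable.auto.commit", "false");         // spout controls commits
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic")); // placeholder
            while (true) {
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(200));
                Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
                for (ConsumerRecord<String, String> r : records) {
                    // OffsetAndMetadata can carry an arbitrary metadata string;
                    // the internal commits made under enable.auto.commit cannot.
                    toCommit.put(new TopicPartition(r.topic(), r.partition()),
                        new OffsetAndMetadata(r.offset() + 1, "{\"spoutId\":\"sketch\"}"));
                }
                if (!toCommit.isEmpty()) {
                    // Async commit after poll, mirroring what auto-commit does
                    // internally, but with the metadata under our control.
                    consumer.commitAsync(toCommit, null);
                }
            }
        }
    }
}
{code}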



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
