Re: [VOTE] KIP-572: Improve timeouts and retires in Kafka Streams

Matthias J. Sax Mon, 20 Jul 2020 14:41:04 -0700

While working on the PR, we realized that the command line tool

  bin/kafka-console-producer.sh


has a flag `--message-send-max-retries` to set the producer's `retries`
config. We also need to deprecate this flag.

I updated the KIP accordingly. Please let us know if there are any
concerns to this minor change to the KIP.

Thanks.


-Matthias

On 6/10/20 11:16 AM, Matthias J. Sax wrote:
> Thanks!
> 
> +1 (binding) from myself.
> 
> 
> I am closing the vote as accepted with 3 binding and 3 non-binding votes.
> 
> binding:
>  - John
>  - Guozhang
>  - Matthias
> 
> non-binding:
>  - Sophie
>  - Boyang
>  - Bruno
> 
> 
> 
> -Matthias
> 
> On 6/4/20 5:26 PM, Matthias J. Sax wrote:
>> Guozhang,
>>
>> what you propose makes sense, but this seems to get into implementation
>> detail territory already?
>>
>> Thus, if there are nor further change requests to the KIP wiki page
>> itself, I would like to proceed with the VOTE.
>>
>>
>> -Matthias
>>
>> On 5/20/20 12:30 PM, Guozhang Wang wrote:
>>> Thanks Matthias,
>>>
>>> I agree with you on all the bullet points above. Regarding the admin-client
>>> outer-loop retries inside partition assignor, I think we should treat error
>>> codes differently from those two blocking calls:
>>>
>>> Describe-topic:
>>> * unknown-topic (3): add this topic to the to-be-created topic list.
>>> * leader-not-available (5): do not try to create, retry in the outer loop.
>>> * request-timeout: break the current loop and retry in the outer loop.
>>> * others: fatal error.
>>>
>>> Create-topic:
>>> * topic-already-exists: retry in the outer loop to validate the
>>> num.partitions match expectation.
>>> * request-timeout: break the current loop and retry in the outer loop.
>>> * others: fatal error.
>>>
>>> And in the outer-loop, I think we can have a global timer for the whole
>>> "assign()" function, not only for the internal-topic-manager, and the timer
>>> can be hard-coded with, e.g. half of the rebalance.timeout to get rid of
>>> the `retries`; if we cannot complete the assignment before the timeout runs
>>> out, we can return just the partial assignment (e.g. if there are two
>>> tasks, but we can only get the topic metadata for one of them, then just do
>>> the assignment for that one only) while encoding in the error-code field to
>>> request for another rebalance.
>>>
>>> Guozhang
>>>
>>>
>>>
>>> On Mon, May 18, 2020 at 7:26 PM Matthias J. Sax <mj...@apache.org> wrote:
>>>
>>>> No worries Guozhang, any feedback is always very welcome! My reply is
>>>> going to be a little longer... Sorry.
>>>>
>>>>
>>>>
>>>>> 1) There are some inconsistent statements in the proposal regarding what
>>>> to
>>>>> deprecated:
>>>>
>>>> The proposal of the KIP is to deprecate `retries` for producer, admin,
>>>> and Streams. Maybe the confusion is about the dependency of those
>>>> settings within Streams and that we handle the deprecation somewhat
>>>> different for plain clients vs Streams:
>>>>
>>>> For plain producer/admin the default `retries` is set to MAX_VALUE. The
>>>> config will be deprecated but still be respected.
>>>>
>>>> For Streams, the default `retries` is set to zero, however, this default
>>>> retry does _not_ affect the embedded producer/admin clients -- both
>>>> clients stay on their own default of MAX_VALUES.
>>>>
>>>> Currently, this introduces the issue, that if a user wants to increase
>>>> Streams retries, she might by accident reduce the embedded client
>>>> retries, too. To avoid this issue, she would need to set
>>>>
>>>> retries=100
>>>> producer.retires=MAX_VALUE
>>>> admin.retries=MAX_VALUE
>>>>
>>>> This KIP will fix this issue only in the long term though, ie, when
>>>> `retries` is finally removed. Short term, using `retries` in
>>>> StreamsConfig would still affect the embedded clients, but Streams, as
>>>> well as both client would log a WARN message. This preserves backward
>>>> compatibility.
>>>>
>>>> Withing Streams `retries` is ignored and the new `task.timeout.ms` is
>>>> used instead. This increase the default resilience of Kafka Streams
>>>> itself. We could also achieve this by still respecting `retries` and to
>>>> change it's default value. However, because we deprecate `retries` it
>>>> seems better to just ignore it and switch to the new config directly.
>>>>
>>>> I updated the KIPs with some more details.
>>>>
>>>>
>>>>
>>>>> 2) We should also document the related behavior change in
>>>> PartitionAssignor
>>>>> that uses AdminClient.
>>>>
>>>> This is actually a good point. Originally, I looked into this only
>>>> briefly, but it raised an interesting question on how to handle it.
>>>>
>>>> Note that `TimeoutExceptions` are currently not handled in this retry
>>>> loop. Also note that the default retries value for other errors would be
>>>> MAX_VALUE be default (inherited from `AdminClient#retries` as mentioned
>>>> already by Guozhang).
>>>>
>>>> Applying the new `task.timeout.ms` config does not seem to be
>>>> appropriate because the AdminClient is used during a rebalance in the
>>>> leader. We could introduce a new config just for this case, but it seems
>>>> to be a little bit too much. Furthermore, the group-coordinator applies
>>>> a timeout on the leader anyway: if the assignment is not computed within
>>>> the timeout, the leader is removed from the group and another rebalance
>>>> is triggered.
>>>>
>>>> Overall, we make multiple admin client calls and thus we should keep
>>>> some retry logic and not just rely on the admin client internal retries,
>>>> as those would fall short to retry different calls interleaved. We could
>>>> just retry infinitely and rely on the group coordinator to remove the
>>>> leader eventually. However, this does not seem to be ideal because the
>>>> removed leader might be stuck forever.
>>>>
>>>> The question though is: if topic metadata cannot be obtained or internal
>>>> topics cannot be created, what should we do? We can't compute an
>>>> assignment anyway. We have already an rebalance error code to shut down
>>>> all instances for this case. Maybe we could break the retry loop before
>>>> the leader is kicked out of the group and send this error code? This way
>>>> we don't need a new config, but piggy-back on the existing timeout to
>>>> compute the assignment. To be conservative, we could use a 50% threshold?
>>>>
>>>>
>>>>
>>>>> BTW as I mentioned in the previous statement, today throwing an exception
>>>>> that kills one thread but not the whole instance is still an issue for
>>>>> monitoring purposes, but I suppose this is not going to be in this KIP
>>>> but
>>>>> addressed by another KIP, right?
>>>>
>>>> Correct. This issue if out-of-scope.
>>>>
>>>>
>>>>
>>>>> for consumer, if it gets a
>>>>> TimeoutException polling records, would start timing all tasks since that
>>>>> single consumer would affect all tasks?
>>>>
>>>> Consumer#poll() would never throw a `TimeoutException` and thus
>>>> `task.timeout.ms` does not apply.
>>>>
>>>>
>>>>
>>>>> For other blocking calls like
>>>>> `endOffsets()` etc, they are usually also issued on behalf of a batch of
>>>>> tasks, so if that gets timeout exception should we start ticking all the
>>>>> corresponding tasks as well? Maybe worth clarifying a bit more in the
>>>> wiki.
>>>>
>>>> Good point. I agree that the timer should tick for all affected tasks. I
>>>> clarified in the KIP.
>>>>
>>>>
>>>>
>>>> About KAFKA-6520:
>>>>
>>>> There is already KIP-457 and I am not sure this KIP should subsume it.
>>>> It somewhat feels orthogonal. I am also not 100% sure if KIP-572
>>>> actually helps much, because a thread could be disconnected to the
>>>> brokers without throwing any timeout exception: if all tasks are RUNNING
>>>> and just polling for new data, but no progress is made because of a
>>>> network issue, `poll()` would just return no data but not through.
>>>>
>>>>
>>>>
>>>> @Bruno
>>>>
>>>>> Wouldn't it be better to specify
>>>>> task.timeout.ms to -1 if no retry should be done
>>>>
>>>> Interesting idea. Personally I find `-1` confusing. And it seems
>>>> intuitive to me that `0` implies no retries (this seems to be in
>>>> alignment to other APIs).
>>>>
>>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>> On 5/18/20 9:53 AM, Guozhang Wang wrote:
>>>>> Hi Matthias,
>>>>>
>>>>> Sorry for flooding the thread, but with this KIP I feel the design scope
>>>> of
>>>>> https://issues.apache.org/jira/browse/KAFKA-6520 can be simplified a lot
>>>>> and may it the design can be just piggy-backed as part of this KIP, wdyt?
>>>>>
>>>>> Guozhang
>>>>>
>>>>>
>>>>> On Mon, May 18, 2020 at 9:47 AM Guozhang Wang <wangg...@gmail.com>
>>>> wrote:
>>>>>
>>>>>> Hi Matthias,
>>>>>>
>>>>>> Just to add one more meta comment: for consumer, if it gets a
>>>>>> TimeoutException polling records, would start timing all tasks since
>>>> that
>>>>>> single consumer would affect all tasks? For other blocking calls like
>>>>>> `endOffsets()` etc, they are usually also issued on behalf of a batch of
>>>>>> tasks, so if that gets timeout exception should we start ticking all the
>>>>>> corresponding tasks as well? Maybe worth clarifying a bit more in the
>>>> wiki.
>>>>>>
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>> On Mon, May 18, 2020 at 12:26 AM Bruno Cadonna <br...@confluent.io>
>>>> wrote:
>>>>>>
>>>>>>> Hi Matthias,
>>>>>>>
>>>>>>> I am +1 (non-binding) on the KIP.
>>>>>>>
>>>>>>> Just one final remark: Wouldn't it be better to specify
>>>>>>> task.timeout.ms to -1 if no retry should be done? IMO it would make
>>>>>>> the config more intuitive because 0 would not have two possible
>>>>>>> meanings (i.e. try once and never try) anymore.
>>>>>>>
>>>>>>> Best,
>>>>>>> Bruno
>>>>>>>
>>>>>>> On Sat, May 16, 2020 at 7:51 PM Guozhang Wang <wangg...@gmail.com>
>>>> wrote:
>>>>>>>>
>>>>>>>> Hello Matthias,
>>>>>>>>
>>>>>>>> Thanks for the updated KIP, overall I'm +1 on this proposal. Some
>>>> minor
>>>>>>>> comments (I know gmail mixed that again for me so I'm leaving it as a
>>>>>>> combo
>>>>>>>> for both DISCUSS and VOTE :)
>>>>>>>>
>>>>>>>> 1) There are some inconsistent statements in the proposal regarding
>>>>>>> what to
>>>>>>>> deprecated: at the beginning it says "We propose to deprecate the
>>>>>>> retries
>>>>>>>> configuration parameter for the producer and admin client" but later
>>>> in
>>>>>>>> compatibility we say "Producer and admin client behavior does not
>>>>>>> change;
>>>>>>>> both still respect retries config." My understanding is that we will
>>>>>>> only
>>>>>>>> deprecate the StreamsConfig#retries while not touch on
>>>>>>>> ProducerConfig/AdminClientConfig#retries, AND we will always override
>>>>>>> the
>>>>>>>> embedded producer / admin retries config to infinity so that we never
>>>>>>> rely
>>>>>>>> on those configs but always bounded by the timeout config. Is that
>>>>>>>> right, if yes could you clarify in the doc?
>>>>>>>>
>>>>>>>> 2) We should also document the related behavior change in
>>>>>>> PartitionAssignor
>>>>>>>> that uses AdminClient. More specifically, the admin client's retries
>>>>>>> config
>>>>>>>> is piggy-backed inside InternalTopicManager as an outer-loop retry
>>>>>>> logic in
>>>>>>>> addition to AdminClient's own inner retry loop. There are some
>>>> existing
>>>>>>>> issues like KAFKA-9999 / 10006 that Sophie and Boyang has been working
>>>>>>> on.
>>>>>>>> I exchanged some ideas with them, and generally we should consider if
>>>>>>> the
>>>>>>>> outer-loop of InternalTopicManager should just be removed and if we
>>>> got
>>>>>>>> TimeoutException we should just trigger another rebalance etc.
>>>>>>>>
>>>>>>>> BTW as I mentioned in the previous statement, today throwing an
>>>>>>> exception
>>>>>>>> that kills one thread but not the whole instance is still an issue for
>>>>>>>> monitoring purposes, but I suppose this is not going to be in this KIP
>>>>>>> but
>>>>>>>> addressed by another KIP, right?
>>>>>>>>
>>>>>>>>
>>>>>>>> Guozhang
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, May 15, 2020 at 1:14 PM Boyang Chen <
>>>> reluctanthero...@gmail.com
>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Good, good.
>>>>>>>>>
>>>>>>>>> Read through the discussions, the KIP looks good to me, +1
>>>>>>> (non-binding)
>>>>>>>>>
>>>>>>>>> On Fri, May 15, 2020 at 11:51 AM Sophie Blee-Goldman <
>>>>>>> sop...@confluent.io>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Called out!
>>>>>>>>>>
>>>>>>>>>> Seems like gmail struggles with [...] prefixed subjects. You'd
>>>>>>> think they
>>>>>>>>>> would adapt
>>>>>>>>>> all their practices to conform to the Apache Kafka mailing list
>>>>>>>>> standards,
>>>>>>>>>> but no!
>>>>>>>>>>
>>>>>>>>>> +1 (non-binding) by the way
>>>>>>>>>>
>>>>>>>>>> On Fri, May 15, 2020 at 11:46 AM John Roesler <vvcep...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Boyang,
>>>>>>>>>>>
>>>>>>>>>>> It is a separate thread, and you have just revealed yourself as a
>>>>>>> gmail
>>>>>>>>>>> user ;)
>>>>>>>>>>>
>>>>>>>>>>> (Gmail sometimes conflates vote and discuss threads for no
>>>>>>> apparent
>>>>>>>>>> reason
>>>>>>>>>>> )
>>>>>>>>>>>
>>>>>>>>>>> -John
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 15, 2020, at 13:39, Boyang Chen wrote:
>>>>>>>>>>>> Hey Matthias, should this be on a separate VOTE thread?
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 15, 2020 at 11:38 AM John Roesler <
>>>>>>> vvcep...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Matthias!
>>>>>>>>>>>>>
>>>>>>>>>>>>> I’m +1 (binding)
>>>>>>>>>>>>>
>>>>>>>>>>>>> -John
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 15, 2020, at 11:55, Matthias J. Sax wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to start the vote on KIP-572:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-572%3A+Improve+timeouts+and+retries+in+Kafka+Streams
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Attachments:
>>>>>>>>>>>>>> * signature.asc
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> -- Guozhang
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -- Guozhang
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

signature.asc
Description: OpenPGP digital signature

Re: [VOTE] KIP-572: Improve timeouts and retires in Kafka Streams

Reply via email to