Hi Guozhang,

I see, we agree that the topic threshold should not apply to urgent topics,
but we differ slightly on what should be considered urgent. I would argue
that we should consider topics nearing metadata.max.age.ms to be urgent,
since they may still be well within metadata.expiry.ms. That is, the client
still considers these topics to be relevant (not expired), but doesn't want
to incur the latency bubble of having to wait for the metadata to be
re-fetched if it's stale. This could be a frequent case if
metadata.max.age.ms << metadata.expiry.ms.

In practice, I wouldn't expect this to make a noticeable difference, so I
don't have a strong leaning, but the current behavior is to aggressively
refresh the metadata of stale topics by ensuring a refresh is triggered
before the metadata.max.age.ms duration elapses.
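To make the distinction concrete, here's a rough sketch of the check I have
in mind (the class and field names are made up for illustration; this is not
the actual producer code):

```java
// Illustrative only: a topic's metadata entry is "urgent" when it is still
// relevant (not past metadata.expiry.ms) but its cached metadata is nearing
// metadata.max.age.ms, so we refresh proactively rather than letting a
// later send() hit a latency bubble on stale metadata.
final class TopicMetadataState {
    final long lastRefreshMs; // when this topic's metadata was last fetched
    final long lastUsedMs;    // when the producer last sent to this topic

    TopicMetadataState(long lastRefreshMs, long lastUsedMs) {
        this.lastRefreshMs = lastRefreshMs;
        this.lastUsedMs = lastUsedMs;
    }

    // Expired: the topic hasn't been used for metadata.expiry.ms, so it is
    // dropped from the working set entirely.
    boolean isExpired(long nowMs, long expiryMs) {
        return nowMs - lastUsedMs >= expiryMs;
    }

    // Urgent: not expired, but the cached metadata has reached
    // metadata.max.age.ms, so a refresh should be triggered now.
    boolean isUrgent(long nowMs, long maxAgeMs, long expiryMs) {
        return !isExpired(nowMs, expiryMs) && nowMs - lastRefreshMs >= maxAgeMs;
    }
}
```

So with, say, metadata.max.age.ms = 5min and metadata.expiry.ms = 1hr, a
topic that was used recently but not refreshed for 5 minutes would be
refreshed proactively instead of waiting for a send() to observe stale
metadata.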

Thanks,
Brian


On Mon, Dec 9, 2019 at 11:57 AM Guozhang Wang <wangg...@gmail.com> wrote:

> Hello Brian,
>
> Thanks for your explanation. Could you then update the wiki page for the
> algorithm part? When I read it, I thought it was different from the above,
> e.g. that urgent topics should not be added just because of max.age
> expiration, but should only be added if there is send data pending.
>
>
> Guozhang
>
> On Mon, Dec 9, 2019 at 10:57 AM Brian Byrne <bby...@confluent.io> wrote:
>
> > Hi Guozhang,
> >
> > Thanks for the feedback!
> >
> > On Sun, Dec 8, 2019 at 6:25 PM Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > 1. The addition of *metadata.expiry.ms <http://metadata.expiry.ms>
> > *should
> > > be included in the public interface. Also its semantics needs more
> > > clarification (since previously it is hard-coded internally we do not
> > need
> > > to explain it publicly, but now with the configurable value we do
> need).
> > >
> >
> > This was an oversight. Done.
> >
> >
> > > 2. There are a couple of hard-coded parameters like 25 and 0.5 in the
> > > proposal; maybe we need to explain why these magic values make sense
> > > in common scenarios.
> > >
> >
> > So these are pretty fuzzy numbers, and seemed to be a decent balance
> > between trade-offs. I've updated the target size to account for setups
> > with a large number of topics or a shorter refresh time, as well as
> > added some light rationale.
> >
> >
> > > 3. In the Urgent set condition, do you actually mean "with no cached
> > > metadata AND there are existing data buffered for the topic"?
> > >
> >
> > Yes, fixed.
> >
> >
> >
> > > One concern I have is whether we may introduce a regression,
> > > especially during producer startup: since we only request up to 25
> > > topics per request, send data may be buffered longer than it is today
> > > due to metadata not being available. I understand this is an
> > > acknowledged trade-off in our design, but any regression that may
> > > surface to users needs to be very carefully considered. I'm wondering,
> > > e.g., if we can tweak our algorithm for the Urgent set to give topics
> > > with no cached metadata higher priority than those that have elapsed
> > > max.age but have not yet been called for sending. More specifically:
> > >
> > > Urgent: topics that have been requested for sending but have no cached
> > > metadata, and topics whose send requests have failed with e.g.
> > > NOT_LEADER.
> > > Non-urgent: topics that are not in Urgent but have expired max.age.
> > >
> > > Then when sending metadata requests, we always send ALL topics in the
> > > Urgent set (i.e. ignore the size limit), and only when they do not
> > > exceed the size limit do we consider filling in more topics from
> > > Non-urgent, up to the size limit.
> > >
> >
> > I think we're on the same page here. Urgent topics ignore the target
> > metadata RPC size, and are not bound by it in any way, i.e. if there are
> > 100 urgent topics, we'll fetch all 100 in a single RPC. Like you say,
> > however, if a topic transitions to urgent and there are several
> > non-urgent ones, we'll piggyback the non-urgent updates up to the target
> > size.
> >
> > Thanks,
> >  Brian
> >
> >
> > > On Wed, Nov 20, 2019 at 7:00 PM deng ziming <dengziming1...@gmail.com> wrote:
> > >
> > > > I think it's ok, and you can add another issue about `asynchronous
> > > > metadata` if `topic expiry` is not enough.
> > > >
> > > >
> > > > On Thu, Nov 21, 2019 at 6:20 AM Brian Byrne <bby...@confluent.io> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I've refactored the KIP to remove implementing asynchronous
> > > > > metadata fetching in the producer during send(). It's now
> > > > > exclusively focused on reducing the topic metadata fetch payload
> > > > > and proposes adding a new configuration flag to control topic
> > > > > expiry behavior. Please take a look when possible.
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-526%3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics
> > > > >
> > > > > Thanks,
> > > > > Brian
> > > > >
> > > > > On Fri, Oct 4, 2019 at 10:04 AM Brian Byrne <bby...@confluent.io> wrote:
> > > > >
> > > > > > Lucas, Guozhang,
> > > > > >
> > > > > > Thank you for the comments. Good point on METADATA_MAX_AGE_CONFIG -
> > > > > > it looks like the ProducerMetadata was differentiating between
> > > > > > expiry and refresh, but it should be unnecessary to do so once the
> > > > > > cost of fetching a single topic's metadata is reduced.
> > > > > >
> > > > > > I've updated the rejected alternatives and removed the config
> > > > > > variables.
> > > > > >
> > > > > > Brian
> > > > > >
> > > > > > On Fri, Oct 4, 2019 at 9:20 AM Guozhang Wang <wangg...@gmail.com> wrote:
> > > > > >
> > > > > >> Hello Brian,
> > > > > >>
> > > > > >> Thanks for the KIP.
> > > > > >>
> > > > > >> I think using asynchronous metadata updates addresses 1),
> > > > > >> metadata updates blocking send. For the other issues: the
> > > > > >> producer currently does have a configurable
> > > > > >> `METADATA_MAX_AGE_CONFIG` similar to the consumer, 5min by
> > > > > >> default. So maybe we do not need to introduce new configs here,
> > > > > >> but only change the semantics of that config from global expiry
> > > > > >> (today we just enforce a full metadata update for the whole
> > > > > >> cluster) to single-topic expiry, and we can also extend its
> > > > > >> expiry deadline whenever that metadata is successfully used to
> > > > > >> send a produce request.
> > > > > >>
> > > > > >>
> > > > > >> Guozhang
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On Thu, Oct 3, 2019 at 6:51 PM Lucas Bradstreet <lu...@confluent.io> wrote:
> > > > > >>
> > > > > >> > Hi Brian,
> > > > > >> >
> > > > > >> > This looks great, and should help reduce blocking and high
> > > > > >> > metadata request volumes when the producer is sending to large
> > > > > >> > numbers of topics, especially at low volumes. I think the
> > > > > >> > approach to make metadata fetching asynchronous and batch
> > > > > >> > metadata requests together will help significantly.
> > > > > >> >
> > > > > >> > The only other approach I can think of is to allow users to
> > > > > >> > supply the producer with the expected topics upfront, allowing
> > > > > >> > the producer to perform a single initial metadata request
> > > > > >> > before any sends occur. I see no real advantages to this
> > > > > >> > approach compared to the async method you've proposed, but
> > > > > >> > maybe we could add it to the rejected alternatives section?
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> >
> > > > > >> > Lucas
> > > > > >> >
> > > > > >> > On Fri, 20 Sep 2019 at 11:46, Brian Byrne <bby...@confluent.io> wrote:
> > > > > >> >
> > > > > >> > > I've updated the 'Proposed Changes' to include two new
> > > > > >> > > producer configuration variables: topic.expiry.ms and
> > > > > >> > > topic.refresh.ms. Please take a look.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Brian
> > > > > >> > >
> > > > > >> > > On Tue, Sep 17, 2019 at 12:59 PM Brian Byrne <bby...@confluent.io> wrote:
> > > > > >> > >
> > > > > >> > > > Dev team,
> > > > > >> > > >
> > > > > >> > > > Requesting discussion for improvement to the producer when
> > > > > >> > > > dealing with a large number of topics.
> > > > > >> > > >
> > > > > >> > > > KIP:
> > > > > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-526%3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics
> > > > > >> > > >
> > > > > >> > > > JIRA: https://issues.apache.org/jira/browse/KAFKA-8904
> > > > > >> > > >
> > > > > >> > > > Thoughts and feedback would be appreciated.
> > > > > >> > > >
> > > > > >> > > > Thanks,
> > > > > >> > > > Brian
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> -- Guozhang
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>
