Hey Brian,

1. Could we more explicitly clarify the behavior of the algorithm when
`|T| > TARGET_METADATA_FETCH_SIZE`? I assume we ignore the config in that
scenario.
2. Should `targetMetadataFetchSize = Math.max(topicsPerSec / 10, 20)` be
`topicsPerSec * 10`? (See the sketch after this list.)
3. When is this new algorithm applied? To confirm my understanding: what
is the behavior of `metadata.max.age.ms` after this KIP? Are we adding a
new, more proactive metadata fetch for topics in |U|?
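
To make (2) concrete, here is how I read the two variants, as a rough
sketch with an assumed rate (the names come from the KIP text, the numbers
are mine):

    // Hypothetical producer observing topicsPerSec = 1000:
    int asWritten  = Math.max(1000 / 10, 20); // = 100 topics per fetch
    int ifTimesTen = Math.max(1000 * 10, 20); // = 10,000 topics per fetch

The two readings differ by a factor of 100 here, hence the question about
the intent.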

Thanks,
Stanislav

On Thu, Dec 19, 2019 at 11:37 PM Brian Byrne <bby...@confluent.io> wrote:

> Hello everyone,
>
> For all interested, please take a look at the proposed algorithm as I'd
> like to get more feedback. I'll call for a vote once the break is over.
>
> Thanks,
> Brian
>
> On Mon, Dec 9, 2019 at 10:18 PM Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Sounds good, I agree that should not make a big difference in practice.
> >
> > On Mon, Dec 9, 2019 at 2:07 PM Brian Byrne <bby...@confluent.io> wrote:
> >
> > > Hi Guozhang,
> > >
> > > I see, we agree on the topic threshold not applying to urgent topics,
> > > but differ slightly on what should be considered urgent. I would argue
> > > that we should consider topics nearing the metadata.max.age.ms to be
> > > urgent since they may still be well within the metadata.expiry.ms. That
> > > is, the client still considers these topics to be relevant (not
> > > expired), but doesn't want to incur the latency bubble of having to
> > > wait for the metadata to be re-fetched if it's stale. This could be a
> > > frequent case if metadata.max.age.ms << metadata.expiry.ms.
> > >
> > > In practice, I wouldn't expect this to make a noticeable difference, so
> > > I don't have a strong leaning, but the current behavior today is to
> > > aggressively refresh the metadata of stale topics by ensuring a refresh
> > > is triggered before that metadata.max.age.ms duration elapses.
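> > >
> > > As a rough sketch of the check I have in mind (the method name and the
> > > margin constant are illustrative only, not final code):
> > >
> > >     // Treat a topic as urgent once its metadata is nearing the
> > >     // metadata.max.age.ms deadline, so the refresh is triggered
> > >     // before the age elapses and a sender never blocks on stale
> > >     // metadata.
> > >     boolean nearingMaxAge(long nowMs, long lastRefreshMs, long maxAgeMs) {
> > >         return nowMs - lastRefreshMs >= maxAgeMs - REFRESH_MARGIN_MS;
> > >     }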
> > >
> > > Thanks,
> > > Brian
> > >
> > >
> > > On Mon, Dec 9, 2019 at 11:57 AM Guozhang Wang <wangg...@gmail.com> wrote:
> > >
> > > > Hello Brian,
> > > >
> > > > Thanks for your explanation. Could you then update the wiki page's
> > > > algorithm section? When I read it, I thought it was different from
> > > > the above, e.g. urgent topics should not be added just because of
> > > > max.age expiration, but only if there is send data pending.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Mon, Dec 9, 2019 at 10:57 AM Brian Byrne <bby...@confluent.io> wrote:
> > > >
> > > > > Hi Guozhang,
> > > > >
> > > > > Thanks for the feedback!
> > > > >
> > > > > On Sun, Dec 8, 2019 at 6:25 PM Guozhang Wang <wangg...@gmail.com> wrote:
> > > > >
> > > > > > 1. The addition of metadata.expiry.ms should be included in the
> > > > > > public interface. Also, its semantics need more clarification
> > > > > > (since it was previously hard-coded internally, we did not need
> > > > > > to explain it publicly, but now that it is configurable, we do).
> > > > > >
> > > > >
> > > > > This was an oversight. Done.
> > > > >
> > > > >
> > > > > > 2. There are a couple of hard-coded parameters like 25 and 0.5
> > > > > > in the proposal; maybe we need to explain why these magic values
> > > > > > make sense in common scenarios.
> > > > > >
> > > > >
> > > > > So these are pretty fuzzy numbers that seemed to strike a decent
> > > > > balance between the trade-offs. I've updated the target size to
> > > > > account for setups with a large number of topics or a shorter
> > > > > refresh time, and added some light rationale.
> > > > >
> > > > >
> > > > > > 3. In the Urgent set condition, do you actually mean "with no
> > > > > > cached metadata AND there is existing data buffered for the
> > > > > > topic"?
> > > > > >
> > > > >
> > > > > Yes, fixed.
> > > > >
> > > > >
> > > > >
> > > > > > One concern I have is whether we may introduce a regression,
> > > > > > especially during producer startup: since we only request up to
> > > > > > 25 topics per metadata request, send data may be buffered longer
> > > > > > than it is today because metadata is not yet available. I
> > > > > > understand this is an acknowledged trade-off in our design, but
> > > > > > any regression that may surface to users needs to be very
> > > > > > carefully considered. I'm wondering, e.g., if we can tweak our
> > > > > > algorithm for the Urgent set to give topics with no cached
> > > > > > metadata higher priority than those that have elapsed max.age
> > > > > > but have not yet been called for sending. More specifically:
> > > > > >
> > > > > > Urgent: topics that have been requested for sending but have no
> > > > > > cached metadata, and topics whose send requests have failed
> > > > > > with, e.g., NOT_LEADER.
> > > > > > Non-urgent: topics that are not in Urgent but have expired
> > > > > > max.age.
> > > > > >
> > > > > > Then when sending metadata requests, we always send ALL of the
> > > > > > Urgent topics (i.e. ignore the size limit), and only when they
> > > > > > do not exceed the size limit do we fill in more topics from
> > > > > > Non-urgent, up to the size limit.
> > > > > >
> > > > >
> > > > > I think we're on the same page here. Urgent topics ignore the
> > > > > target metadata RPC size and are not bound by it in any way, i.e.
> > > > > if there are 100 urgent topics, we'll fetch all 100 in a single
> > > > > RPC. Like you say, however, if a topic transitions to urgent and
> > > > > there are several non-urgent ones, we'll piggyback the non-urgent
> > > > > updates up to the target size.
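> > > > >
> > > > > In sketch form, roughly (illustrative names, not the final code):
> > > > >
> > > > >     // Urgent topics are always included, even if they alone
> > > > >     // exceed the target size; non-urgent topics only piggyback
> > > > >     // up to it.
> > > > >     static List<String> nextMetadataRequest(Collection<String> urgent,
> > > > >             Collection<String> nonUrgent, int targetSize) {
> > > > >         List<String> request = new ArrayList<>(urgent);
> > > > >         for (String topic : nonUrgent) {
> > > > >             if (request.size() >= targetSize)
> > > > >                 break;
> > > > >             request.add(topic);
> > > > >         }
> > > > >         return request;
> > > > >     }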
> > > > >
> > > > > Thanks,
> > > > >  Brian
> > > > >
> > > > >
> > > > > > On Wed, Nov 20, 2019 at 7:00 PM deng ziming <dengziming1...@gmail.com> wrote:
> > > > > >
> > > > > > > I think it's ok, and you can add another issue about
> > > > > > > `asynchronous metadata` if `topic expiry` is not enough.
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Nov 21, 2019 at 6:20 AM Brian Byrne <bby...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > > I've refactored the KIP to drop the asynchronous metadata
> > > > > > > > fetching in the producer during send(). It's now exclusively
> > > > > > > > focused on reducing the topic metadata fetch payload, and it
> > > > > > > > proposes adding a new configuration flag to control topic
> > > > > > > > expiry behavior. Please take a look when possible.
> > > > > > > >
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-526%3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Brian
> > > > > > > >
> > > > > > > > On Fri, Oct 4, 2019 at 10:04 AM Brian Byrne <bby...@confluent.io> wrote:
> > > > > > > >
> > > > > > > > > Lucas, Guozhang,
> > > > > > > > >
> > > > > > > > > Thank you for the comments. Good point on
> > > > > > > > > METADATA_MAX_AGE_CONFIG - it looks like the
> > > > > > > > > ProducerMetadata was differentiating between expiry and
> > > > > > > > > refresh, but it should be unnecessary to do so once the
> > > > > > > > > cost of fetching a single topic's metadata is reduced.
> > > > > > > > >
> > > > > > > > > I've updated the rejected alternatives and removed the
> > > > > > > > > config variables.
> > > > > > > > >
> > > > > > > > > Brian
> > > > > > > > >
> > > > > > > > > On Fri, Oct 4, 2019 at 9:20 AM Guozhang Wang <wangg...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >> Hello Brian,
> > > > > > > > >>
> > > > > > > > >> Thanks for the KIP.
> > > > > > > > >>
> > > > > > > > >> I think asynchronous metadata updates address 1),
> > > > > > > > >> metadata updates blocking send; but for the other issues,
> > > > > > > > >> the producer currently does have a configurable
> > > > > > > > >> `METADATA_MAX_AGE_CONFIG` similar to the consumer's,
> > > > > > > > >> which defaults to 5min. So maybe we do not need to
> > > > > > > > >> introduce new configs here, but only change the semantics
> > > > > > > > >> of that config from global expiry (today we just enforce
> > > > > > > > >> a full metadata update for the whole cluster) to
> > > > > > > > >> single-topic expiry, and we can also extend its expiry
> > > > > > > > >> deadline whenever that metadata is successfully used to
> > > > > > > > >> send a produce request.
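> > > > > > > > >>
> > > > > > > > >> In sketch form (illustrative only; expiryDeadlines is a
> > > > > > > > >> hypothetical Map<String, Long> keyed by topic):
> > > > > > > > >>
> > > > > > > > >>     // Extend the topic's expiry deadline on each
> > > > > > > > >>     // successful produce that uses its metadata.
> > > > > > > > >>     void onSuccessfulSend(String topic, long nowMs, long maxAgeMs) {
> > > > > > > > >>         expiryDeadlines.put(topic, nowMs + maxAgeMs);
> > > > > > > > >>     }
> > > > > > > > >>
> > > > > > > > >>     // A topic then expires individually rather than via
> > > > > > > > >>     // a full-cluster refresh.
> > > > > > > > >>     boolean isExpired(String topic, long nowMs) {
> > > > > > > > >>         return nowMs >= expiryDeadlines.getOrDefault(topic, 0L);
> > > > > > > > >>     }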
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> Guozhang
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> On Thu, Oct 3, 2019 at 6:51 PM Lucas Bradstreet <lu...@confluent.io> wrote:
> > > > > > > > >>
> > > > > > > > >> > Hi Brian,
> > > > > > > > >> >
> > > > > > > > >> > This looks great, and should help reduce blocking and
> > > > > > > > >> > high metadata request volumes when the producer is
> > > > > > > > >> > sending to large numbers of topics, especially at low
> > > > > > > > >> > volumes. I think the approach of making metadata
> > > > > > > > >> > fetching asynchronous and batching metadata requests
> > > > > > > > >> > together will help significantly.
> > > > > > > > >> >
> > > > > > > > >> > The only other approach I can think of is to allow
> > > > > > > > >> > users to supply the producer with the expected topics
> > > > > > > > >> > upfront, allowing the producer to perform a single
> > > > > > > > >> > initial metadata request before any sends occur. I see
> > > > > > > > >> > no real advantages to this approach compared to the
> > > > > > > > >> > async method you've proposed, but maybe we could add it
> > > > > > > > >> > to the rejected alternatives section?
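> > > > > > > > >> >
> > > > > > > > >> > Hypothetically, something like this (preloadMetadata is
> > > > > > > > >> > a made-up name, just to illustrate the alternative):
> > > > > > > > >> >
> > > > > > > > >> >     // One upfront metadata fetch for all expected
> > > > > > > > >> >     // topics, before the first send() is issued.
> > > > > > > > >> >     producer.preloadMetadata(Arrays.asList("topic-a", "topic-b"));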
> > > > > > > > >> >
> > > > > > > > >> > Thanks,
> > > > > > > > >> >
> > > > > > > > >> > Lucas
> > > > > > > > >> >
> > > > > > > > >> > On Fri, 20 Sep 2019 at 11:46, Brian Byrne <bby...@confluent.io> wrote:
> > > > > > > > >> >
> > > > > > > > >> > > I've updated the 'Proposed Changes' to include two
> > > > > > > > >> > > new producer configuration variables: topic.expiry.ms
> > > > > > > > >> > > and topic.refresh.ms. Please take a look.
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks,
> > > > > > > > >> > > Brian
> > > > > > > > >> > >
> > > > > > > > >> > > On Tue, Sep 17, 2019 at 12:59 PM Brian Byrne <bby...@confluent.io> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Dev team,
> > > > > > > > >> > > >
> > > > > > > > >> > > > Requesting discussion for an improvement to the
> > > > > > > > >> > > > producer when dealing with a large number of
> > > > > > > > >> > > > topics.
> > > > > > > > >> > > >
> > > > > > > > >> > > > KIP: https://cwiki.apache.org/confluence/display/KAFKA/KIP-526%3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics
> > > > > > > > >> > > >
> > > > > > > > >> > > > JIRA: https://issues.apache.org/jira/browse/KAFKA-8904
> > > > > > > > >> > > >
> > > > > > > > >> > > > Thoughts and feedback would be appreciated.
> > > > > > > > >> > > >
> > > > > > > > >> > > > Thanks,
> > > > > > > > >> > > > Brian
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> --
> > > > > > > > >> -- Guozhang
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>


-- 
Best,
Stanislav
