Hi Brian,

Thanks for the KIP.

Starting the metadata fetch before we need the result is definitely a great 
idea.  This way, the metadata fetch can be done in parallel with all the other 
work the producer is doing, rather than forcing the producer to come to a halt 
periodically while metadata is fetched.

Maybe I missed it, but there seemed to be some details missing here.  When do 
we start the metadata fetch?  For example, if topic metadata expires every 5 
minutes, perhaps we should wait 4 minutes, then start fetching the new 
metadata, which we would expect to arrive by the 5 minute deadline.  Or perhaps 
we should start the fetch even earlier, around the 2.5 minute mark.  In any 
case, there should be some discussion about what the actual policy is.  Given 
that metadata.max.age.ms is configurable, maybe that policy needs to be 
expressed in terms of a percentage of the refresh period rather than in terms 
of an absolute delay.
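
To make that concrete, here's a rough sketch of what a fraction-based policy 
might look like.  The class and field names are made up for illustration, and 
the fraction (0.8 here, i.e. 4 of 5 minutes) would presumably be tunable:

```java
// Illustrative sketch only -- not actual producer code.  Begin an
// asynchronous metadata refresh once a configurable fraction of
// metadata.max.age.ms has elapsed, rather than at a fixed absolute delay.
public class MetadataRefreshPolicy {
    private final long maxAgeMs;          // metadata.max.age.ms
    private final double refreshFraction; // e.g. 0.8 => start at the 4-minute mark of 5

    public MetadataRefreshPolicy(long maxAgeMs, double refreshFraction) {
        this.maxAgeMs = maxAgeMs;
        this.refreshFraction = refreshFraction;
    }

    /** True once we should kick off a background metadata fetch. */
    public boolean shouldStartRefresh(long lastRefreshMs, long nowMs) {
        return nowMs - lastRefreshMs >= (long) (maxAgeMs * refreshFraction);
    }
}
```

With a 5-minute max age and a 0.8 fraction, the background fetch starts at the 
4-minute mark, leaving a full minute for it to complete before the metadata 
actually expires.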

The KIP correctly points out that the current metadata fetching policy causes 
us to "[block] in a function that's advertised as asynchronous."  However, the 
KIP doesn't seem to spell out whether we will continue to block if metadata 
can't be found, or if this will be abolished.  Clearly, starting the metadata 
fetch early will reduce blocking in the common case, but will there still be 
blocking in the uncommon case where the early fetch doesn't succeed in time?
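
If blocking is retained as the fallback, I'd assume it stays bounded by 
max.block.ms, something along these lines (sketch only; the helper name and 
the use of CompletableFuture are mine, not the producer's actual internals):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedMetadataWait {
    /**
     * Fallback for the uncommon case: the early fetch hasn't finished, so
     * wait for the in-flight fetch, but only up to maxBlockMs (max.block.ms).
     */
    public static <T> T awaitMetadata(CompletableFuture<T> fetch, long maxBlockMs)
            throws TimeoutException {
        try {
            return fetch.get(maxBlockMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("metadata fetch failed", e);
        }
    }
}
```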

 > To address (2), the producer currently maintains an expiry threshold for 
 > every topic, which is used to remove a topic from the working set at a 
 > future time (currently hard-coded to 5 minutes, this should be modified to 
 > use metadata.max.age.ms). While this does work to reduce the size of the 
 > topic working set, the producer will continue fetching metadata for these 
 > topics in every metadata request for the full expiry duration. This logic 
 > can be made more intelligent by managing the expiry from when the topic 
 > was last used, enabling the expiry duration to be reduced to improve cases 
 > where a large number of topics are touched intermittently.

Can you clarify this part a bit?  It seems like we have a metadata expiration 
policy now for topics, and we will have one after this KIP, but they will be 
somewhat different.  But it's not clear to me what the differences are.

In general, if load is a problem, we should probably consider adding some kind 
of jitter on the client side.  There are definitely cases where people start up 
a lot of clients at the same time in parallel and there is a thundering herd 
problem with metadata updates.  Adding jitter would spread the load across time 
rather than creating a spike every 5 minutes in this case.
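
Something as simple as randomizing each interval within a band around 
metadata.max.age.ms would do it.  A sketch (the +/-20% band is an arbitrary 
choice for illustration):

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitteredInterval {
    /**
     * Illustrative client-side jitter: pick each refresh delay uniformly
     * within +/- jitterFraction of maxAgeMs, so many clients started at the
     * same time don't all refresh metadata at the same instant.
     */
    public static long nextRefreshDelayMs(long maxAgeMs, double jitterFraction) {
        double factor = ThreadLocalRandom.current()
                .nextDouble(1.0 - jitterFraction, 1.0 + jitterFraction);
        return (long) (maxAgeMs * factor);
    }
}
```

With maxAgeMs = 300,000 and jitterFraction = 0.2, each client refreshes 
somewhere between 4 and 6 minutes, spreading the load rather than spiking it.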

best,
Colin


On Fri, Nov 8, 2019, at 08:59, Ismael Juma wrote:
> I think this KIP affects when we block which is actually user visible
> behavior. Right?
> 
> Ismael
> 
> On Fri, Nov 8, 2019, 8:50 AM Brian Byrne <bby...@confluent.io> wrote:
> 
> > Hi Guozhang,
> >
> > Regarding metadata expiry, no access times other than the initial lookup[1]
> > are used for determining when to expire producer metadata. Therefore,
> > frequently used topics' metadata will be aged out and subsequently
> > refreshed (in a blocking manner) every five minutes, and infrequently used
> > topics will be retained for a minimum of five minutes and currently
> > refetched on every metadata update during that time period. The sentence is
> > suggesting that we could reduce the expiry time to improve the latter
> > without affecting (rather slightly improving) the former.
> >
> > Keep in mind that in most all cases, I wouldn't anticipate much of a
> > difference with producer behavior, and the extra logic can be implemented
> > to have insignificant cost. It's the large/dynamic topic corner cases that
> > we're trying to improve.
> >
> > It'd be convenient if the KIP is no longer necessary. You're right in that
> > there's no public API changes and the behavioral changes are entirely
> > internal. I'd be happy to continue the discussion around the KIP, but
> > unless otherwise objected, it can be retired.
> >
> > [1] Not entirely accurate, it's actually the first time when the client
> > calculates whether to retain the topic in its metadata.
> >
> > Thanks,
> > Brian
> >
> > On Thu, Nov 7, 2019 at 4:48 PM Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > Hello Brian,
> > >
> > > Could you elaborate a bit more on this sentence: "This logic can be made
> > > more intelligent by managing the expiry from when the topic was last
> > used,
> > > enabling the expiry duration to be reduced to improve cases where a large
> > > number of topics are touched intermittently." Not sure I fully understand
> > > the proposal.
> > >
> > > Also since now this KIP did not make any public API changes and the
> > > behavioral changes are not considered a public API contract (i.e. how we
> > > maintain the topic metadata in producer cache is never committed to
> > users),
> > > I wonder if we still need a KIP for the proposed change any more?
> > >
> > >
> > > Guozhang
> > >
> > > On Thu, Nov 7, 2019 at 12:43 PM Brian Byrne <bby...@confluent.io> wrote:
> > >
> > > > Hello all,
> > > >
> > > > I'd like to propose a vote for a producer change to improve producer
> > > > behavior when dealing with a large number of topics, in part by
> > reducing
> > > > the amount of metadata fetching performed.
> > > >
> > > > The full KIP is provided here:
> > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-526%3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics
> > > >
> > > > And the discussion thread:
> > > >
> > > >
> > >
> > https://lists.apache.org/thread.html/b2f8f830ef04587144cf0840c7d4811bbf0a14f3c459723dbc5acf9e@%3Cdev.kafka.apache.org%3E
> > > >
> > > > Thanks,
> > > > Brian
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
