Justine,

I am new here, so please excuse my ignorance.

When you talk about "seen" producers, I assume you mean the PIDs that the
Bloom filter has seen.
When you say "producer produces every 2 hours", do you mean the producer
writes to a topic every 2 hours and uses the same PID?
When you say "hitting the limit", which limit is being reached?

Given the default setup, a producer that produces with a PID every 2 hours,
regardless of whether or not it is a new PID, will be reported as a new PID
being seen.  But I would expect the throttling system to accept that as a
new PID for the producer, look at the frequency of PIDs, and accept it
without throttling.

If the actual question is "how many PIDs did this Principal produce in the
last hour?" or "has this Principal produced more than X PIDs in the last
hour?", there are probably cleaner ways to do this.  If this is the
question, I would use CPC from Apache DataSketches [1] and keep multiple
CPC sketches (say one every 15 minutes -- to match the KIP-936 proposal)
for each Principal.  You could then do a quick check on the current CPC to
see if it exceeds hour-limit / 4 and, if so, check the hour rate (by
merging the 4 15-minute CPCs).  Then the code could simply notify when to
throttle and when to stop throttling.
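To make the idea concrete, here is a rough sketch of that windowed check in
Python. The limit and bucket count are hypothetical, and plain Python sets
stand in for CPC sketches (set.add ~ CpcSketch.update, len ~ getEstimate,
set union ~ a CpcUnion merge) -- a real implementation would use the
DataSketches library so the per-bucket state stays small:

```python
# Illustration only: sets stand in for CPC sketches, and HOUR_LIMIT /
# BUCKETS_PER_HOUR are hypothetical values, not anything from KIP-936.
from collections import defaultdict, deque

HOUR_LIMIT = 100       # hypothetical max distinct PIDs per principal per hour
BUCKETS_PER_HOUR = 4   # one bucket per 15-minute window

class PidRateTracker:
    def __init__(self):
        # principal -> deque of up to 4 "sketches" (sets), newest last;
        # maxlen makes append() drop the oldest 15-minute bucket for free.
        self.windows = defaultdict(
            lambda: deque([set()], maxlen=BUCKETS_PER_HOUR))

    def roll(self, principal):
        """Called every 15 minutes: start a fresh bucket for the principal."""
        self.windows[principal].append(set())

    def record(self, principal, pid):
        """Record a PID; return True if the principal should be throttled."""
        buckets = self.windows[principal]
        buckets[-1].add(pid)
        # Cheap check first: only the current bucket against limit / 4.
        if len(buckets[-1]) <= HOUR_LIMIT // BUCKETS_PER_HOUR:
            return False
        # Only then pay for the full-hour estimate by merging all buckets
        # (a union, so a PID seen in two windows is counted once).
        hour_estimate = len(set().union(*buckets))
        return hour_estimate > HOUR_LIMIT
```

For example, a principal producing its 26th distinct PID in one window
trips the cheap check but still passes the hour check, while the 101st
distinct PID in the hour would trigger throttling.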

Claude


[1] https://datasketches.apache.org/docs/CPC/CpcPerformance.html

On Fri, May 3, 2024 at 4:21 PM Justine Olshan <jols...@confluent.io.invalid>
wrote:

> Hey folks,
>
> I shared this with Omnia offline:
> One concern I have is with the length of time we keep "seen" producer IDs.
> It seems like the default is 1 hour. If a producer produces every 2 hours
> or so, and we are hitting the limit, it seems like we will throttle it even
> though we've seen it before and have state for it on the server. Then, it
> seems like we will have to wait for the natural expiration of producer ids
> (via producer.id.expiration.ms) before we allow new or idle producers to
> join again without throttling. I think this proposal is a step in the right
> direction when it comes to throttling the "right" clients, but I want to
> make sure we have reasonable defaults. Keep in mind that idempotent
> producers are the default, so most folks won't be tuning these values out
> of the box.
>
> As for Igor's questions about InitProducerId -- I think the main reason we
> have avoided that solution is that there is no state stored for idempotent
> producers when grabbing an ID. My concern there is either storing too much
> state to track this or throttling before we need to.
>
> Justine
>
> On Thu, May 2, 2024 at 2:36 PM Claude Warren, Jr
> <claude.war...@aiven.io.invalid> wrote:
>
> > There is some question about whether or not we need the configuration
> > options.  My take on them is as follows:
> >
> > producer.id.quota.window.num: No opinion.  I don't know what this is
> > used for, but I suspect that there is a good reason to have it.  It is
> > not used within the Bloom filter caching mechanism.
> >
> > producer.id.quota.window.size.seconds: Leave it, as it is one of the
> > most effective ways to tune the filter and determines how long a PID is
> > recognized.
> >
> > producer.id.quota.cache.cleanup.scheduler.interval.ms: Remove it unless
> > there is another use for it.  We can get a better calculation for
> > internals.
> >
> > producer.id.quota.cache.layer.count: Leave it, as it is one of the most
> > effective ways to tune the filter.
> >
> > producer.id.quota.cache.false.positive.rate: Replace it with a
> > constant; I don't think any other Bloom filter solution provides access
> > to this knob for end users.
> >
> > producer_ids_rate: Leave this one, as it is critical for reasonable
> > operation.
> >
>


-- 
LinkedIn: http://www.linkedin.com/in/claudewarren
