Re: KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

Tom Bentley Wed, 03 Jun 2020 09:41:23 -0700

Hi David,

One quick question about the implementation (I don't think it's spelled out
in the KIP): Presumably if you make, for example, a create topics request
with validate only it will check for quota violation, but not count towards
quota violation, right?


Many thanks,

Tom

On Wed, Jun 3, 2020 at 3:51 PM Rajini Sivaram <rajinisiva...@gmail.com>
wrote:

> Hi David,
>
> 2) sorry, that was my mistake.
>
> Regards,
>
> Rajini
>
>
> On Wed, Jun 3, 2020 at 3:08 PM David Jacot <dja...@confluent.io> wrote:
>
> > Hi Rajini,
> >
> > Thanks for your prompt response.
> > 1) Good catch, fixed.
> > 2) The retry mechanism will be in the client so a new field is not
> > required in the requests.
> >
> > Regards,
> > David
> >
> > On Wed, Jun 3, 2020 at 2:43 PM Rajini Sivaram <rajinisiva...@gmail.com>
> > wrote:
> >
> > > Hi David,
> > >
> > > Thanks for the updates, looks good. Just a couple of minor comments:
> > > 1) There is a typo in "*The channel will be mutated as well when
> > > `throttle_time_ms > 0`." * Should be *muted*?
> > > 2) Since the three requests will need a new field for `
> > > *retryQuotaViolatedException*`, we should perhaps add that change to
> the
> > > KIP.
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> > > On Wed, Jun 3, 2020 at 1:02 PM David Jacot <dja...@confluent.io>
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I have updated the KIP based on our recent discussions. I have mainly
> > > > changed the
> > > > following points:
> > > > * I have renamed the quota as suggested by Jun.
> > > > * I have changed the metrics to be "token bucket" agnostic. The idea
> is
> > > to
> > > > report the
> > > > burst and the rate per principal/clientid.
> > > > * I have removed the `retry.quota.violation.exception` configuration
> > and
> > > > replaced it
> > > > with options in the respective methods' options.
> > > >
> > > > Now, the public interfaces are not tight to the algorithm that we use
> > > > internally to throttle
> > > > the requests but keep the notion of burst. I hope that this will help
> > to
> > > > alleviate the
> > > > tokens bucket vs rate discussions.
> > > >
> > > > Please, have a look and let me know your thoughts.
> > > >
> > > > Bests,
> > > > David
> > > >
> > > >
> > > > On Wed, Jun 3, 2020 at 10:17 AM David Jacot <dja...@confluent.io>
> > wrote:
> > > >
> > > > > Hi Rajini,
> > > > >
> > > > > Thanks for your feedback. Please find my answers below:
> > > > >
> > > > > 1) Our main goal is to protect the controller from the extreme
> users
> > > > > (DDoS). We want
> > > > > to protect it from large requests or repetitive requests coming
> from
> > a
> > > > > single user.
> > > > > That user could be used by multiple apps as you pointed out which
> > makes
> > > > it
> > > > > even
> > > > > worst. For instance, a buggy application could continuously create
> > and
> > > > > delete
> > > > > topics in a tight loop.
> > > > >
> > > > > The idea of the burst is to still allow creating or deleting topics
> > in
> > > > > batch because
> > > > > this is how applications tend to do it. However, we want to keep
> the
> > > > batch
> > > > > size
> > > > > under control with the burst. The burst does not allow requests of
> > any
> > > > > size. Topics
> > > > > are accepted until the burst is passed. All the others are
> rejected.
> > > > > Ideally, regular
> > > > > and well behaving applications should never or rarely be throttled.
> > > > >
> > > > > 2) No, there is no explicit bound on the maximum throttle time.
> > Having
> > > a
> > > > > maximum
> > > > > is straightforward here because the throttling time depends on the
> > > actual
> > > > > size of
> > > > > the request.
> > > > >
> > > > > 3) That's a very good question that I haven't thought of. I was
> > > thinking
> > > > > about doing
> > > > > it for the new quota only. I think that your suggestion of having a
> > per
> > > > > method argument
> > > > > makes sense. I will update the KIP.
> > > > >
> > > > > 4) Indeed, it is outdated text. Let me update that.
> > > > >
> > > > > Regards,
> > > > > David
> > > > >
> > > > > On Wed, Jun 3, 2020 at 12:01 AM Rajini Sivaram <
> > > rajinisiva...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Hi David,
> > > > >>
> > > > >> Thanks for the KIP. A few questions below:
> > > > >>
> > > > >> 1) The KIP says: *`Typically, applications tend to send one
> request
> > to
> > > > >> create all the topics that they need`*. What would the point of
> > > > throttling
> > > > >> be in this case? If there was a user quota for the principal used
> by
> > > > that
> > > > >> application, wouldn't we just allow the request due to the burst
> > > value?
> > > > Is
> > > > >> the KIP specifically for the case where multiple apps with the
> same
> > > user
> > > > >> principal are started all at once?
> > > > >>
> > > > >> 2) Will there be a bound on the maximum throttle time?
> > > > >>
> > > > >> 3) Will *retry.quota.violation.exception* have any impact on
> request
> > > > quota
> > > > >> throttling or is it limited to the three requests where the new
> > quota
> > > is
> > > > >> applied? If it is only for the new quota, perhaps it would be
> better
> > > to
> > > > >> add
> > > > >> an option to the relevant requests rather than use an admin client
> > > > config.
> > > > >>
> > > > >> 4) Is this outdated text, wasn't sure what this refers to : *While
> > the
> > > > >> configuration could be set by broker, we envision to have it set
> > only
> > > > has
> > > > >> a
> > > > >> cluster wide default for two reasons.*
> > > > >>
> > > > >> Regards,
> > > > >>
> > > > >> Rajini
> > > > >>
> > > > >>
> > > > >> On Tue, Jun 2, 2020 at 2:55 PM David Jacot <dja...@confluent.io>
> > > wrote:
> > > > >>
> > > > >> > Hi Jun,
> > > > >> >
> > > > >> > Thanks for your reply.
> > > > >> >
> > > > >> > 10. I think that both options are likely equivalent from an
> > accuracy
> > > > >> point
> > > > >> > of
> > > > >> > view. If we put the implementation aside, conceptually, I am not
> > > > >> convinced
> > > > >> > by the used based approach because resources don't have a clear
> > > owner
> > > > >> > in AK at the moment. A topic can be created by (Principal A, no
> > > client
> > > > >> id),
> > > > >> > then can be extended by (no principal, Client B), and finally
> > > deleted
> > > > by
> > > > >> > (Principal C, Client C). This does not sound right to me and I
> > fear
> > > > >> that it
> > > > >> > is not going to be easy to grasp for our users.
> > > > >> >
> > > > >> > Regarding the naming, I do agree that we can make it more future
> > > > proof.
> > > > >> > I propose `controller_mutations_rate`. I think that
> > differentiating
> > > > the
> > > > >> > mutations
> > > > >> > from the reads is still a good thing for the future.
> > > > >> >
> > > > >> > 11. I am not convinced by your proposal for the following
> reasons:
> > > > >> >
> > > > >> > First, in my toy example, I used 101 windows and 7 * 80
> requests.
> > We
> > > > >> could
> > > > >> > effectively allocate 5 * 100 requests to the previous windows
> > > assuming
> > > > >> that
> > > > >> > they are empty. What shall we do with the remaining 60 requests?
> > > Shall
> > > > >> we
> > > > >> > allocate them to the current window or shall we divide it among
> > all
> > > > the
> > > > >> > windows?
> > > > >> >
> > > > >> > Second, I don't think that we can safely change the behavior of
> > all
> > > > the
> > > > >> > existing
> > > > >> > rates used because it actually changes the computation of the
> rate
> > > as
> > > > >> > values
> > > > >> > allocated to past windows would expire before they would today.
> > > > >> >
> > > > >> > Overall, while trying to fit in the current rate, we are going
> to
> > > > build
> > > > >> a
> > > > >> > slightly
> > > > >> > different version of the rate which will be even more confusing
> > for
> > > > >> users.
> > > > >> >
> > > > >> > Instead, I think that we should embrace the notion of burst as
> it
> > > > could
> > > > >> > also
> > > > >> > be applied to other quotas in the future. Users don't have to
> know
> > > > that
> > > > >> we
> > > > >> > use the Token Bucket or a special rate inside at the end of the
> > day.
> > > > It
> > > > >> is
> > > > >> > an
> > > > >> > implementation detail.
> > > > >> >
> > > > >> > Users would be able to define:
> > > > >> > - a rate R; and
> > > > >> > - a maximum burst B.
> > > > >> >
> > > > >> > If we change the metrics to be as follow:
> > > > >> > - the actual rate;
> > > > >> > - the burst balance in %, 0 meaning that the user is throttled;
> > > > >> > It remains disattach from the algorithm.
> > > > >> >
> > > > >> > I personally prefer this over having to define a rate and a
> number
> > > of
> > > > >> > windows
> > > > >> > while having to understand that the number of windows implicitly
> > > > defines
> > > > >> > the
> > > > >> > allowed burst. I think that it is clearer and easier to grasp.
> > Don't
> > > > >> you?
> > > > >> >
> > > > >> > Best,
> > > > >> > David
> > > > >> >
> > > > >> > On Fri, May 29, 2020 at 6:38 PM Jun Rao <j...@confluent.io>
> wrote:
> > > > >> >
> > > > >> > > Hi, David, Anna,
> > > > >> > >
> > > > >> > > Thanks for the response. Sorry for the late reply.
> > > > >> > >
> > > > >> > > 10. Regarding exposing rate or usage as quota. Your argument
> is
> > > that
> > > > >> > usage
> > > > >> > > is not very accurate anyway and is harder to implement. So,
> > let's
> > > > just
> > > > >> > be a
> > > > >> > > bit loose and expose rate. I am sort of neutral on that. (1)
> It
> > > > seems
> > > > >> to
> > > > >> > me
> > > > >> > > that overall usage will be relatively more accurate than rate.
> > All
> > > > the
> > > > >> > > issues that Anna brought up seem to exist for rate too. Rate
> has
> > > the
> > > > >> > > additional problem that the cost of each request may not be
> > > uniform.
> > > > >> (2)
> > > > >> > In
> > > > >> > > terms of implementation, a usage based approach requires
> > tracking
> > > > the
> > > > >> > user
> > > > >> > > info through the life cycle of an operation. However, as you
> > > > >> mentioned,
> > > > >> > > things like topic creation can generate additional
> > > > >> > > LeaderAndIsr/UpdateMetadata requests. Longer term, we probably
> > > want
> > > > to
> > > > >> > > associate those cost to the user who initiated the operation.
> If
> > > we
> > > > do
> > > > >> > > that, we sort of need to track the user for the full life
> cycle
> > of
> > > > the
> > > > >> > > processing of an operation anyway. (3) If you prefer rate
> > > strongly,
> > > > I
> > > > >> > don't
> > > > >> > > have strong objections. However, I do feel that the new quota
> > name
> > > > >> should
> > > > >> > > be able to cover all controller related cost longer term. This
> > KIP
> > > > >> > > currently only covers topic creation/deletion. It would not be
> > > ideal
> > > > >> if
> > > > >> > in
> > > > >> > > the future, we have to add yet another type of quota for some
> > > other
> > > > >> > > controller related operations. The quota name in the KIP has
> > > > partition
> > > > >> > > mutation. In the future, if we allow things like topic
> renaming,
> > > it
> > > > >> may
> > > > >> > not
> > > > >> > > be related to partition mutation directly and it would be
> > trickier
> > > > to
> > > > >> fit
> > > > >> > > those operations in the current quota. So, maybe sth more
> > general
> > > > like
> > > > >> > > controller_operations_quota will be more future proof.
> > > > >> > >
> > > > >> > > 11. Regarding the difference between the token bucket
> algorithm
> > > and
> > > > >> our
> > > > >> > > current quota mechanism. That's a good observation. It seems
> > that
> > > we
> > > > >> can
> > > > >> > > make a slight change to our current quota mechanism to
> achieve a
> > > > >> similar
> > > > >> > > thing. As you said, the issue was that we allocate all 7 * 80
> > > > >> requests in
> > > > >> > > the last 1 sec window in our current mechanism. This is a bit
> > > > >> unintuitive
> > > > >> > > since each sec only has a quota capacity of 5. An alternative
> > way
> > > is
> > > > >> to
> > > > >> > > allocate the 7 * 80 requests to all previous windows, each up
> to
> > > the
> > > > >> > > remaining quota capacity left. So, you will fill the current 1
> > sec
> > > > >> window
> > > > >> > > and all previous ones, each with 5. Then, it seems this will
> > give
> > > > the
> > > > >> > same
> > > > >> > > behavior as token bucket? The reason that I keep asking this
> is
> > > that
> > > > >> from
> > > > >> > > an operational perspective, it's simpler if all types of
> quotas
> > > work
> > > > >> in
> > > > >> > the
> > > > >> > > same way.
> > > > >> > >
> > > > >> > > Jun
> > > > >> > >
> > > > >> > > On Tue, May 26, 2020 at 9:52 AM David Jacot <
> > dja...@confluent.io>
> > > > >> wrote:
> > > > >> > >
> > > > >> > > > Hi folks,
> > > > >> > > >
> > > > >> > > > I have updated the KIP. As mentioned by Jun, I have made the
> > > > >> > > > quota per principal/clientid similarly to the other quotas.
> I
> > > have
> > > > >> > > > also explained how this will work in conjunction with the
> auto
> > > > >> > > > topics creation.
> > > > >> > > >
> > > > >> > > > Please, take a look at it. I plan to call a vote for it in
> the
> > > > next
> > > > >> few
> > > > >> > > > days if there are no comments in this thread.
> > > > >> > > >
> > > > >> > > > Best,
> > > > >> > > > David
> > > > >> > > >
> > > > >> > > > On Wed, May 13, 2020 at 10:57 AM Tom Bentley <
> > > tbent...@redhat.com
> > > > >
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > > Hi David,
> > > > >> > > > >
> > > > >> > > > > Thanks for the explanation and confirmation that evolving
> > the
> > > > >> APIs is
> > > > >> > > not
> > > > >> > > > > off the table in the longer term.
> > > > >> > > > >
> > > > >> > > > > Kind regards,
> > > > >> > > > >
> > > > >> > > > > Tom
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

Reply via email to