Re: [KIP-DISCUSSION] KIP-13 Quotas

Jay Kreps Thu, 19 Mar 2015 20:10:58 -0700

Yeah I was saying was that we are blocked on picking an approach for
metrics but not necessarily the full conversion. Clearly if we pick the new
metrics package we would need to implement the two metrics we want to quota
on. But the conversion of the remaining metrics can be done asynchronously.


-Jay

On Thu, Mar 19, 2015 at 5:56 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

> > in KAFKA-1930). I agree that this KIP doesn't need to block on the
> > migration of the metrics package.
>
> Can you clarify the above? i.e., if we are going to quota on something
> then we would want to have migrated that metric over right? Or do you
> mean we don't need to complete the migration of all metrics to the
> metrics package right?
>
> I think most of us now feel that the delay + no error is a good
> approach, but it would be good to make sure everyone is on the same
> page.
>
> As Aditya requested a couple of days ago I think we should go over
> this at the next KIP hangout.
>
> Joel
>
> On Thu, Mar 19, 2015 at 09:24:09AM -0700, Jun Rao wrote:
> > 1. Delay + no error seems reasonable to me. However, I do feel that we
> need
> > to give the client an indicator that it's being throttled, instead of
> doing
> > this silently. For that, we probably need to evolve the produce/fetch
> > protocol to include an extra status field in the response. We probably
> need
> > to think more about whether we just want to return a simple status code
> > (e.g., 1 = throttled) or a value that indicates how much is being
> throttled.
> >
> > 2. We probably need to improve the histogram support in the new metrics
> > package before we can use it more widely on the server side (left a
> comment
> > in KAFKA-1930). I agree that this KIP doesn't need to block on the
> > migration of the metrics package.
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Mar 18, 2015 at 4:02 PM, Aditya Auradkar <
> > aaurad...@linkedin.com.invalid> wrote:
> >
> > > Hey everyone,
> > >
> > > Thanks for the great discussion. There are currently a few points on
> this
> > > KIP that need addressing and I want to make sure we are on the same
> page
> > > about those.
> > >
> > > 1. Append and delay response vs delay and return error
> > > - I think we've discussed the pros and cons of each approach but
> haven't
> > > chosen an approach yet. Where does everyone stand on this issue?
> > >
> > > 2. Metrics Migration and usage in quotas
> > > - The metrics library in clients has a notion of quotas that we should
> > > reuse. For that to happen, we need to migrate the server to the new
> metrics
> > > package.
> > > - Need more clarification on how to compute throttling time and
> windowing
> > > for quotas.
> > >
> > > I'm going to start a new KIP to discuss metrics migration separately.
> That
> > > will also contain a section on quotas.
> > >
> > > 3. Dynamic Configuration management - Being discussed in KIP-5.
> Basically
> > > we need something that will model default quotas and allow per-client
> > > overrides.
> > >
> > > Is there something else that I'm missing?
> > >
> > > Thanks,
> > > Aditya
> > > ________________________________________
> > > From: Jay Kreps [jay.kr...@gmail.com]
> > > Sent: Wednesday, March 18, 2015 2:10 PM
> > > To: dev@kafka.apache.org
> > > Subject: Re: [KIP-DISCUSSION] KIP-13 Quotas
> > >
> > > Hey Steven,
> > >
> > > The current proposal is actually to enforce quotas at the
> > > client/application level, NOT the topic level. So if you have a service
> > > with a few dozen instances the quota is against all of those instances
> > > added up across all their topics. So actually the effect would be the
> same
> > > either way but throttling gives the producer the choice of either
> blocking
> > > or dropping.
> > >
> > > -Jay
> > >
> > > On Tue, Mar 17, 2015 at 10:08 AM, Steven Wu <stevenz...@gmail.com>
> wrote:
> > >
> > > > Jay,
> > > >
> > > > let's say an app produces to 10 different topics. one of the topic is
> > > sent
> > > > from a library. due to whatever condition/bug, this lib starts to
> send
> > > > messages over the quota. if we go with the delayed response
> approach, it
> > > > will cause the whole shared RecordAccumulator buffer to be filled up.
> > > that
> > > > will penalize other 9 topics who are within the quota. that is the
> > > > unfairness point that Ewen and I were trying to make.
> > > >
> > > > if broker just drop the msg and return an error/status code
> indicates the
> > > > drop and why. then producer can just move on and accept the drop.
> shared
> > > > buffer won't be saturated and other 9 topics won't be penalized.
> > > >
> > > > Thanks,
> > > > Steven
> > > >
> > > >
> > > >
> > > > On Tue, Mar 17, 2015 at 9:44 AM, Jay Kreps <jay.kr...@gmail.com>
> wrote:
> > > >
> > > > > Hey Steven,
> > > > >
> > > > > It is true that hitting the quota will cause back-pressure on the
> > > > producer.
> > > > > But the solution is simple, a producer that wants to avoid this
> should
> > > > stay
> > > > > under its quota. In other words this is a contract between the
> cluster
> > > > and
> > > > > the client, with each side having something to uphold. Quite
> possibly
> > > the
> > > > > same thing will happen in the absence of a quota, a client that
> > > produces
> > > > an
> > > > > unexpected amount of load will hit the limits of the server and
> > > > experience
> > > > > backpressure. Quotas just allow you to set that same limit at
> something
> > > > > lower than 100% of all resources on the server, which is useful
> for a
> > > > > shared cluster.
> > > > >
> > > > > -Jay
> > > > >
> > > > > On Mon, Mar 16, 2015 at 11:34 PM, Steven Wu <stevenz...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > wait. we create one kafka producer for each cluster. each
> cluster can
> > > > > have
> > > > > > many topics. if producer buffer got filled up due to delayed
> response
> > > > for
> > > > > > one throttled topic, won't that penalize other topics unfairly?
> it
> > > > seems
> > > > > to
> > > > > > me that broker should just return error without delay.
> > > > > >
> > > > > > sorry that I am chatting to myself :)
> > > > > >
> > > > > > On Mon, Mar 16, 2015 at 11:29 PM, Steven Wu <
> stevenz...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > I think I can answer my own question. delayed response will
> cause
> > > the
> > > > > > > producer buffer to be full, which then result in either thread
> > > > blocking
> > > > > > or
> > > > > > > message drop.
> > > > > > >
> > > > > > > On Mon, Mar 16, 2015 at 11:24 PM, Steven Wu <
> stevenz...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > >> please correct me if I am missing sth here. I am not
> understanding
> > > > how
> > > > > > >> would throttle work without cooperation/back-off from
> producer.
> > > new
> > > > > Java
> > > > > > >> producer supports non-blocking API. why would delayed
> response be
> > > > able
> > > > > > to
> > > > > > >> slow down producer? producer will continue to fire async
> sends.
> > > > > > >>
> > > > > > >> On Mon, Mar 16, 2015 at 10:58 PM, Guozhang Wang <
> > > wangg...@gmail.com
> > > > >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> I think we are really discussing two separate issues here:
> > > > > > >>>
> > > > > > >>> 1. Whether we should a)
> > > append-then-block-then-returnOKButThrottled
> > > > > or
> > > > > > b)
> > > > > > >>> block-then-returnFailDuetoThrottled for quota actions on
> produce
> > > > > > >>> requests.
> > > > > > >>>
> > > > > > >>> Both these approaches assume some kind of well-behaveness of
> the
> > > > > > clients:
> > > > > > >>> option a) assumes the client sets an proper timeout value
> while
> > > can
> > > > > > just
> > > > > > >>> ignore "OKButThrottled" response, while option b) assumes the
> > > > client
> > > > > > >>> handles the "FailDuetoThrottled" appropriately. For any
> malicious
> > > > > > clients
> > > > > > >>> that, for example, just keep retrying either intentionally or
> > > not,
> > > > > > >>> neither
> > > > > > >>> of these approaches are actually effective.
> > > > > > >>>
> > > > > > >>> 2. For "OKButThrottled" and "FailDuetoThrottled" responses,
> shall
> > > > we
> > > > > > >>> encode
> > > > > > >>> them as error codes or augment the protocol to use a separate
> > > field
> > > > > > >>> indicating "status codes".
> > > > > > >>>
> > > > > > >>> Today we have already incorporated some status code as error
> > > codes
> > > > in
> > > > > > the
> > > > > > >>> responses, e.g. ReplicaNotAvailable in MetadataResponse, the
> pros
> > > > of
> > > > > > this
> > > > > > >>> is of course using a single field for response status like
> the
> > > HTTP
> > > > > > >>> status
> > > > > > >>> codes, while the cons is that it requires clients to handle
> the
> > > > error
> > > > > > >>> codes
> > > > > > >>> carefully.
> > > > > > >>>
> > > > > > >>> I think maybe we can actually extend the single-code
> approach to
> > > > > > overcome
> > > > > > >>> its drawbacks, that is, wrap the error codes semantics to the
> > > users
> > > > > so
> > > > > > >>> that
> > > > > > >>> users do not need to handle the codes one-by-one. More
> > > concretely,
> > > > > > >>> following Jay's example the client could write sth. like
> this:
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> -----------------
> > > > > > >>>
> > > > > > >>>   if(error.isOK())
> > > > > > >>>      // status code is good or the code can be simply
> ignored for
> > > > > this
> > > > > > >>> request type, process the request
> > > > > > >>>   else if(error.needsRetry())
> > > > > > >>>      // throttled, transient error, etc: retry
> > > > > > >>>   else if(error.isFatal())
> > > > > > >>>      // non-retriable errors, etc: notify / terminate / other
> > > > > handling
> > > > > > >>>
> > > > > > >>> -----------------
> > > > > > >>>
> > > > > > >>> Only when the clients really want to handle, for example
> > > > > > >>> FailDuetoThrottled
> > > > > > >>> status code specifically, it needs to:
> > > > > > >>>
> > > > > > >>>   if(error.isOK())
> > > > > > >>>      // status code is good or the code can be simply
> ignored for
> > > > > this
> > > > > > >>> request type, process the request
> > > > > > >>>   else if(error == FailDuetoThrottled )
> > > > > > >>>      // throttled: log it
> > > > > > >>>   else if(error.needsRetry())
> > > > > > >>>      // transient error, etc: retry
> > > > > > >>>   else if(error.isFatal())
> > > > > > >>>      // non-retriable errors, etc: notify / terminate / other
> > > > > handling
> > > > > > >>>
> > > > > > >>> -----------------
> > > > > > >>>
> > > > > > >>> And for implementation we can probably group the codes
> > > accordingly
> > > > > like
> > > > > > >>> HTTP status code such that we can do:
> > > > > > >>>
> > > > > > >>> boolean Error.isOK() {
> > > > > > >>>   return code < 300 && code >= 200;
> > > > > > >>> }
> > > > > > >>>
> > > > > > >>> Guozhang
> > > > > > >>>
> > > > > > >>> On Mon, Mar 16, 2015 at 10:24 PM, Ewen Cheslack-Postava <
> > > > > > >>> e...@confluent.io>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>> > Agreed that trying to shoehorn non-error codes into the
> error
> > > > field
> > > > > > is
> > > > > > >>> a
> > > > > > >>> > bad idea. It makes it *way* too easy to write code that
> looks
> > > > (and
> > > > > > >>> should
> > > > > > >>> > be) correct but is actually incorrect. If necessary, I
> think
> > > it's
> > > > > > much
> > > > > > >>> > better to to spend a couple of extra bytes to encode that
> > > > > information
> > > > > > >>> > separately (a "status" or "warning" section of the
> response).
> > > An
> > > > > > >>> indication
> > > > > > >>> > that throttling is occurring is something I'd expect to be
> > > > > indicated
> > > > > > >>> by a
> > > > > > >>> > bit flag in the response rather than as an error code.
> > > > > > >>> >
> > > > > > >>> > Gwen - I think an error code makes sense when the request
> > > > actually
> > > > > > >>> failed.
> > > > > > >>> > Option B, which Jun was advocating, would have appended the
> > > > > messages
> > > > > > >>> > successfully. If the rate-limiting case you're talking
> about
> > > had
> > > > > > >>> > successfully committed the messages, I would say that's
> also a
> > > > bad
> > > > > > use
> > > > > > >>> of
> > > > > > >>> > error codes.
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > On Mon, Mar 16, 2015 at 10:16 PM, Gwen Shapira <
> > > > > > gshap...@cloudera.com>
> > > > > > >>> > wrote:
> > > > > > >>> >
> > > > > > >>> > > We discussed an error code for rate-limiting (which I
> think
> > > > made
> > > > > > >>> > > sense), isn't it a similar case?
> > > > > > >>> > >
> > > > > > >>> > > On Mon, Mar 16, 2015 at 10:10 PM, Jay Kreps <
> > > > jay.kr...@gmail.com
> > > > > >
> > > > > > >>> wrote:
> > > > > > >>> > > > My concern is that as soon as you start encoding
> non-error
> > > > > > response
> > > > > > >>> > > > information into error codes the next question is what
> to
> > > do
> > > > if
> > > > > > two
> > > > > > >>> > such
> > > > > > >>> > > > codes apply (i.e. you have a replica down and the
> response
> > > is
> > > > > > >>> > quota'd). I
> > > > > > >>> > > > think I am trying to argue that error should mean "why
> we
> > > > > failed
> > > > > > >>> your
> > > > > > >>> > > > request", for which there will really only be one
> reason,
> > > and
> > > > > any
> > > > > > >>> other
> > > > > > >>> > > > useful information we want to send back is just another
> > > field
> > > > > in
> > > > > > >>> the
> > > > > > >>> > > > response.
> > > > > > >>> > > >
> > > > > > >>> > > > -Jay
> > > > > > >>> > > >
> > > > > > >>> > > > On Mon, Mar 16, 2015 at 9:51 PM, Gwen Shapira <
> > > > > > >>> gshap...@cloudera.com>
> > > > > > >>> > > wrote:
> > > > > > >>> > > >
> > > > > > >>> > > >> I think its not too late to reserve a set of error
> codes
> > > > > > >>> (200-299?)
> > > > > > >>> > > >> for "non-error" codes.
> > > > > > >>> > > >>
> > > > > > >>> > > >> It won't be backward compatible (i.e. clients that
> > > currently
> > > > > do
> > > > > > >>> "else
> > > > > > >>> > > >> throw" will throw on non-errors), but perhaps its
> > > > worthwhile.
> > > > > > >>> > > >>
> > > > > > >>> > > >> On Mon, Mar 16, 2015 at 9:42 PM, Jay Kreps <
> > > > > jay.kr...@gmail.com
> > > > > > >
> > > > > > >>> > wrote:
> > > > > > >>> > > >> > Hey Jun,
> > > > > > >>> > > >> >
> > > > > > >>> > > >> > I'd really really really like to avoid that. Having
> just
> > > > > > spent a
> > > > > > >>> > > bunch of
> > > > > > >>> > > >> > time on the clients, using the error codes to encode
> > > other
> > > > > > >>> > information
> > > > > > >>> > > >> > about the response is super dangerous. The error
> > > handling
> > > > is
> > > > > > >>> one of
> > > > > > >>> > > the
> > > > > > >>> > > >> > hardest parts of the client (Guozhang chime in
> here).
> > > > > > >>> > > >> >
> > > > > > >>> > > >> > Generally the error handling looks like
> > > > > > >>> > > >> >   if(error == none)
> > > > > > >>> > > >> >      // good, process the request
> > > > > > >>> > > >> >   else if(error == KNOWN_ERROR_1)
> > > > > > >>> > > >> >      // handle known error 1
> > > > > > >>> > > >> >   else if(error == KNOWN_ERROR_2)
> > > > > > >>> > > >> >      // handle known error 2
> > > > > > >>> > > >> >   else
> > > > > > >>> > > >> >      throw Errors.forCode(error).exception(); // or
> some
> > > > > other
> > > > > > >>> > default
> > > > > > >>> > > >> > behavior
> > > > > > >>> > > >> >
> > > > > > >>> > > >> > This works because we have a convention that and
> error
> > > is
> > > > > > >>> something
> > > > > > >>> > > that
> > > > > > >>> > > >> > prevented your getting the response so the default
> > > > handling
> > > > > > >>> case is
> > > > > > >>> > > sane
> > > > > > >>> > > >> > and forward compatible. It is tempting to use the
> error
> > > > code
> > > > > > to
> > > > > > >>> > convey
> > > > > > >>> > > >> > information in the success case. For example we
> could
> > > use
> > > > > > error
> > > > > > >>> > codes
> > > > > > >>> > > to
> > > > > > >>> > > >> > encode whether quotas were enforced, whether the
> request
> > > > was
> > > > > > >>> served
> > > > > > >>> > > out
> > > > > > >>> > > >> of
> > > > > > >>> > > >> > cache, whether the stock market is up today, or
> > > whatever.
> > > > > The
> > > > > > >>> > problem
> > > > > > >>> > > is
> > > > > > >>> > > >> > that since these are not errors as far as the
> client is
> > > > > > >>> concerned it
> > > > > > >>> > > >> should
> > > > > > >>> > > >> > not throw an exception but process the response,
> but now
> > > > we
> > > > > > >>> created
> > > > > > >>> > an
> > > > > > >>> > > >> > explicit requirement that that error be handled
> > > explicitly
> > > > > > >>> since it
> > > > > > >>> > is
> > > > > > >>> > > >> > different. I really think that this kind of
> information
> > > is
> > > > > not
> > > > > > >>> an
> > > > > > >>> > > error,
> > > > > > >>> > > >> it
> > > > > > >>> > > >> > is just information, and if we want it in the
> response
> > > we
> > > > > > >>> should do
> > > > > > >>> > > the
> > > > > > >>> > > >> > right thing and add a new field to the response.
> > > > > > >>> > > >> >
> > > > > > >>> > > >> > I think you saw the Samza bug that was literally an
> > > > example
> > > > > of
> > > > > > >>> this
> > > > > > >>> > > >> > happening and leading to an infinite retry loop.
> > > > > > >>> > > >> >
> > > > > > >>> > > >> > Further more I really want to emphasize that hitting
> > > your
> > > > > > quota
> > > > > > >>> in
> > > > > > >>> > the
> > > > > > >>> > > >> > design that Adi has proposed is actually not an
> error
> > > > > > condition
> > > > > > >>> at
> > > > > > >>> > > all.
> > > > > > >>> > > >> It
> > > > > > >>> > > >> > is totally reasonable in any bootstrap situation to
> > > > > > >>> intentionally
> > > > > > >>> > > want to
> > > > > > >>> > > >> > run at the limit the system imposes on you.
> > > > > > >>> > > >> >
> > > > > > >>> > > >> > -Jay
> > > > > > >>> > > >> >
> > > > > > >>> > > >> >
> > > > > > >>> > > >> >
> > > > > > >>> > > >> > On Mon, Mar 16, 2015 at 4:27 PM, Jun Rao <
> > > > j...@confluent.io>
> > > > > > >>> wrote:
> > > > > > >>> > > >> >
> > > > > > >>> > > >> >> It's probably useful for a client to know whether
> its
> > > > > > requests
> > > > > > >>> are
> > > > > > >>> > > >> >> throttled or not (e.g., for monitoring and
> alerting).
> > > > From
> > > > > > that
> > > > > > >>> > > >> >> perspective, option B (delay the requests and
> return an
> > > > > > error)
> > > > > > >>> > seems
> > > > > > >>> > > >> >> better.
> > > > > > >>> > > >> >>
> > > > > > >>> > > >> >> Thanks,
> > > > > > >>> > > >> >>
> > > > > > >>> > > >> >> Jun
> > > > > > >>> > > >> >>
> > > > > > >>> > > >> >> On Wed, Mar 4, 2015 at 3:51 PM, Aditya Auradkar <
> > > > > > >>> > > >> >> aaurad...@linkedin.com.invalid> wrote:
> > > > > > >>> > > >> >>
> > > > > > >>> > > >> >> > Posted a KIP for quotas in kafka.
> > > > > > >>> > > >> >> >
> > > > > > >>> >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
> > > > > > >>> > > >> >> >
> > > > > > >>> > > >> >> > Appreciate any feedback.
> > > > > > >>> > > >> >> >
> > > > > > >>> > > >> >> > Aditya
> > > > > > >>> > > >> >> >
> > > > > > >>> > > >> >>
> > > > > > >>> > > >>
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > --
> > > > > > >>> > Thanks,
> > > > > > >>> > Ewen
> > > > > > >>> >
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> -- Guozhang
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>
>

Re: [KIP-DISCUSSION] KIP-13 Quotas

Reply via email to