Re: [KIP-DISCUSSION] KIP-13 Quotas

Jay Kreps Wed, 18 Mar 2015 14:12:06 -0700

Hey Steven,

The current proposal is actually to enforce quotas at the
client/application level, NOT the topic level. So if you have a service
with a few dozen instances the quota is against all of those instances
added up across all their topics. So actually the effect would be the same
either way but throttling gives the producer the choice of either blocking
or dropping.


-Jay

On Tue, Mar 17, 2015 at 10:08 AM, Steven Wu <[email protected]> wrote:

> Jay,
>
> let's say an app produces to 10 different topics. one of the topic is sent
> from a library. due to whatever condition/bug, this lib starts to send
> messages over the quota. if we go with the delayed response approach, it
> will cause the whole shared RecordAccumulator buffer to be filled up. that
> will penalize other 9 topics who are within the quota. that is the
> unfairness point that Ewen and I were trying to make.
>
> if broker just drop the msg and return an error/status code indicates the
> drop and why. then producer can just move on and accept the drop. shared
> buffer won't be saturated and other 9 topics won't be penalized.
>
> Thanks,
> Steven
>
>
>
> On Tue, Mar 17, 2015 at 9:44 AM, Jay Kreps <[email protected]> wrote:
>
> > Hey Steven,
> >
> > It is true that hitting the quota will cause back-pressure on the
> producer.
> > But the solution is simple, a producer that wants to avoid this should
> stay
> > under its quota. In other words this is a contract between the cluster
> and
> > the client, with each side having something to uphold. Quite possibly the
> > same thing will happen in the absence of a quota, a client that produces
> an
> > unexpected amount of load will hit the limits of the server and
> experience
> > backpressure. Quotas just allow you to set that same limit at something
> > lower than 100% of all resources on the server, which is useful for a
> > shared cluster.
> >
> > -Jay
> >
> > On Mon, Mar 16, 2015 at 11:34 PM, Steven Wu <[email protected]>
> wrote:
> >
> > > wait. we create one kafka producer for each cluster. each cluster can
> > have
> > > many topics. if producer buffer got filled up due to delayed response
> for
> > > one throttled topic, won't that penalize other topics unfairly? it
> seems
> > to
> > > me that broker should just return error without delay.
> > >
> > > sorry that I am chatting to myself :)
> > >
> > > On Mon, Mar 16, 2015 at 11:29 PM, Steven Wu <[email protected]>
> > wrote:
> > >
> > > > I think I can answer my own question. delayed response will cause the
> > > > producer buffer to be full, which then result in either thread
> blocking
> > > or
> > > > message drop.
> > > >
> > > > On Mon, Mar 16, 2015 at 11:24 PM, Steven Wu <[email protected]>
> > > wrote:
> > > >
> > > >> please correct me if I am missing sth here. I am not understanding
> how
> > > >> would throttle work without cooperation/back-off from producer. new
> > Java
> > > >> producer supports non-blocking API. why would delayed response be
> able
> > > to
> > > >> slow down producer? producer will continue to fire async sends.
> > > >>
> > > >> On Mon, Mar 16, 2015 at 10:58 PM, Guozhang Wang <[email protected]
> >
> > > >> wrote:
> > > >>
> > > >>> I think we are really discussing two separate issues here:
> > > >>>
> > > >>> 1. Whether we should a) append-then-block-then-returnOKButThrottled
> > or
> > > b)
> > > >>> block-then-returnFailDuetoThrottled for quota actions on produce
> > > >>> requests.
> > > >>>
> > > >>> Both these approaches assume some kind of well-behaveness of the
> > > clients:
> > > >>> option a) assumes the client sets an proper timeout value while can
> > > just
> > > >>> ignore "OKButThrottled" response, while option b) assumes the
> client
> > > >>> handles the "FailDuetoThrottled" appropriately. For any malicious
> > > clients
> > > >>> that, for example, just keep retrying either intentionally or not,
> > > >>> neither
> > > >>> of these approaches are actually effective.
> > > >>>
> > > >>> 2. For "OKButThrottled" and "FailDuetoThrottled" responses, shall
> we
> > > >>> encode
> > > >>> them as error codes or augment the protocol to use a separate field
> > > >>> indicating "status codes".
> > > >>>
> > > >>> Today we have already incorporated some status code as error codes
> in
> > > the
> > > >>> responses, e.g. ReplicaNotAvailable in MetadataResponse, the pros
> of
> > > this
> > > >>> is of course using a single field for response status like the HTTP
> > > >>> status
> > > >>> codes, while the cons is that it requires clients to handle the
> error
> > > >>> codes
> > > >>> carefully.
> > > >>>
> > > >>> I think maybe we can actually extend the single-code approach to
> > > overcome
> > > >>> its drawbacks, that is, wrap the error codes semantics to the users
> > so
> > > >>> that
> > > >>> users do not need to handle the codes one-by-one. More concretely,
> > > >>> following Jay's example the client could write sth. like this:
> > > >>>
> > > >>>
> > > >>> -----------------
> > > >>>
> > > >>>   if(error.isOK())
> > > >>>      // status code is good or the code can be simply ignored for
> > this
> > > >>> request type, process the request
> > > >>>   else if(error.needsRetry())
> > > >>>      // throttled, transient error, etc: retry
> > > >>>   else if(error.isFatal())
> > > >>>      // non-retriable errors, etc: notify / terminate / other
> > handling
> > > >>>
> > > >>> -----------------
> > > >>>
> > > >>> Only when the clients really want to handle, for example
> > > >>> FailDuetoThrottled
> > > >>> status code specifically, it needs to:
> > > >>>
> > > >>>   if(error.isOK())
> > > >>>      // status code is good or the code can be simply ignored for
> > this
> > > >>> request type, process the request
> > > >>>   else if(error == FailDuetoThrottled )
> > > >>>      // throttled: log it
> > > >>>   else if(error.needsRetry())
> > > >>>      // transient error, etc: retry
> > > >>>   else if(error.isFatal())
> > > >>>      // non-retriable errors, etc: notify / terminate / other
> > handling
> > > >>>
> > > >>> -----------------
> > > >>>
> > > >>> And for implementation we can probably group the codes accordingly
> > like
> > > >>> HTTP status code such that we can do:
> > > >>>
> > > >>> boolean Error.isOK() {
> > > >>>   return code < 300 && code >= 200;
> > > >>> }
> > > >>>
> > > >>> Guozhang
> > > >>>
> > > >>> On Mon, Mar 16, 2015 at 10:24 PM, Ewen Cheslack-Postava <
> > > >>> [email protected]>
> > > >>> wrote:
> > > >>>
> > > >>> > Agreed that trying to shoehorn non-error codes into the error
> field
> > > is
> > > >>> a
> > > >>> > bad idea. It makes it *way* too easy to write code that looks
> (and
> > > >>> should
> > > >>> > be) correct but is actually incorrect. If necessary, I think it's
> > > much
> > > >>> > better to to spend a couple of extra bytes to encode that
> > information
> > > >>> > separately (a "status" or "warning" section of the response). An
> > > >>> indication
> > > >>> > that throttling is occurring is something I'd expect to be
> > indicated
> > > >>> by a
> > > >>> > bit flag in the response rather than as an error code.
> > > >>> >
> > > >>> > Gwen - I think an error code makes sense when the request
> actually
> > > >>> failed.
> > > >>> > Option B, which Jun was advocating, would have appended the
> > messages
> > > >>> > successfully. If the rate-limiting case you're talking about had
> > > >>> > successfully committed the messages, I would say that's also a
> bad
> > > use
> > > >>> of
> > > >>> > error codes.
> > > >>> >
> > > >>> >
> > > >>> > On Mon, Mar 16, 2015 at 10:16 PM, Gwen Shapira <
> > > [email protected]>
> > > >>> > wrote:
> > > >>> >
> > > >>> > > We discussed an error code for rate-limiting (which I think
> made
> > > >>> > > sense), isn't it a similar case?
> > > >>> > >
> > > >>> > > On Mon, Mar 16, 2015 at 10:10 PM, Jay Kreps <
> [email protected]
> > >
> > > >>> wrote:
> > > >>> > > > My concern is that as soon as you start encoding non-error
> > > response
> > > >>> > > > information into error codes the next question is what to do
> if
> > > two
> > > >>> > such
> > > >>> > > > codes apply (i.e. you have a replica down and the response is
> > > >>> > quota'd). I
> > > >>> > > > think I am trying to argue that error should mean "why we
> > failed
> > > >>> your
> > > >>> > > > request", for which there will really only be one reason, and
> > any
> > > >>> other
> > > >>> > > > useful information we want to send back is just another field
> > in
> > > >>> the
> > > >>> > > > response.
> > > >>> > > >
> > > >>> > > > -Jay
> > > >>> > > >
> > > >>> > > > On Mon, Mar 16, 2015 at 9:51 PM, Gwen Shapira <
> > > >>> [email protected]>
> > > >>> > > wrote:
> > > >>> > > >
> > > >>> > > >> I think its not too late to reserve a set of error codes
> > > >>> (200-299?)
> > > >>> > > >> for "non-error" codes.
> > > >>> > > >>
> > > >>> > > >> It won't be backward compatible (i.e. clients that currently
> > do
> > > >>> "else
> > > >>> > > >> throw" will throw on non-errors), but perhaps its
> worthwhile.
> > > >>> > > >>
> > > >>> > > >> On Mon, Mar 16, 2015 at 9:42 PM, Jay Kreps <
> > [email protected]
> > > >
> > > >>> > wrote:
> > > >>> > > >> > Hey Jun,
> > > >>> > > >> >
> > > >>> > > >> > I'd really really really like to avoid that. Having just
> > > spent a
> > > >>> > > bunch of
> > > >>> > > >> > time on the clients, using the error codes to encode other
> > > >>> > information
> > > >>> > > >> > about the response is super dangerous. The error handling
> is
> > > >>> one of
> > > >>> > > the
> > > >>> > > >> > hardest parts of the client (Guozhang chime in here).
> > > >>> > > >> >
> > > >>> > > >> > Generally the error handling looks like
> > > >>> > > >> >   if(error == none)
> > > >>> > > >> >      // good, process the request
> > > >>> > > >> >   else if(error == KNOWN_ERROR_1)
> > > >>> > > >> >      // handle known error 1
> > > >>> > > >> >   else if(error == KNOWN_ERROR_2)
> > > >>> > > >> >      // handle known error 2
> > > >>> > > >> >   else
> > > >>> > > >> >      throw Errors.forCode(error).exception(); // or some
> > other
> > > >>> > default
> > > >>> > > >> > behavior
> > > >>> > > >> >
> > > >>> > > >> > This works because we have a convention that and error is
> > > >>> something
> > > >>> > > that
> > > >>> > > >> > prevented your getting the response so the default
> handling
> > > >>> case is
> > > >>> > > sane
> > > >>> > > >> > and forward compatible. It is tempting to use the error
> code
> > > to
> > > >>> > convey
> > > >>> > > >> > information in the success case. For example we could use
> > > error
> > > >>> > codes
> > > >>> > > to
> > > >>> > > >> > encode whether quotas were enforced, whether the request
> was
> > > >>> served
> > > >>> > > out
> > > >>> > > >> of
> > > >>> > > >> > cache, whether the stock market is up today, or whatever.
> > The
> > > >>> > problem
> > > >>> > > is
> > > >>> > > >> > that since these are not errors as far as the client is
> > > >>> concerned it
> > > >>> > > >> should
> > > >>> > > >> > not throw an exception but process the response, but now
> we
> > > >>> created
> > > >>> > an
> > > >>> > > >> > explicit requirement that that error be handled explicitly
> > > >>> since it
> > > >>> > is
> > > >>> > > >> > different. I really think that this kind of information is
> > not
> > > >>> an
> > > >>> > > error,
> > > >>> > > >> it
> > > >>> > > >> > is just information, and if we want it in the response we
> > > >>> should do
> > > >>> > > the
> > > >>> > > >> > right thing and add a new field to the response.
> > > >>> > > >> >
> > > >>> > > >> > I think you saw the Samza bug that was literally an
> example
> > of
> > > >>> this
> > > >>> > > >> > happening and leading to an infinite retry loop.
> > > >>> > > >> >
> > > >>> > > >> > Further more I really want to emphasize that hitting your
> > > quota
> > > >>> in
> > > >>> > the
> > > >>> > > >> > design that Adi has proposed is actually not an error
> > > condition
> > > >>> at
> > > >>> > > all.
> > > >>> > > >> It
> > > >>> > > >> > is totally reasonable in any bootstrap situation to
> > > >>> intentionally
> > > >>> > > want to
> > > >>> > > >> > run at the limit the system imposes on you.
> > > >>> > > >> >
> > > >>> > > >> > -Jay
> > > >>> > > >> >
> > > >>> > > >> >
> > > >>> > > >> >
> > > >>> > > >> > On Mon, Mar 16, 2015 at 4:27 PM, Jun Rao <
> [email protected]>
> > > >>> wrote:
> > > >>> > > >> >
> > > >>> > > >> >> It's probably useful for a client to know whether its
> > > requests
> > > >>> are
> > > >>> > > >> >> throttled or not (e.g., for monitoring and alerting).
> From
> > > that
> > > >>> > > >> >> perspective, option B (delay the requests and return an
> > > error)
> > > >>> > seems
> > > >>> > > >> >> better.
> > > >>> > > >> >>
> > > >>> > > >> >> Thanks,
> > > >>> > > >> >>
> > > >>> > > >> >> Jun
> > > >>> > > >> >>
> > > >>> > > >> >> On Wed, Mar 4, 2015 at 3:51 PM, Aditya Auradkar <
> > > >>> > > >> >> [email protected]> wrote:
> > > >>> > > >> >>
> > > >>> > > >> >> > Posted a KIP for quotas in kafka.
> > > >>> > > >> >> >
> > > >>> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
> > > >>> > > >> >> >
> > > >>> > > >> >> > Appreciate any feedback.
> > > >>> > > >> >> >
> > > >>> > > >> >> > Aditya
> > > >>> > > >> >> >
> > > >>> > > >> >>
> > > >>> > > >>
> > > >>> > >
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> > --
> > > >>> > Thanks,
> > > >>> > Ewen
> > > >>> >
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> -- Guozhang
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: [KIP-DISCUSSION] KIP-13 Quotas

Reply via email to