I think that is a good summary.

1. I'd favor delay over error--I think we have a design where delay will
basically mimic the same behavior you would get if you had a lower-capacity
Kafka cluster all to yourself, which from my point of view is ideal (see the
rough sketch after this list). I'm aesthetically opposed to delay+error.
2. I actually don't think this is blocked on completing the metrics
migration. I do think we need to figure out the direction, though, to decide
what to do.
3. Yup.
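
(As a rough, purely illustrative sketch of the delay idea -- the helper below
and its windowing are assumptions, not the KIP's actual algorithm -- the
broker could derive the response delay from a per-client byte-rate quota
like this:)

  // Hypothetical helper, not from KIP-13: given the bytes observed in the
  // current window and a per-client byte-rate quota, compute how long to
  // delay the response so the client's effective rate matches the quota.
  long throttleTimeMs(long observedBytes, long windowMs, long quotaBytesPerSec) {
      double allowedBytes = quotaBytesPerSec * (windowMs / 1000.0);
      if (observedBytes <= allowedBytes)
          return 0; // under quota, no delay needed
      double overageBytes = observedBytes - allowedBytes;
      return (long) Math.ceil(overageBytes * 1000.0 / quotaBytesPerSec);
  }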

-Jay


On Wed, Mar 18, 2015 at 4:02 PM, Aditya Auradkar <
aaurad...@linkedin.com.invalid> wrote:

> Hey everyone,
>
> Thanks for the great discussion. There are currently a few points on this
> KIP that need addressing and I want to make sure we are on the same page
> about those.
>
> 1. Append and delay response vs delay and return error
> - I think we've discussed the pros and cons of each approach but haven't
> chosen an approach yet. Where does everyone stand on this issue?
>
> 2. Metrics Migration and usage in quotas
> - The metrics library in clients has a notion of quotas that we should
> reuse. For that to happen, we need to migrate the server to the new metrics
> package.
> - Need more clarification on how to compute throttling time and windowing
> for quotas.
>
> I'm going to start a new KIP to discuss metrics migration separately. That
> will also contain a section on quotas.
>
> 3. Dynamic Configuration management - Being discussed in KIP-5. Basically
> we need something that will model default quotas and allow per-client
> overrides.
>
> Is there something else that I'm missing?
>
> Thanks,
> Aditya
> ________________________________________
> From: Jay Kreps [jay.kr...@gmail.com]
> Sent: Wednesday, March 18, 2015 2:10 PM
> To: dev@kafka.apache.org
> Subject: Re: [KIP-DISCUSSION] KIP-13 Quotas
>
> Hey Steven,
>
> The current proposal is actually to enforce quotas at the
> client/application level, NOT the topic level. So if you have a service
> with a few dozen instances the quota is against all of those instances
> added up across all their topics. So actually the effect would be the same
> either way but throttling gives the producer the choice of either blocking
> or dropping.
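
(A small sketch of that blocking-vs-dropping choice on the producer side,
assuming the 0.8.2-era Java producer; block.on.buffer.full and
BufferExhaustedException are that client's names, but treat the snippet as
illustrative rather than prescriptive:)

  // imports assumed: org.apache.kafka.clients.producer.*, java.util.Properties
  Properties props = new Properties();
  props.put("bootstrap.servers", "broker:9092"); // placeholder address
  props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
  props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
  // true  -> send() blocks when the shared buffer fills up (back-pressure)
  // false -> send() throws BufferExhaustedException so the app can drop and move on
  props.put("block.on.buffer.full", "false");
  Producer<byte[], byte[]> producer = new KafkaProducer<>(props);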
>
> -Jay
>
> On Tue, Mar 17, 2015 at 10:08 AM, Steven Wu <stevenz...@gmail.com> wrote:
>
> > Jay,
> >
> > let's say an app produces to 10 different topics. one of the topics is
> > sent from a library. due to whatever condition/bug, this lib starts to
> > send messages over the quota. if we go with the delayed response
> > approach, it will cause the whole shared RecordAccumulator buffer to be
> > filled up. that will penalize the other 9 topics that are within the
> > quota. that is the unfairness point that Ewen and I were trying to make.
> >
> > if the broker just drops the msg and returns an error/status code
> > indicating the drop and why, then the producer can just move on and
> > accept the drop. the shared buffer won't be saturated and the other 9
> > topics won't be penalized.
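
(To make the drop-and-move-on behavior concrete, a hedged sketch using the
producer's async callback; the quota error surfaced here, the payload
variable, and the droppedRecords counter are hypothetical, since the KIP has
not settled on an error code:)

  // imports assumed: org.apache.kafka.clients.producer.*, java.util.concurrent.atomic.AtomicLong
  // Hypothetical flow: the broker rejects the over-quota produce with an error
  // instead of delaying it, the callback counts the drop, and the shared
  // RecordAccumulator is freed right away, so the other 9 topics are unaffected.
  final AtomicLong droppedRecords = new AtomicLong();
  producer.send(new ProducerRecord<byte[], byte[]>("noisy-topic", payload), new Callback() {
      public void onCompletion(RecordMetadata metadata, Exception exception) {
          if (exception != null)
              droppedRecords.incrementAndGet(); // e.g. a quota-violation error
      }
  });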
> >
> > Thanks,
> > Steven
> >
> >
> >
> > On Tue, Mar 17, 2015 at 9:44 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
> >
> > > Hey Steven,
> > >
> > > It is true that hitting the quota will cause back-pressure on the
> > > producer. But the solution is simple: a producer that wants to avoid
> > > this should stay under its quota. In other words this is a contract
> > > between the cluster and the client, with each side having something to
> > > uphold. Quite possibly the same thing will happen in the absence of a
> > > quota: a client that produces an unexpected amount of load will hit
> > > the limits of the server and experience backpressure. Quotas just
> > > allow you to set that same limit at something lower than 100% of all
> > > resources on the server, which is useful for a shared cluster.
> > >
> > > -Jay
> > >
> > > On Mon, Mar 16, 2015 at 11:34 PM, Steven Wu <stevenz...@gmail.com>
> > wrote:
> > >
> > > > wait. we create one kafka producer for each cluster. each cluster
> > > > can have many topics. if the producer buffer gets filled up due to a
> > > > delayed response for one throttled topic, won't that penalize the
> > > > other topics unfairly? it seems to me that the broker should just
> > > > return an error without delay.
> > > >
> > > > sorry that I am chatting to myself :)
> > > >
> > > > On Mon, Mar 16, 2015 at 11:29 PM, Steven Wu <stevenz...@gmail.com>
> > > wrote:
> > > >
> > > > > I think I can answer my own question. a delayed response will
> > > > > cause the producer buffer to be full, which then results in either
> > > > > the thread blocking or messages being dropped.
> > > > >
> > > > > On Mon, Mar 16, 2015 at 11:24 PM, Steven Wu <stevenz...@gmail.com>
> > > > wrote:
> > > > >
> > > > >> please correct me if I am missing sth here. I am not
> > > > >> understanding how throttling would work without
> > > > >> cooperation/back-off from the producer. the new Java producer
> > > > >> supports a non-blocking API. why would a delayed response be able
> > > > >> to slow down the producer? the producer will continue to fire
> > > > >> async sends.
> > > > >>
> > > > >> On Mon, Mar 16, 2015 at 10:58 PM, Guozhang Wang <wangg...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>> I think we are really discussing two separate issues here:
> > > > >>>
> > > > >>> 1. Whether we should do a) append-then-block-then-returnOKButThrottled
> > > > >>> or b) block-then-returnFailDuetoThrottled for quota actions on
> > > > >>> produce requests.
> > > > >>>
> > > > >>> Both these approaches assume some kind of well-behavedness of the
> > > > >>> clients: option a) assumes the client sets a proper timeout value
> > > > >>> while it can just ignore the "OKButThrottled" response, while
> > > > >>> option b) assumes the client handles the "FailDuetoThrottled"
> > > > >>> response appropriately. For any malicious clients that, for
> > > > >>> example, just keep retrying either intentionally or not, neither
> > > > >>> of these approaches is actually effective.
> > > > >>>
> > > > >>> 2. For "OKButThrottled" and "FailDuetoThrottled" responses, shall
> > > > >>> we encode them as error codes or augment the protocol to use a
> > > > >>> separate field indicating "status codes"?
> > > > >>>
> > > > >>> Today we have already incorporated some status codes as error
> > > > >>> codes in the responses, e.g. ReplicaNotAvailable in
> > > > >>> MetadataResponse. The pro of this is of course using a single
> > > > >>> field for response status, like the HTTP status codes, while the
> > > > >>> con is that it requires clients to handle the error codes
> > > > >>> carefully.
> > > > >>>
> > > > >>> I think maybe we can actually extend the single-code approach to
> > > > >>> overcome its drawbacks, that is, wrap the error code semantics
> > > > >>> for the users so that they do not need to handle the codes
> > > > >>> one-by-one. More concretely, following Jay's example the client
> > > > >>> could write sth. like this:
> > > > >>>
> > > > >>>
> > > > >>> -----------------
> > > > >>>
> > > > >>>   if(error.isOK())
> > > > >>>      // status code is good or the code can be simply ignored for
> > > > >>>      // this request type, process the request
> > > > >>>   else if(error.needsRetry())
> > > > >>>      // throttled, transient error, etc: retry
> > > > >>>   else if(error.isFatal())
> > > > >>>      // non-retriable errors, etc: notify / terminate / other handling
> > > > >>>
> > > > >>> -----------------
> > > > >>>
> > > > >>> Only when a client really wants to handle, for example, the
> > > > >>> FailDuetoThrottled status code specifically does it need to:
> > > > >>>
> > > > >>>   if(error.isOK())
> > > > >>>      // status code is good or the code can be simply ignored for
> > > > >>>      // this request type, process the request
> > > > >>>   else if(error == FailDuetoThrottled)
> > > > >>>      // throttled: log it
> > > > >>>   else if(error.needsRetry())
> > > > >>>      // transient error, etc: retry
> > > > >>>   else if(error.isFatal())
> > > > >>>      // non-retriable errors, etc: notify / terminate / other handling
> > > > >>>
> > > > >>> -----------------
> > > > >>>
> > > > >>> And for implementation we can probably group the codes
> > > > >>> accordingly, like HTTP status codes, such that we can do:
> > > > >>>
> > > > >>> boolean Error.isOK() {
> > > > >>>   return code < 300 && code >= 200;
> > > > >>> }
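
(A slightly fuller sketch of that grouping, with HTTP-style ranges; the class
name and exact ranges are made up for illustration:)

  // Illustrative only: group response codes into coarse categories so callers
  // can write isOK()/needsRetry()/isFatal() checks instead of per-code handling.
  public final class ResponseStatus {
      private final short code;

      public ResponseStatus(short code) { this.code = code; }

      public boolean isOK()       { return code >= 200 && code < 300; } // success, possibly with info such as "throttled"
      public boolean needsRetry() { return code >= 300 && code < 400; } // transient / retriable
      public boolean isFatal()    { return code >= 400; }               // non-retriable
  }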
> > > > >>>
> > > > >>> Guozhang
> > > > >>>
> > > > >>> On Mon, Mar 16, 2015 at 10:24 PM, Ewen Cheslack-Postava <e...@confluent.io>
> > > > >>> wrote:
> > > > >>>
> > > > >>> > Agreed that trying to shoehorn non-error codes into the error
> > > > >>> > field is a bad idea. It makes it *way* too easy to write code
> > > > >>> > that looks (and should be) correct but is actually incorrect.
> > > > >>> > If necessary, I think it's much better to spend a couple of
> > > > >>> > extra bytes to encode that information separately (a "status"
> > > > >>> > or "warning" section of the response). An indication that
> > > > >>> > throttling is occurring is something I'd expect to be indicated
> > > > >>> > by a bit flag in the response rather than as an error code.
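
(One way to picture that "separate field" idea, as a sketch only; this is not
the actual wire format and the field names are invented:)

  // Hypothetical per-partition produce response: the error code keeps meaning
  // "why the request failed", while throttling is reported in its own field
  // that clients can monitor or ignore without touching error handling.
  class PartitionProduceResponse {
      short errorCode;      // NONE on success
      long  baseOffset;
      int   throttleTimeMs; // 0 when not throttled; purely informational
  }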
> > > > >>> >
> > > > >>> > Gwen - I think an error code makes sense when the request
> > > > >>> > actually failed. Option B, which Jun was advocating, would have
> > > > >>> > appended the messages successfully. If the rate-limiting case
> > > > >>> > you're talking about had successfully committed the messages, I
> > > > >>> > would say that's also a bad use of error codes.
> > > > >>> >
> > > > >>> >
> > > > >>> > On Mon, Mar 16, 2015 at 10:16 PM, Gwen Shapira <gshap...@cloudera.com>
> > > > >>> > wrote:
> > > > >>> >
> > > > >>> > > We discussed an error code for rate-limiting (which I think
> > > > >>> > > made sense); isn't it a similar case?
> > > > >>> > >
> > > > >>> > > On Mon, Mar 16, 2015 at 10:10 PM, Jay Kreps <jay.kr...@gmail.com>
> > > > >>> > > wrote:
> > > > >>> > > > My concern is that as soon as you start encoding non-error
> > > > >>> > > > response information into error codes the next question is
> > > > >>> > > > what to do if two such codes apply (i.e. you have a replica
> > > > >>> > > > down and the response is quota'd). I think I am trying to
> > > > >>> > > > argue that error should mean "why we failed your request",
> > > > >>> > > > for which there will really only be one reason, and any
> > > > >>> > > > other useful information we want to send back is just
> > > > >>> > > > another field in the response.
> > > > >>> > > >
> > > > >>> > > > -Jay
> > > > >>> > > >
> > > > >>> > > > On Mon, Mar 16, 2015 at 9:51 PM, Gwen Shapira <gshap...@cloudera.com>
> > > > >>> > > > wrote:
> > > > >>> > > >
> > > > >>> > > >> I think it's not too late to reserve a set of error codes
> > > > >>> > > >> (200-299?) for "non-error" codes.
> > > > >>> > > >>
> > > > >>> > > >> It won't be backward compatible (i.e. clients that
> > > > >>> > > >> currently do "else throw" will throw on non-errors), but
> > > > >>> > > >> perhaps it's worthwhile.
> > > > >>> > > >>
> > > > >>> > > >> On Mon, Mar 16, 2015 at 9:42 PM, Jay Kreps <jay.kr...@gmail.com>
> > > > >>> > > >> wrote:
> > > > >>> > > >> > Hey Jun,
> > > > >>> > > >> >
> > > > >>> > > >> > I'd really really really like to avoid that. Having just
> > > > >>> > > >> > spent a bunch of time on the clients, using the error
> > > > >>> > > >> > codes to encode other information about the response is
> > > > >>> > > >> > super dangerous. The error handling is one of the hardest
> > > > >>> > > >> > parts of the client (Guozhang chime in here).
> > > > >>> > > >> >
> > > > >>> > > >> > Generally the error handling looks like
> > > > >>> > > >> >   if(error == none)
> > > > >>> > > >> >      // good, process the request
> > > > >>> > > >> >   else if(error == KNOWN_ERROR_1)
> > > > >>> > > >> >      // handle known error 1
> > > > >>> > > >> >   else if(error == KNOWN_ERROR_2)
> > > > >>> > > >> >      // handle known error 2
> > > > >>> > > >> >   else
> > > > >>> > > >> >      // or some other default behavior
> > > > >>> > > >> >      throw Errors.forCode(error).exception();
> > > > >>> > > >> >
> > > > >>> > > >> > This works because we have a convention that an error is
> > > > >>> > > >> > something that prevented your getting the response, so
> > > > >>> > > >> > the default handling case is sane and forward compatible.
> > > > >>> > > >> > It is tempting to use the error code to convey
> > > > >>> > > >> > information in the success case. For example we could use
> > > > >>> > > >> > error codes to encode whether quotas were enforced,
> > > > >>> > > >> > whether the request was served out of cache, whether the
> > > > >>> > > >> > stock market is up today, or whatever. The problem is
> > > > >>> > > >> > that since these are not errors as far as the client is
> > > > >>> > > >> > concerned, it should not throw an exception but process
> > > > >>> > > >> > the response; yet now we have created an explicit
> > > > >>> > > >> > requirement that that error be handled explicitly since
> > > > >>> > > >> > it is different. I really think that this kind of
> > > > >>> > > >> > information is not an error, it is just information, and
> > > > >>> > > >> > if we want it in the response we should do the right
> > > > >>> > > >> > thing and add a new field to the response.
> > > > >>> > > >> >
> > > > >>> > > >> > I think you saw the Samza bug that was literally an
> > > > >>> > > >> > example of this happening and leading to an infinite
> > > > >>> > > >> > retry loop.
> > > > >>> > > >> >
> > > > >>> > > >> > Furthermore I really want to emphasize that hitting your
> > > > >>> > > >> > quota in the design that Adi has proposed is actually not
> > > > >>> > > >> > an error condition at all. It is totally reasonable in
> > > > >>> > > >> > any bootstrap situation to intentionally want to run at
> > > > >>> > > >> > the limit the system imposes on you.
> > > > >>> > > >> >
> > > > >>> > > >> > -Jay
> > > > >>> > > >> >
> > > > >>> > > >> >
> > > > >>> > > >> >
> > > > >>> > > >> > On Mon, Mar 16, 2015 at 4:27 PM, Jun Rao <j...@confluent.io>
> > > > >>> > > >> > wrote:
> > > > >>> > > >> >
> > > > >>> > > >> >> It's probably useful for a client to know whether its
> > > > >>> > > >> >> requests are throttled or not (e.g., for monitoring and
> > > > >>> > > >> >> alerting). From that perspective, option B (delay the
> > > > >>> > > >> >> requests and return an error) seems better.
> > > > >>> > > >> >>
> > > > >>> > > >> >> Thanks,
> > > > >>> > > >> >>
> > > > >>> > > >> >> Jun
> > > > >>> > > >> >>
> > > > >>> > > >> >> On Wed, Mar 4, 2015 at 3:51 PM, Aditya Auradkar <
> > > > >>> > > >> >> aaurad...@linkedin.com.invalid> wrote:
> > > > >>> > > >> >>
> > > > >>> > > >> >> > Posted a KIP for quotas in kafka.
> > > > >>> > > >> >> >
> > > > >>> > > >> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
> > > > >>> > > >> >> >
> > > > >>> > > >> >> > Appreciate any feedback.
> > > > >>> > > >> >> >
> > > > >>> > > >> >> > Aditya
> > > > >>> > > >> >> >
> > > > >>> > > >> >>
> > > > >>> > > >>
> > > > >>> > >
> > > > >>> >
> > > > >>> >
> > > > >>> >
> > > > >>> > --
> > > > >>> > Thanks,
> > > > >>> > Ewen
> > > > >>> >
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> --
> > > > >>> -- Guozhang
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>
