Thanks for the explanation Colin.

> ForceRenounceCount =>
>   kafka.controller:type=KafkaController,name=MetadataErrorCount
> publisher-error-count => metadata-apply-error-count
> listener-batch-load-error-count => metadata-load-error-count

Yeah, this makes sense. I have made the changes in naming you suggested and
updated the KIP.
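To make the renaming concrete, the controller-side gauge would look roughly
like the sketch below. The existing kafka.controller metrics go through
Yammer, but the class name and registration plumbing here are illustrative,
not the final code:

    import java.util.concurrent.atomic.AtomicLong;
    import com.yammer.metrics.Metrics;
    import com.yammer.metrics.core.Gauge;
    import com.yammer.metrics.core.MetricName;

    // Sketch only: registers
    // kafka.controller:type=KafkaController,name=MetadataErrorCount
    public final class ControllerMetadataMetrics {
        private final AtomicLong metadataErrorCount = new AtomicLong(0);

        public ControllerMetadataMetrics() {
            MetricName name = new MetricName(
                "kafka.controller", "KafkaController", "MetadataErrorCount");
            Metrics.defaultRegistry().newGauge(name, new Gauge<Long>() {
                @Override
                public Long value() {
                    return metadataErrorCount.get();
                }
            });
        }

        // Bumped whenever a controller hits an unexpected metadata error.
        public void incrementMetadataErrorCount() {
            metadataErrorCount.incrementAndGet();
        }
    }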
- Niket

> On Aug 3, 2022, at 2:00 PM, Colin McCabe <cmcc...@apache.org> wrote:
>
> I think there are a few different cases here:
>
> 1a. We hit an ApiException PREPARING metadata records on the active
> controller. This is normal and expected. For example, someone tried to
> create a topic that already exists. We translate the ApiException to an
> ApiError and return the appropriate error code to the user.
> (NotControllerException is also included here in 1a.)
>
> 1b. We hit a non-ApiException PREPARING metadata records on the active
> controller. This is not expected. It would be something like getting a
> NullPointerException in a createTopics RPC. In this case, we resign
> leadership and increment your ForceRenounceCount metric.
>
> 2. We hit any exception APPLYING metadata records on the active
> controller. In this case, we just want to exit the process. This prevents
> the (potentially) bad records from ever being committed to the log, since
> the active controller applies the records locally before sending them out
> to its peers.
>
> 3. We hit any exception APPLYING metadata records on a standby
> controller. In this case, we want to increment a metric so that we know
> this happened. But we don't want to exit the process, in case all the
> controllers are hitting the same problem. Remember that the standby
> controller only ever applies committed records.
>
> 4a. We hit any exception constructing the MetadataDelta / new
> MetadataImage on the broker. Increment listener-batch-load-error-count.
>
> 4b. We hit any exception applying the MetadataDelta / new MetadataImage
> on the broker. Increment publisher-error-count.
>
> The KIP is solid on 4a and 4b, but we have no metric here that we can use
> for case #3. The standby controller can't resign because it's not active
> in the first place. So I would suggest that rather than having a
> resignation metric, we just have a general controller metadata error
> metric.
>
> How do you feel about:
>
> ForceRenounceCount =>
>   kafka.controller:type=KafkaController,name=MetadataErrorCount
> publisher-error-count => metadata-apply-error-count
> listener-batch-load-error-count => metadata-load-error-count
>
> best,
> Colin
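As a side note, the case 2 / case 3 split above is easy to see in code
form. A minimal sketch of the controller-side handling; the class and
method names are mine, not the actual QuorumController internals:

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch only: branch on the controller's role, per cases 2 and 3.
    public final class ControllerFaultHandler {
        private final AtomicLong metadataErrorCount = new AtomicLong(0);
        private final boolean active;

        public ControllerFaultHandler(boolean active) {
            this.active = active;
        }

        public void handleApplyFault(Throwable cause) {
            if (active) {
                // Case 2: the active controller applies records before
                // they are committed, so exiting here keeps the bad
                // records out of the metadata log.
                System.err.println("Fatal metadata apply fault: " + cause);
                Runtime.getRuntime().exit(1);
            } else {
                // Case 3: standbys only apply already-committed records;
                // record the error and keep the process alive.
                metadataErrorCount.incrementAndGet();
            }
        }

        public long metadataErrorCount() {
            return metadataErrorCount.get();
        }
    }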
>
> On Tue, Aug 2, 2022, at 18:05, Niket Goel wrote:
>> Thanks for taking the time to go over the KIP Colin.
>>
>> While I agree with both your points about error handling, I think this
>> KIP focuses on just exposing these errors via the proposed metrics and
>> does not alter the error handling behavior on either the brokers or the
>> controllers. The metrics (as proposed) are somewhat independent of that
>> and should just hook into whatever we are doing. If we change the error
>> handling, then the metrics should be hooked into the new code as well.
>>
>> Maybe the point you are trying to make is that having the properties
>> you mention below will generally make it so that errors are not
>> committed to the log, and hence having the metrics for these is
>> sufficient (as the error will likely be transient and recoverable?).
>>
>> Apologies if I did not understand the intent of your comment.
>>
>> - Niket
>>
>>> On Aug 2, 2022, at 3:34 PM, Colin McCabe <cmcc...@apache.org> wrote:
>>>
>>> Hi Niket,
>>>
>>> Thanks for the KIP -- much appreciated! The new metrics look very
>>> useful.
>>>
>>> I agree with the proposed error handling for errors on standby
>>> controllers and brokers. For active controllers, I think we should
>>> establish two points:
>>>
>>> 1. The active controller replays metadata before submitting it to the
>>> Raft quorum.
>>> 2. Metadata replay errors on the active controller cause the process
>>> to exit, prior to attempting to commit the record.
>>>
>>> This will allow most of these metadata replay errors to be noticed and
>>> NOT committed to the metadata log, which I think will make things much
>>> more robust. Since the controller process can be restarted very
>>> quickly, it shouldn't be an undue operational burden. (It's true that
>>> restarts will take longer in combined mode, but this kind of tradeoff
>>> is integral to combined mode -- you get reduced fault isolation in
>>> exchange for the lower overhead of one fewer JVM process.)
>>>
>>> best,
>>> Colin
>>>
>>>
>>> On Mon, Aug 1, 2022, at 18:05, David Arthur wrote:
>>>> Thanks, Niket.
>>>>
>>>> +1 binding from me
>>>>
>>>> -David
>>>>
>>>> On Mon, Aug 1, 2022 at 8:15 PM Niket Goel <ng...@confluent.io.invalid>
>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I would like to start a vote on KIP-859, which adds some new metrics
>>>>> to KRaft to allow for better visibility into log processing errors.
>>>>>
>>>>> KIP —
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-859%3A+Add+Metadata+Log+Processing+Error+Related+Metrics
>>>>>
>>>>> Discussion Thread —
>>>>> https://lists.apache.org/thread/yl87h1s484yc09yjo1no46hwpbv0qkwt
>>>>>
>>>>> Thanks
>>>>> Niket
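P.S. Colin's two numbered points above about the active controller amount
to an ordering guarantee: replay locally first, and treat any replay
failure as fatal before anything is handed to Raft. A minimal sketch of
that ordering; replay and appendToQuorum are made-up stand-ins, not the
actual QuorumController code:

    import java.util.List;

    // Sketch only: replay-before-commit on the active controller.
    public final class ActiveControllerWriteSketch {
        public void handleWrite(List<Object> newRecords) {
            try {
                for (Object record : newRecords) {
                    replay(record); // point 1: replay before Raft submission
                }
            } catch (Exception e) {
                // Point 2: a replay error is fatal and happens before the
                // records can be committed, so they never reach the log.
                System.err.println("Fatal metadata replay error: " + e);
                Runtime.getRuntime().exit(1);
            }
            appendToQuorum(newRecords); // only reached after clean replay
        }

        private void replay(Object record) {
            // Apply the record to the controller's in-memory state here.
        }

        private void appendToQuorum(List<Object> records) {
            // Hand the records to the Raft layer for replication.
        }
    }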