Thanks for the explanation, Colin.

> ForceRenounceCount => 
> kafka.controller:type=KafkaController,name=MetadataErrorCount
> listener-batch-load-error-count => metadata-load-error-count
> publisher-error-count => metadata-apply-error-count

Yeah, this makes sense. I have made the changes in naming you suggested and 
updated the KIP.

- Niket


> On Aug 3, 2022, at 2:00 PM, Colin McCabe <cmcc...@apache.org> wrote:
> 
> I think there are a few different cases here:
> 
> 1a. We hit an ApiException PREPARING metadata records on the active 
> controller. This is normal and expected. For example, someone tried to create 
> a topic that already exists. We translate the ApiException to an ApiError and 
> return the appropriate error code to the user. (NotControllerException is 
> also included here in 1a.)
> 
> 1b. We hit a non-ApiException PREPARING metadata records on the active 
> controller. This is not expected. It would be something like we get a 
> NullPointerException in a createTopics RPC. In this case, we resign 
> leadership and increment your ForceRenounceCount metric.
> 
> 2. We hit any exception APPLYING metadata records on the active controller. 
> In this case, we just want to exit the process. This prevents the 
> (potentially) bad records from ever being committed to the log, since the 
> active controller applies the records locally before sending them out to its 
> peers.
> 
> 3. We hit any exception APPLYING metadata records on a standby controller. In 
> this case, we want to increment a metric so that we know that this happened. 
> But we don't want to exit the process, in case all the controllers are 
> hitting the same problem. Remember that the standby controller only ever 
> applies committed records.
> 
> 4a. We hit any exception constructing the MetadataDelta / new MetadataImage 
> on the broker. Increment listener-batch-load-error-count.
> 
> 4b. We hit any exception applying the MetadataDelta / new MetadataImage on 
> the broker. Increment publisher-error-count.
> 
> The KIP is solid on 4a and 4b, but we have no metric here that we can use for 
> case #3. The standby controller can't resign because it's not active in the 
> first place. So I would suggest rather than having a resignation metric, we 
> just have a general controller metadata error metric.
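> 
> To make that concrete, here is a rough sketch of the dispatch I have in 
> mind for cases 1a/1b/2/3. Every name in it is illustrative, not the 
> actual controller code:
> 
> import java.util.concurrent.atomic.AtomicLong;
> import org.apache.kafka.common.errors.ApiException;
> 
> // Rough sketch of cases 1a/1b/2/3; all names are illustrative.
> class ControllerFaultHandler {
>     enum Stage { PREPARING, APPLYING }
> 
>     private final AtomicLong metadataErrorCount = new AtomicLong();
> 
>     void handle(Throwable t, Stage stage, boolean isActive) {
>         if (stage == Stage.PREPARING) {
>             if (t instanceof ApiException) {
>                 // Case 1a: expected, e.g. topic already exists. Translate
>                 // to an error code for the caller; no metric, no resign.
>                 return;
>             }
>             // Case 1b: unexpected bug while generating records on the
>             // active controller. Resign and count the error.
>             metadataErrorCount.incrementAndGet();
>             resign();
>         } else if (isActive) {
>             // Case 2: the active controller applies records before they
>             // are committed; exiting keeps bad records out of the log.
>             Runtime.getRuntime().halt(1);
>         } else {
>             // Case 3: a standby only applies committed records; count
>             // the error but stay up in case every controller hits it.
>             metadataErrorCount.incrementAndGet();
>         }
>     }
> 
>     private void resign() {
>         // Trigger leadership resignation (stub).
>     }
> }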
> 
> How do you feel about:
> 
> ForceRenounceCount => 
> kafka.controller:type=KafkaController,name=MetadataErrorCount
> listener-batch-load-error-count => metadata-load-error-count
> publisher-error-count => metadata-apply-error-count
> 
> best,
> Colin
> 
> 
> On Tue, Aug 2, 2022, at 18:05, Niket Goel wrote:
>> Thanks for taking the time to go over the KIP Colin.
>> 
>> While I agree with both your points about error handling, I think this 
>> KIP focuses on just exposing these errors via the proposed metrics and 
>> does not alter the error handling behavior on either the brokers or the 
>> controllers. The metrics (as proposed) are somewhat independent of that 
>> and should just hook into whatever we are doing. If we change the error 
>> handling then the metrics should be hooked into the new code as well.
>> 
>> Maybe the point you are trying to make is that having the properties 
>> you mention below will generally keep errors from being committed to 
>> the log, and hence having the metrics for these is sufficient (as the 
>> errors will likely be transient and recoverable?).
>> 
>> Apologies if I did not understand the intent of your comment.
>> 
>> - Niket
>> 
>>> On Aug 2, 2022, at 3:34 PM, Colin McCabe <cmcc...@apache.org> wrote:
>>> 
>>> Hi Niket,
>>> 
>>> Thanks for the KIP -- much appreciated! The new metrics look very useful.
>>> 
>>> I agree with the proposed error handling for errors on standby controllers 
>>> and brokers. For active controllers, I think we should establish two points:
>>> 
>>> 1. the active controller replays metadata before submitting it to the Raft 
>>> quorum
>>> 2. metadata replay errors on the active cause the process to exit, prior to 
>>> attempting to commit the record
>>> 
>>> This will allow most of these metadata replay errors to be noticed and NOT 
>>> committed to the metadata log, which I think will make things much more 
>>> robust. Since the controller process can be restarted very quickly, it 
>>> shouldn't be an undue operational burden. (It's true that when in combined 
>>> mode, restarts will take longer, but this kind of tradeoff is integral to 
>>> combined mode -- you get reduced fault isolation in exchange for the lower 
>>> overhead of one fewer JVM process).
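>>> 
>>> A minimal sketch of that ordering (illustrative names, not the real 
>>> QuorumController code):
>>> 
>>> import java.util.List;
>>> import org.apache.kafka.server.common.ApiMessageAndVersion;
>>> 
>>> // Illustrative sketch: replay each record locally *before* offering
>>> // the batch to the Raft quorum (point 1), and exit the process if
>>> // replay throws (point 2), so a bad record is never committed.
>>> class ActiveControllerWriter {
>>>     interface RaftAppender {
>>>         // Hypothetical thin wrapper around the Raft client's append.
>>>         long append(int epoch, List<ApiMessageAndVersion> records);
>>>     }
>>> 
>>>     private final RaftAppender raft;
>>>     private final int epoch;
>>> 
>>>     ActiveControllerWriter(RaftAppender raft, int epoch) {
>>>         this.raft = raft;
>>>         this.epoch = epoch;
>>>     }
>>> 
>>>     long write(List<ApiMessageAndVersion> records) {
>>>         for (ApiMessageAndVersion record : records) {
>>>             try {
>>>                 replay(record); // update in-memory state first
>>>             } catch (Throwable t) {
>>>                 // Failing here is cheap: the records were never
>>>                 // submitted, and a controller restart is fast.
>>>                 Runtime.getRuntime().halt(1);
>>>             }
>>>         }
>>>         return raft.append(epoch, records); // only now submit
>>>     }
>>> 
>>>     private void replay(ApiMessageAndVersion record) {
>>>         // Apply the record to in-memory state (stub).
>>>     }
>>> }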
>>> 
>>> best,
>>> Colin
>>> 
>>> 
>>> On Mon, Aug 1, 2022, at 18:05, David Arthur wrote:
>>>> Thanks, Niket.
>>>> 
>>>> +1 binding from me
>>>> 
>>>> -David
>>>> 
>>>> On Mon, Aug 1, 2022 at 8:15 PM Niket Goel <ng...@confluent.io.invalid> 
>>>> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I would like to start a vote on KIP-859 which adds some new metrics to 
>>>>> KRaft to allow for better visibility into log processing errors.
>>>>> 
>>>>> KIP — 
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-859%3A+Add+Metadata+Log+Processing+Error+Related+Metrics
>>>>> Discussion Thread — 
>>>>> https://lists.apache.org/thread/yl87h1s484yc09yjo1no46hwpbv0qkwt
>>>>> 
>>>>> Thanks
>>>>> Niket
>>>>> 
