Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-11 Thread Becket Qin
Hi Ismael,

Yes, a follow-up KIP after the controller code settles down sounds good.

Thanks,

Jiangjie (Becket) Qin


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-10 Thread Ismael Juma
Thanks, Jun. I discussed this offline with Onur and Jun, and I believe there's
agreement, so I updated the KIP:

https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=69407758&selectedPageVersions=8&selectedPageVersions=7

Ismael


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-10 Thread Jun Rao
Hi, Onur,

We probably don't want to do the 1-to-1 mapping from the event type to the
controller state since some of the event types are implementation details.
How about the following mapping?

0 - idle
1 - controller change (Startup, ControllerChange, Reelect)
2 - broker change (BrokerChange)
3 - topic creation/change (TopicChange, PartitionModifications)
4 - topic deletion (TopicDeletion, TopicDeletionStopReplicaResult)
5 - partition reassigning (PartitionReassignment,
PartitionReassignmentIsrChange)
6 - auto leader balancing (AutoPreferredReplicaLeaderElection)
7 - manual leader balancing (PreferredReplicaLeaderElection)
8 - controlled shutdown (ControlledShutdown)
9 - isr change (IsrChangeNotification)

For each state, we will add a corresponding timer to track the rate and the
latency, if it's not there already (e.g., broker change and controlled
shutdown). If there are future changes to the controller, we can make a
call whether the new event should be mapped to one of the existing states
or a new state.
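
For illustration, here is a rough sketch (in Scala) of what that grouping could
look like as a sealed set of controller states. The names and values below are
only assumptions for the purpose of the example, not the KIP's final naming:

// Sketch only: one possible grouping of the controller event types into the
// ten proposed states.
sealed abstract class ControllerState(val value: Int)

object ControllerState {
  case object Idle extends ControllerState(0)
  case object ControllerChange extends ControllerState(1)       // Startup, ControllerChange, Reelect
  case object BrokerChange extends ControllerState(2)           // BrokerChange
  case object TopicChange extends ControllerState(3)            // TopicChange, PartitionModifications
  case object TopicDeletion extends ControllerState(4)          // TopicDeletion, TopicDeletionStopReplicaResult
  case object PartitionReassignment extends ControllerState(5)  // PartitionReassignment, PartitionReassignmentIsrChange
  case object AutoLeaderBalance extends ControllerState(6)      // AutoPreferredReplicaLeaderElection
  case object ManualLeaderBalance extends ControllerState(7)    // PreferredReplicaLeaderElection
  case object ControlledShutdown extends ControllerState(8)     // ControlledShutdown
  case object IsrChange extends ControllerState(9)              // IsrChangeNotification
}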

Thanks,

Jun


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-09 Thread Onur Karaman
@Ismael, Jun
I've brought up an earlier point twice now and it still doesn't feel like
it's been commented on or addressed, so I'm going to give it one more shot:
Assuming that ControllerState should reflect the current event being
processed, the KIP is missing states.

The controller currently has 14 event types:
BrokerChange
TopicChange
PartitionModifications
TopicDeletion
PartitionReassignment
PartitionReassignmentIsrChange
IsrChangeNotification
PreferredReplicaLeaderElection
AutoPreferredReplicaLeaderElection
ControlledShutdown
TopicDeletionStopReplicaResult
Startup
ControllerChange
Reelect

The KIP only shows 10 event types (and it's not a proper subset of the
above set).

I think this mismatch would cause the ControllerState to incorrectly be in
the Idle state when in fact the controller could be doing a lot of work.

1. Should ControllerState exactly consist of the 14 controller event types
+ the 1 Idle state?
2. If so, what's the process for adding/removing/merging event types w.r.t.
this metric?


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-09 Thread Ismael Juma
Becket,

Are you OK with extending the metrics via a subsequent KIP (assuming that
what we're doing here doesn't prevent that)? The KIP freeze is tomorrow
(although I will give it an extra day or two as many in the community have
been attending the Kafka Summit this week), so we should avoid increasing
the scope unless it's important for future improvements.

Thanks,
Ismael


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-09 Thread Jun Rao
Hi, Becket,

q10. The reason there is no timer metric for the broker change event is
that the controller already has a LeaderElectionRateAndTimeMs timer metric
(in ControllerStats).
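
For anyone not familiar with that metric, here is a minimal sketch of how a
Yammer rate-and-time timer is created and used. This is an illustration only,
not the actual ControllerStats code, and the object and helper names are made up:

import java.util.concurrent.TimeUnit
import com.yammer.metrics.Metrics
import com.yammer.metrics.core.MetricName

object ControllerStatsSketch {
  // A Yammer Timer tracks both the rate (events/sec) and the latency distribution (ms).
  val leaderElectionRateAndTimeMs = Metrics.defaultRegistry().newTimer(
    new MetricName("kafka.controller", "ControllerStats", "LeaderElectionRateAndTimeMs"),
    TimeUnit.MILLISECONDS, TimeUnit.SECONDS)

  // Time a block of work, e.g. the leader election triggered by a broker change.
  def timed[T](body: => T): T = {
    val ctx = leaderElectionRateAndTimeMs.time()
    try body
    finally ctx.stop()
  }
}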

q11. I agree that it's useful to know the queue time in the controller
event queue and suggested that earlier. Onur thinks that it's a bit too
early to add that since we are about to change how to queue events from ZK.
Similarly, we will probably also make changes to batch requests from the
controller to the broker. So, perhaps we can add more metrics once those
changes in the controller have been made. For now, knowing the controller
state and the controller channel queue size is probably good enough.

Thanks,

Jun




Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-08 Thread Becket Qin
@Ismael,

About the stages and event types: yes, I think the handling of each event
should have those stages covered. It is similar to what we are doing for the
requests on the broker side. We have benefited a lot from such a systematic
metric structure, so I think it would be worth following the same approach in
the controller.

As an example, for BrokerChangeEvent (or any event), I am thinking we would
have the following metrics:
1. EventRate
2. EventQueueTime: the event queue time
3. EventHandlingTime: the event handling time (including ZK path updates)
4. EventControllerChannelQueueTime: the queue time of the corresponding
LeaderAndIsrRequest and UpdateMetadataRequest in the controller channel queue
5. EventControllerChannelSendTime: the time to send the corresponding
requests; this could be the total time for the corresponding LeaderAndIsrRequest
and UpdateMetadataRequest. It is typically small, but in some cases it could
be slower than we expect.
6. EventAckTime: the time waiting for all the corresponding responses
7. EventHandlingTotalTime: the sum of 2-6

Note that all the metrics are from the point of view of the event and the
cluster-wide state transition, not of a single request type. Among the above
metrics, 4, 5, and 6 could potentially be per-broker.
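
To make item 2 (and the QueueSize/QueueTimeMs metrics Jun suggested) concrete,
here is a minimal sketch that assumes each event is timestamped when it is
enqueued; the class and metric names are illustrative only, not actual controller code:

import java.util.concurrent.LinkedBlockingQueue
import com.yammer.metrics.Metrics
import com.yammer.metrics.core.{Gauge, MetricName}

object ControllerEventQueueSketch {
  // Each event carries the time at which it was put on the queue.
  case class QueuedEvent(name: String, enqueueTimeMs: Long = System.currentTimeMillis())

  private val queue = new LinkedBlockingQueue[QueuedEvent]()
  private val registry = Metrics.defaultRegistry()

  // QueueSize: a gauge that simply reads the current queue depth.
  registry.newGauge(
    new MetricName("kafka.controller", "ControllerStats", "QueueSize"),
    new Gauge[Int] { def value: Int = queue.size })

  // QueueTimeMs: a histogram of how long events wait before being processed.
  private val queueTimeMs = registry.newHistogram(
    new MetricName("kafka.controller", "ControllerStats", "QueueTimeMs"), true)

  def put(event: QueuedEvent): Unit = queue.put(event)

  // Called by the single event thread; records the event queue time before handling.
  def takeNext(): QueuedEvent = {
    val event = queue.take()
    queueTimeMs.update(System.currentTimeMillis() - event.enqueueTimeMs)
    event
  }
}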

Thanks,

Jiangjie (Becket) Qin



Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-08 Thread Onur Karaman
I had a similar comment to Becket but accidentally posted it on the vote
thread last Friday. From that thread:
"I noticed that both the ControllerState metric and the *RateAndTimeMs metrics
only cover a subset of the controller event types. Was this intentional?"

I think it makes most sense to just have the ControllerState metric and
*RateAndTimeMs metrics exactly match up with the event types.



Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-08 Thread Ismael Juma
Hi Becket,

Thanks for the feedback. Comments inline.

On Tue, May 9, 2017 at 12:19 AM, Becket Qin  wrote:
>
> q10. With the event-loop-based controller design, it seems natural to have
> the processing time for each controller event type. In that case, the current
> metrics don't seem to cover all the event processing time, e.g. broker
> joining and broker failure event handling time.
>

I'll leave it to Jun to explain why some metrics were not included in the
proposal.

> q11. There are actually a couple of stages for controller state transition
> and propagation. More specifically:
> Stage 1: Event queue time
> Stage 2: Event process time (including all the ZK path updates)
> Stage 3: State propagation time (including the state propagation queuing
> time on the controller and the time from request sent to response
> received)
>
> I think it's worth having metrics for each of the stages. Stage 3 might be a
> little tricky because we may need to manage per-broker propagation time.
> Arguably the state propagation time overlaps a little with the broker-side
> request handling time, but for state changes that involve multiple types of
> requests, it could still be useful to know how long it takes for a state
> transition to be propagated to the entire cluster.
>

Can you please elaborate on how you would like to see this measured? Do you
mean that each event would have a separate metric for each of these stages?
Maybe a worked-out example would make it clear.

> q12. Regarding (13), the controller actually does not update the ISR but just
> reads it. The error message seems to come from the brokers. It usually
> happens in the following case:
> 1. The current leader broker has the cached ZNode version V.
> 2. The controller elects a new leader and updates the ZNode, so the ZNode
> version becomes V+1.
> 3. The controller sends the LeaderAndIsrRequest to the replica brokers to
> propagate the new leader information as well as the new zkVersion.
> 4. Before the old leader processes the LeaderAndIsrRequest from the
> controller in step 3, the old leader tries to update the ISR and finds that
> the cached zkVersion V is different from the actual zkVersion V+1.
>
> It looks like this is not a controller metric.


Yes, it's not a controller metric, that's why it's under the "Partition
Metrics" section. In the PR, I actually implemented it in ReplicaManager
alongside IsrExpandsPerSec and IsrShrinksPerSec.
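
For reference, here is a minimal sketch of how such a partition-level meter can
be registered with the Yammer API; the object and method names are assumed for
illustration and are not the actual ReplicaManager code:

import java.util.concurrent.TimeUnit
import com.yammer.metrics.Metrics
import com.yammer.metrics.core.MetricName

object IsrUpdateMetricsSketch {
  // Marked each time an ISR update is rejected because the cached zkVersion is stale.
  val failedIsrUpdatesPerSec = Metrics.defaultRegistry().newMeter(
    new MetricName("kafka.server", "ReplicaManager", "FailedIsrUpdatesPerSec"),
    "failedUpdates", TimeUnit.SECONDS)

  def recordFailedIsrUpdate(): Unit = failedIsrUpdatesPerSec.mark()
}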

Thanks,
Ismael


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-08 Thread Becket Qin
Hi Ismael and Jun,

Thanks for the KIP. It is very useful. A couple of comments:

q10. With the event-loop-based controller design, it seems natural to have the
processing time for each controller event type. In that case, the current
metrics don't seem to cover all the event processing time, e.g. broker joining
and broker failure event handling time.

q11. There are actually a couple of stages for controller state transition
and propagation. More specifically:
Stage 1: Event queue time
Stage 2: Event process time (including all the ZK path updates)
Stage 3: State propagation time (including the state propagation queuing
time on the controller and the time from request sent to response received)

I think it's worth having metrics for each of the stages. Stage 3 might be a
little tricky because we may need to manage per-broker propagation time.
Arguably the state propagation time overlaps a little with the broker-side
request handling time, but for state changes that involve multiple types of
requests, it could still be useful to know how long it takes for a state
transition to be propagated to the entire cluster.

q12. Regarding (13), the controller actually does not update the ISR but just
reads it. The error message seems to come from the brokers. It usually happens
in the following case:
1. The current leader broker has the cached ZNode version V.
2. The controller elects a new leader and updates the ZNode, so the ZNode
version becomes V+1.
3. The controller sends the LeaderAndIsrRequest to the replica brokers to
propagate the new leader information as well as the new zkVersion.
4. Before the old leader processes the LeaderAndIsrRequest from the controller
in step 3, the old leader tries to update the ISR and finds that the cached
zkVersion V is different from the actual zkVersion V+1.

It looks like this is not a controller metric. I tend to think that if we
have good metrics on the controller side that cover the end-to-end state
propagation (i.e. until the LeaderAndIsrResponse is received), we don't need
to worry about this intermittent ISR update failure. If it keeps happening,
we should see a long state propagation time reported by the controller or a
long LeaderAndIsrRequest queue time.
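
To make the failure in step 4 concrete, here is a minimal sketch of the
conditional ZooKeeper write involved; the object and helper names are made up
for illustration and this is not the actual broker code:

import org.apache.zookeeper.{KeeperException, ZooKeeper}

object IsrUpdateSketch {
  // Returns the new zkVersion on success, or None if the cached version is stale,
  // which corresponds to the intermittent ISR update failure described above.
  def tryUpdateIsr(zk: ZooKeeper, path: String, newIsr: Array[Byte], cachedZkVersion: Int): Option[Int] = {
    try {
      val stat = zk.setData(path, newIsr, cachedZkVersion) // conditional on the expected version
      Some(stat.getVersion)
    } catch {
      case _: KeeperException.BadVersionException =>
        None // the znode is already at V+1; wait for the LeaderAndIsrRequest to refresh the cache
    }
  }
}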

Thanks,

Jiangjie (Becket) Qin


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-08 Thread Joel Koshy
Hi Ismael,


> > What about a broker that is not the controller? Would you need a separate
> > idle-not-controller state?
>
>
> Do we need a separate state or can users just use the ActiveControllerCount
> metric to check if the broker is the controller?
>

Sure - the ACC metric should be sufficient.


> > Given that most of the state changes are short
> > we would just see blips in the best case and nothing in the worst case
> > (depending on how often metrics get sampled). It would only help if you
> > want to visually detect any transitions that are taking an inordinate
> > duration.
> >
>
> Right. Do you think this is not worth it?
>

Not sure - I felt the RateAndTime sensors are sufficient and would give more
intuitive results. E.g., users would emit the state metric as a gauge, but the
underlying metric reporter would need to sample the metric frequently enough
to capture state changes. I.e., for most metrics backends you would likely see
a flat line even through a series of (fast) state changes.


>
> Ok - my thought was since we are already using kafka-metrics for quotas and
> > selector metrics we could just do the same for this (and any *new*
> metrics
> > on the broker).
> >
>
> I don't have a strong opinion either way as I think both options have pros
> and cons. It seems like we don't have the concept of a Gauge in Kafka
> Metrics at the moment, so it seems like it would be easier to do it via
> Yammer metrics while we discuss what's the best way to unify all the
> metrics again.
>

Sounds good.


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-04 Thread Ismael Juma
Hi Joel,

Thanks for the feedback. Comments inline.

On Wed, May 3, 2017 at 7:52 PM, Joel Koshy  wrote:

> Yes - that is roughly what I was thinking (although deletes are no longer
> long running). Also, what is the "steady-state" controller state? Idle?
>

Yes.


> What about a broker that is not the controller? Would you need a separate
> idle-not-controller state?


Do we need a separate state or can users just use the ActiveControllerCount
metric to check if the broker is the controller?


> Given that most of the state changes are short
> we would just see blips in the best case and nothing in the worst case
> (depending on how often metrics get sampled). It would only help if you
> want to visually detect any transitions that are taking an inordinate
> duration.
>

Right. Do you think this is not worth it?

Ok - my thought was since we are already using kafka-metrics for quotas and
> selector metrics we could just do the same for this (and any *new* metrics
> on the broker).
>

I don't have a strong opinion either way as I think both options have pros
and cons. It seems like we don't have the concept of a Gauge in Kafka
Metrics at the moment, so it seems like it would be easier to do it via
Yammer metrics while we discuss what's the best way to unify all the
metrics again.
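
For what it's worth, here is a minimal sketch of what the Yammer-based gauge
could look like; the metric group/type names and the surrounding object are
assumptions for illustration only:

import java.util.concurrent.atomic.AtomicInteger
import com.yammer.metrics.Metrics
import com.yammer.metrics.core.{Gauge, MetricName}

object ControllerStateGaugeSketch {
  // Updated by the controller event thread; 0 corresponds to the idle state.
  private val currentState = new AtomicInteger(0)

  // Yammer has a Gauge type, so exposing the current state is straightforward.
  Metrics.defaultRegistry().newGauge(
    new MetricName("kafka.controller", "KafkaController", "ControllerState"),
    new Gauge[Int] { def value: Int = currentState.get })

  def setState(state: Int): Unit = currentState.set(state)
}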

Thanks,
Ismael


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-04 Thread Ismael Juma
Hi Onur,

Comments inline.

On Wed, May 3, 2017 at 6:54 PM, Onur Karaman 
wrote:

> Regarding the ControllerState and the potential for overlap, I think it
> depends on our definition of controller state. While KAFKA-5028 allows only
> a single ControllerEvent to be processed at a time, it still allows
> interleavings for long-lasting actions like partition reassignment and
> topic deletion. For instance, a topic can get created while another topic
> is undergoing partition reassignment. In that sense, there is overlap.
> However, in the sense of the ControllerEvents being processed, there can be
> no overlap.
>

Yes, the proposal is for the event that is currently being processed, so no
overlap. I clarified in the KIP.

>
> I also think adding the QueueSize and QueueTimeMs metrics could be a
> premature move. I completely agree that these metrics would be valuable
> given KAFKA-5028. However, I'm not sure whether the controller event queue
> and controller thread as implemented today are actually here to stay or if
> it's merely a first step in the controller redesign. Especially when
> considering the possibility of moving away from the simple synchronous
> zookeeper apis and having better control over handling zookeeper
> disconnects and session expirations, it's possible that the queue and
> thread could actually get ripped out of the controller and become part of
> something more general, making these two controller-level metrics invalid.
>

That's fair, I moved these metrics to "Future work" with an explanation why.

https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=69407758&selectedPageVersions=7&selectedPageVersions=6

Ismael


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-03 Thread Joel Koshy
On Wed, May 3, 2017 at 10:54 AM, Onur Karaman 
wrote:

> Regarding the ControllerState and the potential for overlap, I think it
> depends on our definition of controller state. While KAFKA-5028 allows only
> a single ControllerEvent to be processed at a time, it still allows
> interleavings for long-lasting actions like partition reassignment and
> topic deletion. For instance, a topic can get created while another topic
> is undergoing partition reassignment. In that sense, there is overlap.
> However, in the sense of the ControllerEvents being processed, there can be
> no overlap.
>

Yes - that is roughly what I was thinking (although deletes are no longer
long running). Also, what is the "steady-state" controller state? Idle?
What about a broker that is not the controller? Would you need a separate
idle-not-controller state? Given that most of the state changes are short
we would just see blips in the best case and nothing in the worst case
(depending on how often metrics get sampled). It would only help if you
want to visually detect any transitions that are taking an inordinate
duration.



> > 1. Yes, the long term goal is to migrate the metrics on the broker to
>> > kafka-metrics. Since many people are using Yammer reporters, we probably
>> > need to support a few popular ones in kafka-metrics before migrating.
>> Until
>> > that happens, we probably want to stick with the Yammer metrics on the
>> > server side unless we depend on features from kafka-metrics (e.g,
>> quota).
>>
>
Ok - my thought was since we are already using kafka-metrics for quotas and
selector metrics we could just do the same for this (and any *new* metrics
on the broker).


> 4. Metrics #2 and #3. The issue with relying on metric #1 is that the
>> > latter is sensitive to the frequency of metric collection. For example,
>> if
>> > the starting of the controller takes 30 secs and the metric is only
>> > collected once a minute, one may not know the latency with just metric
>> #1,
>> > but will know the latency with metrics #2 and #3. Are you concerned
>> about
>> > the memory overhead of histograms? It doesn't seem that a couple of more
>> > histograms will hurt.
>>
>
No I don't have concerns about the histograms - just wondering if it is
useful enough to have these in the first place, but your summary makes
sense.

Joel


> >
>> > Hi, Ismael,
>> >
>> > Thanks for the proposal. A couple more comments:
>> >
>> > 10. It would be useful to add a new metric for the controller queue size:
>> > kafka.controller:type=ControllerStats,name=QueueSize
>> >
>> > 11. It would also be useful to know how long an event waits in the
>> > controller queue before being processed. Perhaps we can add a histogram
>> > metric like the following:
>> > kafka.controller:type=ControllerStats,name=QueueTimeMs
>> >
>> > Jun
>> >
>> > On Thu, Apr 27, 2017 at 11:39 AM, Joel Koshy 
>> wrote:
>> >
>> > > Thanks for the KIP - couple of comments:
>> > > - Do you intend to actually use yammer metrics? or use kafka-metrics
>> and
>> > > split the timer into an explicit rate and time? I think long term we
>> > ought
>> > > to move off yammer and use kafka-metrics only. Actually either is
>> fine,
>> > but
>> > > we should ideally use only one in the long term - and I thought the
>> plan
>> > > was to use kafka-metrics.
>> > > - metric #9 appears to be redundant since we already have per-API
>> request
>> > > rate and time metrics.
>> > > - Same for metric #4, #5 (as there are request stats for
>> > > DeleteTopicRequest - although it is possible for users to trigger
>> deletes
>> > > via ZK)
>> > > - metric #2, #3 are potentially useful, but a bit overkill for a
>> > > histogram. Alternative is to stick to last known value, but that
>> doesn't
>> > > play well with alerts if a high value isn't reset/decayed. Perhaps
>> metric
>> > > #1 would be sufficient to gauge slow start/resignation transitions.
>> > > - metric #1 - some of the states may actually overlap
>> > > - I don't actually understand the semantics of metric #6. Is it rate
>> of
>> > > partition reassignment triggers? does the number of partitions matter?
>> > >
>> > > Joel
>> > >
>> > > On Thu, Apr 27, 2017 at 8:04 AM, Tom Crayford 
>> > > wrote:
>> > >
>> > >> Ismael,
>> > >>
>> > >> Great, that sounds lovely.
>> > >>
>> > >> I'd like a `Timer` (using yammer metrics parlance) over how long it
>> took
>> > >> to
>> > >> process the event, so we can get at p99 and max times spent
>> processing
>> > >> things. Maybe we could even do a log at warning level if event
>> > processing
>> > >> takes over some timeout?
>> > >>
>> > >> Thanks
>> > >>
>> > >> Tom
>> > >>
>> > >> On Thu, Apr 27, 2017 at 3:59 PM, Ismael Juma 
>> wrote:
>> > >>
>> > >> > Hi Tom,
>> > >> >
>> > >> > Yes, the plan is to merge KAFKA-5028 first and then use a lock-free
>> > >> > approach for the new  metrics. I considered mentioning that in the
>> KIP
>> > >> > given KAFKA-5120, but didn't in the end. I'll add it to make

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-03 Thread Onur Karaman
Regarding the ControllerState and the potential for overlap, I think it
depends on our definition of controller state. While KAFKA-5028 allows only
a single ControllerEvent to be processed at a time, it still allows
interleavings for long-lasting actions like partition reassignment and
topic deletion. For instance, a topic can get created while another topic
is undergoing partition reassignment. In that sense, there is overlap.
However, in the sense of the ControllerEvents being processed, there can be
no overlap.

I also think adding the QueueSize and QueueTimeMs metrics could be a
premature move. I completely agree that these metrics would be valuable
given KAFKA-5028. However, I'm not sure whether the controller event queue
and controller thread as implemented today are actually here to stay or if
it's merely a first step in the controller redesign. Especially when
considering the possibility of moving away from the simple synchronous
zookeeper apis and having better control over handling zookeeper
disconnects and session expirations, it's possible that the queue and
thread could actually get ripped out of the controller and become part of
something more general, making these two controller-level metrics invalid.

On Wed, May 3, 2017 at 7:55 AM, Ismael Juma  wrote:

> Thanks for the feedback Tom, Joel and Jun.
>
> I updated the KIP in the following way:
>
> 1. Removed ControlledShutdownRateAndTimeMs
> 2. Added QueueSize and QueueTimeMs
> 3. Renamed FailedIsrUpdateRate to FailedIsrUpdatesPerSec for consistency
> with other metrics in the Partition class
> 4. Mentioned that Yammer metrics will be used
> 5. Noted that no Controller locks are acquired when retrieving metric
> values
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=69407758&selectedPageVersions=6&selectedPageVersions=4
>
> If there are no additional concerns, I will start a vote tomorrow.
>
> Ismael
>
> On Tue, May 2, 2017 at 2:35 AM, Jun Rao  wrote:
>
> > Hi, Joel,
> >
> > 1. Yes, the long term goal is to migrate the metrics on the broker to
> > kafka-metrics. Since many people are using Yammer reporters, we probably
> > need to support a few popular ones in kafka-metrics before migrating.
> Until
> > that happens, we probably want to stick with the Yammer metrics on the
> > server side unless we depend on features from kafka-metrics (e.g, quota).
> >
> > 2. Thanks to Onur, we have now moved to a single-threaded model. So, none
> > of the events in metric #1 will overlap.
> >
> > 3. Metric #6 just tracks the rate/latency each time the controller is
> > called to initiate or resume the processing of a reassignment request.
> >
> > 4. Metrics #2 and #3. The issue with relying on metric #1 is that the
> > latter is sensitive to the frequency of metric collection. For example,
> if
> > the starting of the controller takes 30 secs and the metric is only
> > collected once a minute, one may not know the latency with just metric
> #1,
> > but will know the latency with metrics #2 and #3. Are you concerned about
> > the memory overhead of histograms? It doesn't seem that a couple of more
> > histograms will hurt.
> >
> > 5. Metric #9. Agreed. After KAFKA-5028, this will be reflected in the
> > remoteTimeMs of the controlled shutdown request.
> >
> > 6. Metrics #4 and #5. They are actually a bit different. The local time
> of
> > createTopic/deleteTopic just includes the time to add/delete the topic
> path
> > in ZK. The remote time includes the time that the controller processes
> the
> > request plus the time for the metadata to be propagated back to the
> > controller. So, knowing just the portion of the time spent in the
> > controller can still be useful.
> >
> > Hi, Ismael,
> >
> > Thanks for the proposal. A couple more comments:
> >
> > 10. It would be useful to add a new metric for the controller queue size.
> > kafka.controller:type=ControllerStats,name=QueueSize
> >
> > 11. It would also be useful to know how long an event waits in the
> > controller queue before being processed. Perhaps we can add a histogram
> > metric like the following.
> > kafka.controller:type=ControllerStats,name=QueueTimeMs
> >
> > Jun
> >
> > On Thu, Apr 27, 2017 at 11:39 AM, Joel Koshy 
> wrote:
> >
> > > Thanks for the KIP - couple of comments:
> > > - Do you intend to actually use yammer metrics? or use kafka-metrics
> and
> > > split the timer into an explicit rate and time? I think long term we
> > ought
> > > to move off yammer and use kafka-metrics only. Actually either is fine,
> > but
> > > we should ideally use only one in the long term - and I thought the
> plan
> > > was to use kafka-metrics.
> > > - metric #9 appears to be redundant since we already have per-API
> request
> > > rate and time metrics.
> > > - Same for metric #4, #5 (as there are request stats for
> > > DeleteTopicRequest - although it is possible for users to trigger
> deletes
> > > via ZK)
> > > - metric #2, #3 are potentially useful, bu

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-03 Thread Ismael Juma
Thanks for the feedback Tom, Joel and Jun.

I updated the KIP in the following way:

1. Removed ControlledShutdownRateAndTimeMs
2. Added QueueSize and QueueTimeMs
3. Renamed FailedIsrUpdateRate to FailedIsrUpdatesPerSec for consistency
with other metrics in the Partition class
4. Mentioned that Yammer metrics will be used
5. Noted that no Controller locks are acquired when retrieving metric values

https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=69407758&selectedPageVersions=6&selectedPageVersions=4

If there are no additional concerns, I will start a vote tomorrow.

Ismael

On Tue, May 2, 2017 at 2:35 AM, Jun Rao  wrote:

> Hi, Joel,
>
> 1. Yes, the long term goal is to migrate the metrics on the broker to
> kafka-metrics. Since many people are using Yammer reporters, we probably
> need to support a few popular ones in kafka-metrics before migrating. Until
> that happens, we probably want to stick with the Yammer metrics on the
> server side unless we depend on features from kafka-metrics (e.g, quota).
>
> 2. Thanks to Onur, we have now moved to a single-threaded model. So, none
> of the events in metric #1 will overlap.
>
> 3. Metric #6 just tracks the rate/latency each time the controller is called
> to initiate or resume the processing of a reassignment request.
>
> 4. Metrics #2 and #3. The issue with relying on metric #1 is that the
> latter is sensitive to the frequency of metric collection. For example, if
> the starting of the controller takes 30 secs and the metric is only
> collected once a minute, one may not know the latency with just metric #1,
> but will know the latency with metrics #2 and #3. Are you concerned about
> the memory overhead of histograms? It doesn't seem that a couple of more
> histograms will hurt.
>
> 5. Metric #9. Agreed. After KAFKA-5028, this will be reflected in the
> remoteTimeMs of the controlled shutdown request.
>
> 6. Metrics #4 and #5. They are actually a bit different. The local time of
> createTopic/deleteTopic just includes the time to add/delete the topic path
> in ZK. The remote time includes the time that the controller processes the
> request plus the time for the metadata to be propagated back to the
> controller. So, knowing just the portion of the time spent in the
> controller can still be useful.
>
> Hi, Ismael,
>
> Thanks for the proposal. A couple more comments:
>
> 10. It would be useful to add a new metric for the controller queue size.
> kafka.controller:type=ControllerStats,name=QueueSize
>
> 11. It would also be useful to know how long an event waits in the
> controller queue before being processed. Perhaps we can add a histogram
> metric like the following.
> kafka.controller:type=ControllerStats,name=QueueTimeMs
>
> Jun
>
> On Thu, Apr 27, 2017 at 11:39 AM, Joel Koshy  wrote:
>
> > Thanks for the KIP - couple of comments:
> > - Do you intend to actually use yammer metrics? or use kafka-metrics and
> > split the timer into an explicit rate and time? I think long term we
> ought
> > to move off yammer and use kafka-metrics only. Actually either is fine,
> but
> > we should ideally use only one in the long term - and I thought the plan
> > was to use kafka-metrics.
> > - metric #9 appears to be redundant since we already have per-API request
> > rate and time metrics.
> > - Same for metric #4, #5 (as there are request stats for
> > DeleteTopicRequest - although it is possible for users to trigger deletes
> > via ZK)
> > - metric #2, #3 are potentially useful, but a bit overkill for a
> > histogram. Alternative is to stick to last known value, but that doesn't
> > play well with alerts if a high value isn't reset/decayed. Perhaps metric
> > #1 would be sufficient to gauge slow start/resignation transitions.
> > - metric #1 - some of the states may actually overlap
> > - I don't actually understand the semantics of metric #6. Is it rate of
> > partition reassignment triggers? does the number of partitions matter?
> >
> > Joel
> >
> > On Thu, Apr 27, 2017 at 8:04 AM, Tom Crayford 
> > wrote:
> >
> >> Ismael,
> >>
> >> Great, that sounds lovely.
> >>
> >> I'd like a `Timer` (using yammer metrics parlance) over how long it took
> >> to
> >> process the event, so we can get at p99 and max times spent processing
> >> things. Maybe we could even do a log at warning level if event
> processing
> >> takes over some timeout?
> >>
> >> Thanks
> >>
> >> Tom
> >>
> >> On Thu, Apr 27, 2017 at 3:59 PM, Ismael Juma  wrote:
> >>
> >> > Hi Tom,
> >> >
> >> > Yes, the plan is to merge KAFKA-5028 first and then use a lock-free
> >> > approach for the new  metrics. I considered mentioning that in the KIP
> >> > given KAFKA-5120, but didn't in the end. I'll add it to make it clear.
> >> >
> >> > Regarding locks, they are removed by KAFKA-5028, as you say. So, if I
> >> > understand correctly, you are suggesting an event processing rate
> metric
> >> > with event type as a tag? Onur and Jun, what do you think?
> >> >
> >> > Ismael
>

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-01 Thread Jun Rao
Hi, Joel,

1. Yes, the long term goal is to migrate the metrics on the broker to
kafka-metrics. Since many people are using Yammer reporters, we probably
need to support a few popular ones in kafka-metrics before migrating. Until
that happens, we probably want to stick with the Yammer metrics on the
server side unless we depend on features from kafka-metrics (e.g, quota).

2. Thanks to Onur, we have now moved to a single-threaded model. So, none
of the events in metric #1 will overlap.

3. Metric #6 just tracks the rate/latency each time the controller is called
to initiate or resume the processing of a reassignment request.

4. Metrics #2 and #3. The issue with relying on metric #1 is that the
latter is sensitive to the frequency of metric collection. For example, if
the starting of the controller takes 30 secs and the metric is only
collected once a minute, one may not know the latency with just metric #1,
but will know the latency with metrics #2 and #3. Are you concerned about
the memory overhead of histograms? It doesn't seem that a couple of more
histograms will hurt.

5. Metric #9. Agreed. After KAFKA-5028, this will be reflected in the
remoteTimeMs of the controlled shutdown request.

6. Metrics #4 and #5. They are actually a bit different. The local time of
createTopic/deleteTopic just includes the time to add/delete the topic path
in ZK. The remote time includes the time that the controller processes the
request plus the time for the metadata to be propagated back to the
controller. So, knowing just the portion of the time spent in the
controller can still be useful.

Hi, Ismael,

Thanks for the proposal. A couple more comments:

10. It would be useful to add a new metric for the controller queue size.
kafka.controller:type=ControllerStats,name=QueueSize

11. It would also be useful to know how long an event waits in the
controller queue before being processed. Perhaps we can add a histogram
metric like the following.
kafka.controller:type=ControllerStats,name=QueueTimeMs
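
[Editor's note: to make the suggested metrics concrete, here is a minimal Java sketch against the Yammer 2.x API showing how a QueueSize gauge and QueueTimeMs histogram for the controller event queue might be registered. The class, field, and method names are invented for illustration; the actual implementation lives in Kafka's Scala controller code and may register metrics differently, e.g. via KafkaMetricsGroup.]

```java
import com.yammer.metrics.Metrics;
import com.yammer.metrics.core.Gauge;
import com.yammer.metrics.core.Histogram;
import com.yammer.metrics.core.MetricName;

import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical names throughout; illustration only, not the KIP's implementation.
public class ControllerQueueMetricsSketch {

    private final LinkedBlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();

    // kafka.controller:type=ControllerStats,name=QueueSize
    private final Gauge<Integer> queueSize = Metrics.newGauge(
            new MetricName("kafka.controller", "ControllerStats", "QueueSize"),
            new Gauge<Integer>() {
                @Override
                public Integer value() {
                    // Reads the queue size directly; no controller lock is taken.
                    return eventQueue.size();
                }
            });

    // kafka.controller:type=ControllerStats,name=QueueTimeMs
    private final Histogram queueTimeMs = Metrics.newHistogram(
            new MetricName("kafka.controller", "ControllerStats", "QueueTimeMs"), true);

    // Called by the single controller thread when it dequeues an event whose
    // enqueue timestamp was recorded when the event was added to the queue.
    void recordQueueTime(long enqueueTimeMs) {
        queueTimeMs.update(System.currentTimeMillis() - enqueueTimeMs);
    }
}
```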

Jun

On Thu, Apr 27, 2017 at 11:39 AM, Joel Koshy  wrote:

> Thanks for the KIP - couple of comments:
> - Do you intend to actually use yammer metrics? or use kafka-metrics and
> split the timer into an explicit rate and time? I think long term we ought
> to move off yammer and use kafka-metrics only. Actually either is fine, but
> we should ideally use only one in the long term - and I thought the plan
> was to use kafka-metrics.
> - metric #9 appears to be redundant since we already have per-API request
> rate and time metrics.
> - Same for metric #4, #5 (as there are request stats for
> DeleteTopicRequest - although it is possible for users to trigger deletes
> via ZK)
> - metric #2, #3 are potentially useful, but a bit overkill for a
> histogram. Alternative is to stick to last known value, but that doesn't
> play well with alerts if a high value isn't reset/decayed. Perhaps metric
> #1 would be sufficient to gauge slow start/resignation transitions.
> - metric #1 - some of the states may actually overlap
> - I don't actually understand the semantics of metric #6. Is it rate of
> partition reassignment triggers? does the number of partitions matter?
>
> Joel
>
> On Thu, Apr 27, 2017 at 8:04 AM, Tom Crayford 
> wrote:
>
>> Ismael,
>>
>> Great, that sounds lovely.
>>
>> I'd like a `Timer` (using yammer metrics parlance) over how long it took
>> to
>> process the event, so we can get at p99 and max times spent processing
>> things. Maybe we could even do a log at warning level if event processing
>> takes over some timeout?
>>
>> Thanks
>>
>> Tom
>>
>> On Thu, Apr 27, 2017 at 3:59 PM, Ismael Juma  wrote:
>>
>> > Hi Tom,
>> >
>> > Yes, the plan is to merge KAFKA-5028 first and then use a lock-free
>> > approach for the new  metrics. I considered mentioning that in the KIP
>> > given KAFKA-5120, but didn't in the end. I'll add it to make it clear.
>> >
>> > Regarding locks, they are removed by KAFKA-5028, as you say. So, if I
>> > understand correctly, you are suggesting an event processing rate metric
>> > with event type as a tag? Onur and Jun, what do you think?
>> >
>> > Ismael
>> >
>> > On Thu, Apr 27, 2017 at 3:47 PM, Tom Crayford 
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > We (Heroku) are very excited about this KIP, as we've struggled a bit
>> > with
>> > > controller stability recently. Having these additional metrics would
>> be
>> > > wonderful.
>> > >
>> > > I'd like to ensure polling these metrics *doesn't* hold any locks etc,
>> > > because, as noted in https://issues.apache.org/jira/browse/KAFKA-5120
>> ,
>> > > that
>> > > lock can be held for quite some time. This may become not an issue as
>> of
>> > > KAFKA-5028 though.
>> > >
>> > > Lastly, I'd love to see some metrics around how long the controller
>> > spends
>> > > inside its lock. We've been tracking an issue (
>> > > https://issues.apache.org/jira/browse/KAFKA-5116) where it can hold
>> the
>> > > lock for many, many minutes in a zk client list

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-04-27 Thread Joel Koshy
Thanks for the KIP - couple of comments:
- Do you intend to actually use yammer metrics? or use kafka-metrics and
split the timer into an explicit rate and time? I think long term we ought
to move off yammer and use kafka-metrics only. Actually either is fine, but
we should ideally use only one in the long term - and I thought the plan
was to use kafka-metrics.
- metric #9 appears to be redundant since we already have per-API request
rate and time metrics.
- Same for metric #4, #5 (as there are request stats for DeleteTopicRequest
- although it is possible for users to trigger deletes via ZK)
- metric #2, #3 are potentially useful, but a bit overkill for a histogram.
Alternative is to stick to last known value, but that doesn't play well
with alerts if a high value isn't reset/decayed. Perhaps metric #1 would be
sufficient to gauge slow start/resignation transitions.
- metric #1 - some of the states may actually overlap
- I don't actually understand the semantics of metric #6. Is it rate of
partition reassignment triggers? does the number of partitions matter?

Joel

On Thu, Apr 27, 2017 at 8:04 AM, Tom Crayford  wrote:

> Ismael,
>
> Great, that sounds lovely.
>
> I'd like a `Timer` (using yammer metrics parlance) over how long it took to
> process the event, so we can get at p99 and max times spent processing
> things. Maybe we could even do a log at warning level if event processing
> takes over some timeout?
>
> Thanks
>
> Tom
>
> On Thu, Apr 27, 2017 at 3:59 PM, Ismael Juma  wrote:
>
> > Hi Tom,
> >
> > Yes, the plan is to merge KAFKA-5028 first and then use a lock-free
> > approach for the new  metrics. I considered mentioning that in the KIP
> > given KAFKA-5120, but didn't in the end. I'll add it to make it clear.
> >
> > Regarding locks, they are removed by KAFKA-5028, as you say. So, if I
> > understand correctly, you are suggesting an event processing rate metric
> > with event type as a tag? Onur and Jun, what do you think?
> >
> > Ismael
> >
> > On Thu, Apr 27, 2017 at 3:47 PM, Tom Crayford 
> > wrote:
> >
> > > Hi,
> > >
> > > We (Heroku) are very excited about this KIP, as we've struggled a bit
> > with
> > > controller stability recently. Having these additional metrics would be
> > > wonderful.
> > >
> > > I'd like to ensure polling these metrics *doesn't* hold any locks etc,
> > > because, as noted in https://issues.apache.org/jira/browse/KAFKA-5120,
> > > that
> > > lock can be held for quite some time. This may become not an issue as
> of
> > > KAFKA-5028 though.
> > >
> > > Lastly, I'd love to see some metrics around how long the controller
> > spends
> > > inside its lock. We've been tracking an issue (
> > > https://issues.apache.org/jira/browse/KAFKA-5116) where it can hold
> the
> > > lock for many, many minutes in a zk client listener thread when
> > responding
> > > to a single request. I'm not sure how that plays into
> > > https://issues.apache.org/jira/browse/KAFKA-5028 (which I assume will
> > land
> > > before this metrics patch), but it feels like there will be equivalent
> > > problems ("how long does it spend processing any individual message
> from
> > > the queue, broken down by message type").
> > >
> > > These are minor improvements though, the addition of more metrics to
> the
> > > controller is already going to be very helpful.
> > >
> > > Thanks
> > >
> > > Tom Crayford
> > > Heroku Kafka
> > >
> > > On Thu, Apr 27, 2017 at 3:10 PM, Ismael Juma 
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We've posted "KIP-143: Controller Health Metrics" for discussion:
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > 143%3A+Controller+Health+Metrics
> > > >
> > > > Please take a look. Your feedback is appreciated.
> > > >
> > > > Thanks,
> > > > Ismael
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-04-27 Thread Tom Crayford
Ismael,

Great, that sounds lovely.

I'd like a `Timer` (using yammer metrics parlance) over how long it took to
process the event, so we can get at p99 and max times spent processing
things. Maybe we could even do a log at warning level if event processing
takes over some timeout?
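
[Editor's note: a minimal Java sketch of this idea against the Yammer 2.x API follows. The metric name, class name, and the 5-second warning threshold are assumptions made for the sketch, not anything specified by the KIP.]

```java
import com.yammer.metrics.Metrics;
import com.yammer.metrics.core.MetricName;
import com.yammer.metrics.core.Timer;
import com.yammer.metrics.core.TimerContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.concurrent.TimeUnit;

// Illustrative sketch only; names and threshold are hypothetical.
public class EventProcessingTimerSketch {

    private static final Logger log = LoggerFactory.getLogger(EventProcessingTimerSketch.class);
    private static final long WARN_THRESHOLD_MS = 5_000L; // hypothetical timeout

    // A Yammer Timer records both a rate and a latency histogram (p99, max, ...).
    private final Timer eventProcessTime = Metrics.newTimer(
            new MetricName("kafka.controller", "ControllerStats", "EventProcessTimeMs"),
            TimeUnit.MILLISECONDS, TimeUnit.SECONDS);

    void process(Runnable event) {
        long startMs = System.currentTimeMillis();
        TimerContext ctx = eventProcessTime.time();
        try {
            event.run();
        } finally {
            ctx.stop();
            long elapsedMs = System.currentTimeMillis() - startMs;
            if (elapsedMs > WARN_THRESHOLD_MS) {
                log.warn("Controller event took {} ms to process", elapsedMs);
            }
        }
    }
}
```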

Thanks

Tom

On Thu, Apr 27, 2017 at 3:59 PM, Ismael Juma  wrote:

> Hi Tom,
>
> Yes, the plan is to merge KAFKA-5028 first and then use a lock-free
> approach for the new  metrics. I considered mentioning that in the KIP
> given KAFKA-5120, but didn't in the end. I'll add it to make it clear.
>
> Regarding locks, they are removed by KAFKA-5028, as you say. So, if I
> understand correctly, you are suggesting an event processing rate metric
> with event type as a tag? Onur and Jun, what do you think?
>
> Ismael
>
> On Thu, Apr 27, 2017 at 3:47 PM, Tom Crayford 
> wrote:
>
> > Hi,
> >
> > We (Heroku) are very excited about this KIP, as we've struggled a bit
> with
> > controller stability recently. Having these additional metrics would be
> > wonderful.
> >
> > I'd like to ensure polling these metrics *doesn't* hold any locks etc,
> > because, as noted in https://issues.apache.org/jira/browse/KAFKA-5120,
> > that
> > lock can be held for quite some time. This may become not an issue as of
> > KAFKA-5028 though.
> >
> > Lastly, I'd love to see some metrics around how long the controller
> spends
> > inside its lock. We've been tracking an issue (
> > https://issues.apache.org/jira/browse/KAFKA-5116) where it can hold the
> > lock for many, many minutes in a zk client listener thread when
> responding
> > to a single request. I'm not sure how that plays into
> > https://issues.apache.org/jira/browse/KAFKA-5028 (which I assume will
> land
> > before this metrics patch), but it feels like there will be equivalent
> > problems ("how long does it spend processing any individual message from
> > the queue, broken down by message type").
> >
> > These are minor improvements though, the addition of more metrics to the
> > controller is already going to be very helpful.
> >
> > Thanks
> >
> > Tom Crayford
> > Heroku Kafka
> >
> > On Thu, Apr 27, 2017 at 3:10 PM, Ismael Juma  wrote:
> >
> > > Hi all,
> > >
> > > We've posted "KIP-143: Controller Health Metrics" for discussion:
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 143%3A+Controller+Health+Metrics
> > >
> > > Please take a look. Your feedback is appreciated.
> > >
> > > Thanks,
> > > Ismael
> > >
> >
>


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-04-27 Thread Ismael Juma
Hi Tom,

Yes, the plan is to merge KAFKA-5028 first and then use a lock-free
approach for the new  metrics. I considered mentioning that in the KIP
given KAFKA-5120, but didn't in the end. I'll add it to make it clear.

Regarding locks, they are removed by KAFKA-5028, as you say. So, if I
understand correctly, you are suggesting an event processing rate metric
with event type as a tag? Onur and Jun, what do you think?
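
[Editor's note: one possible reading of "event type as a tag" is sketched below in Java against the Yammer 2.x API. Plain Yammer MetricNames have no first-class tag support, so this sketch folds the event type into the metric scope, keeping one Meter per event type; Kafka's own metric registration handles tags differently, and all names here are invented for illustration.]

```java
import com.yammer.metrics.Metrics;
import com.yammer.metrics.core.Meter;
import com.yammer.metrics.core.MetricName;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Sketch only: event type is encoded in the MetricName scope.
public class PerEventTypeRateSketch {

    private final Map<String, Meter> metersByEventType = new ConcurrentHashMap<>();

    // eventType would be something like "BrokerChange" or "TopicChange".
    void markProcessed(String eventType) {
        Meter meter = metersByEventType.computeIfAbsent(eventType, type ->
                Metrics.newMeter(
                        new MetricName("kafka.controller", "ControllerStats",
                                "EventProcessedPerSec", type),
                        "events", TimeUnit.SECONDS));
        meter.mark();
    }
}
```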

Ismael

On Thu, Apr 27, 2017 at 3:47 PM, Tom Crayford  wrote:

> Hi,
>
> We (Heroku) are very excited about this KIP, as we've struggled a bit with
> controller stability recently. Having these additional metrics would be
> wonderful.
>
> I'd like to ensure polling these metrics *doesn't* hold any locks etc,
> because, as noted in https://issues.apache.org/jira/browse/KAFKA-5120,
> that
> lock can be held for quite some time. This may become not an issue as of
> KAFKA-5028 though.
>
> Lastly, I'd love to see some metrics around how long the controller spends
> inside its lock. We've been tracking an issue (
> https://issues.apache.org/jira/browse/KAFKA-5116) where it can hold the
> lock for many, many minutes in a zk client listener thread when responding
> to a single request. I'm not sure how that plays into
> https://issues.apache.org/jira/browse/KAFKA-5028 (which I assume will land
> before this metrics patch), but it feels like there will be equivalent
> problems ("how long does it spend processing any individual message from
> the queue, broken down by message type").
>
> These are minor improvements though, the addition of more metrics to the
> controller is already going to be very helpful.
>
> Thanks
>
> Tom Crayford
> Heroku Kafka
>
> On Thu, Apr 27, 2017 at 3:10 PM, Ismael Juma  wrote:
>
> > Hi all,
> >
> > We've posted "KIP-143: Controller Health Metrics" for discussion:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 143%3A+Controller+Health+Metrics
> >
> > Please take a look. Your feedback is appreciated.
> >
> > Thanks,
> > Ismael
> >
>


Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-04-27 Thread Tom Crayford
Hi,

We (Heroku) are very excited about this KIP, as we've struggled a bit with
controller stability recently. Having these additional metrics would be
wonderful.

I'd like to ensure polling these metrics *doesn't* hold any locks etc,
because, as noted in https://issues.apache.org/jira/browse/KAFKA-5120, that
lock can be held for quite some time. This may become not an issue as of
KAFKA-5028 though.

Lastly, I'd love to see some metrics around how long the controller spends
inside its lock. We've been tracking an issue (
https://issues.apache.org/jira/browse/KAFKA-5116) where it can hold the
lock for many, many minutes in a zk client listener thread when responding
to a single request. I'm not sure how that plays into
https://issues.apache.org/jira/browse/KAFKA-5028 (which I assume will land
before this metrics patch), but it feels like there will be equivalent
problems ("how long does it spend processing any individual message from
the queue, broken down by message type").

These are minor improvements though, the addition of more metrics to the
controller is already going to be very helpful.

Thanks

Tom Crayford
Heroku Kafka

On Thu, Apr 27, 2017 at 3:10 PM, Ismael Juma  wrote:

> Hi all,
>
> We've posted "KIP-143: Controller Health Metrics" for discussion:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 143%3A+Controller+Health+Metrics
>
> Please take a look. Your feedback is appreciated.
>
> Thanks,
> Ismael
>


[DISCUSS] KIP-143: Controller Health Metrics

2017-04-27 Thread Ismael Juma
Hi all,

We've posted "KIP-143: Controller Health Metrics" for discussion:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-143%3A+Controller+Health+Metrics

Please take a look. Your feedback is appreciated.

Thanks,
Ismael