Re: Re[2]: Partition map exchange metrics

Nikolay Izhikov Wed, 24 Jul 2019 03:35:17 -0700

Nikita

What is the difference between those two metrics?


Wed, 24 Jul 2019, 12:45 Nikita Amelchev <[email protected]>:

> Igniters, thanks for the comments.
>
> From the discussion it can be seen that we need only two metrics for now:
> - cacheOperationsBlockedDuration (long)
> - totalCacheOperationsBlockedDuration (long)
>
> I will prepare a PR shortly.
>
> Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <[email protected]>:
> >
> > +1 to Anton's decisions.
> >
> >
> > >Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <[email protected]>:
> > >
> > >Folks,
> > >
> > >It looks like we're trying to implement "extended debug" instead of
> > >"monitoring".
> > >A real admin should not care which phase of PME is in
> > >progress and so on.
> > >The metrics of interest are
> > >- total blocked time (will be used for real SLA counting)
> > >- are we blocked right now (shows we have an SLA degradation right now)
> > >Duration of the current blocking period can be easily presented using any
> > >modern monitoring tool by regular checks.
> > >The initial true will mean "period start"; precision will be a result of
> > >the check frequency.
> > >Anyway, I'm ok to have the current metric presented as a long, where the long is a
> > >duration; I see no reason for it, but ok :)
> > >
> > >All other features you mentioned are useful for improving code or
> > >deployment and can (should) be taken from logs at the analysis
> > >phase.
> > >
> > >On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < [email protected] > wrote:
> > >
> > >> Folks, let me step in.
> > >>
> > >> Nikita, thanks for your suggestions!
> > >>
> > >> > 1. initialVersion. Topology version that initiates the exchange.
> > >> > 2. initTime. Time PME was started.
> > >> > 3. initEvent. Event that triggered PME.
> > >> > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > >> > updates and transactions on a previous topology.
> > >> > 5. sendSingleMessageTime. Time when a node sent a single message.
> > >> > 6. recieveFullMessageTime. Time when a node received a full message.
> > >> > 7. finishTime. Time PME was ended.
> > >> >
> > >> > When a new PME starts, all these metrics are reset.
> > >> Every metric from Nikita's list looks useful and simple to implement.
> > >> I think that it would be better to change the format of metrics 4, 5, 6 and
> > >> 7 a bit: we can keep only the difference between the time of the previous
> > >> event and the time of the corresponding event. Such metrics would be easier
> > >> to perceive: they answer specific questions such as "how much time did
> > >> partition release take?" or "how much time did awaiting of distributed
> > >> phase end take?".
> > >> Also, if the results of 4, 5, 6, 7 are exported to a monitoring system,
> > >> graphs will show how the times of different stages change from one PME to
> > >> another.
> > >>
> > >> > When PME causes no blocking, it's a good PME and I see no reason to have
> > >> > monitoring related to it
> > >> Agree with Anton here. These metrics should be measured only for true
> > >> distributed exchange. Saving results for client leave/join PMEs will
> > >> just complicate monitoring.
> > >>
> > >> > I agree with the total blocking duration metric, but
> > >> > I still don't understand why the instant value indicating that operations are
> > >> > blocked should be boolean.
> > >> > The duration since blocking started looks more appropriate and useful.
> > >> > It gives more information while the semantics stay the same.
> > >> Totally agree with Pavel here. Both "accumulated block time" and
> > >> "current PME block time" metrics are useful. Growth of the accumulated
> > >> metric over a specific period of time (should be easy to check via a
> > >> monitoring system graph) will show for how long business operations were
> > >> blocked in total, and a non-zero current metric will show that we are
> > >> experiencing issues right now. The boolean metric "are we blocked right now"
> > >> is not needed, as it obviously can be inferred from "current PME block
> > >> time".
> > >>
> > >> Best Regards,
> > >> Ivan Rakov
> > >>
> > >> On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > >> > Nikita,
> > >> >
> > >> > I agree with the total blocking duration metric, but
> > >> > I still don't understand why the instant value indicating that operations are
> > >> > blocked should be boolean.
> > >> > The duration since blocking started looks more appropriate and useful.
> > >> > It gives more information while the semantics stay the same.
> > >> >
> > >> >
> > >> >
> > >> > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev < [email protected] >:
> > >> >
> > >> >> Folks,
> > >> >>
> > >> >> All previous suggestions have some disadvantages. There can be several
> > >> >> exchanges between two metric updates, and a fast exchange can overwrite
> > >> >> a previous long exchange.
> > >> >>
> > >> >> We can introduce a metric of total blocking duration that will
> > >> >> accumulate at the end of the exchange. So, users will get actual
> > >> >> information about how long operations were blocked. The cluster metric
> > >> >> will be the maximum of the local node metrics. And we need a boolean metric
> > >> >> that will indicate the realtime status. It is needed because the duration
> > >> >> metric is updated only at the end of the exchange.
> > >> >>
> > >> >> So I propose to change the current metric, which is not released yet, to the
> > >> >> totalCacheOperationsBlockingDuration metric and to add the
> > >> >> isCacheOperationsBlocked metric.
> > >> >>
> > >> >> WDYT?
> > >> >>
> > >> >> Mon, 22 Jul 2019 at 09:27, Anton Vinogradov < [email protected] >:
> > >> >>> Nikolay,
> > >> >>>
> > >> >>> Still see no reason to replace boolean with long.
> > >> >>>
> > >> >>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov < [email protected] > wrote:
> > >> >>>> Anton.
> > >> >>>>
> > >> >>>> 1. The value is exported based on SPI settings, not at the moment it changes.
> > >> >>>>
> > >> >>>> 2. Clock synchronisation - if we export the start time, we should also export
> > >> >>>> the node-local timestamp.
> > >> >>>>
> > >> >>>> Mon, 22 Jul 2019, 8:33 Anton Vinogradov < [email protected] >:
> > >> >>>>
> > >> >>>>> Folks,
> > >> >>>>>
> > >> >>>>> What's the reason for duration counting?
> > >> >>>>> AFAIU, it's a monitoring system feature to count the durations.
> > >> >>>>> Since the monitoring system checks metrics periodically, it will know the
> > >> >>>>> duration from its own log.
> > >> >>>>>
> > >> >>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko < [email protected] > wrote:
> > >> >>>>>
> > >> >>>>>> Nikita,
> > >> >>>>>>
> > >> >>>>>> Yes, I mean duration, not timestamp. For the metric name, I suggest
> > >> >>>>>> "cacheOperationsBlockingDuration"; I think it more clearly represents what is
> > >> >>>>>> blocked during PME.
> > >> >>>>>> We can also combine both the timestamp "cacheOperationsBlockingStartTs" and
> > >> >>>>>> the duration to better correlate when cache operations were blocked and
> > >> >>>>>> how much time it took.
> > >> >>>>>> For an instant view (like in a JMX bean), a calculated value as you mentioned
> > >> >>>>>> can be used.
> > >> >>>>>> For metrics that are exported to some backend (IEP-35), a counter can be used.
> > >> >>>>>> The counter is incremented by the blocking time after blocking has ended.
> > >> >>>>>> Fri, 19 Jul 2019 at 19:10, Nikita Amelchev < [email protected] >:
> > >> >>>>>>> Pavel,
> > >> >>>>>>>
> > >> >>>>>>> The main purpose of this metric is
> > >> >>>>>>>>> how much time we wait for resuming cache operations
> > >> >>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
> > >> >>>>>>>>> What do you think if we change the boolean value of metric to a long
> > >> >>>>>>>>> value that represents time in milliseconds when operations were blocked?
> > >> >>>>>>> This time can be calculated as (currentTime -
> > >> >>>>>>> timeSinceOperationsBlocked) in case of a timestamp.
> > >> >>>>>>>
> > >> >>>>>>> A duration will be more understandable. It'll be something like
> > >> >>>>>>> getCurrentBlockingPmeDuration. But I haven't come up with a better
> > >> >>>>>>> name yet.
> > >> >>>>>>>
> > >> >>>>>>> Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko < [email protected] >:
> > >> >>>>>>>> Nikita,
> > >> >>>>>>>>
> > >> >>>>>>>> I think getCurrentPmeDuration doesn't show useful information. The main
> > >> >>>>>>>> PME side effect for end-users is blocking cache operations. Not all of the
> > >> >>>>>>>> PME time blocks them.
> > >> >>>>>>>> What information does the "timeSinceOperationsBlocked" timestamp give to
> > >> >>>>>>>> an end-user? What analysis can it be used for, and how?
> > >> >>>>>>>> Fri, 19 Jul 2019 at 17:48, Nikita Amelchev < [email protected] >:
> > >> >>>>>>>>> Hi Pavel,
> > >> >>>>>>>>>
> > >> >>>>>>>>> This time can already be obtained from the getCurrentPmeDuration and
> > >> >>>>>>>>> new isOperationsBlockedByPme metrics.
> > >> >>>>>>>>>
> > >> >>>>>>>>> As an alternative solution, I can rework the recently added
> > >> >>>>>>>>> getCurrentPmeDuration metric (not released yet). It seems useless for
> > >> >>>>>>>>> users in the case of a non-blocking PME.
> > >> >>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
> > >> >>>>>>>>> blocking started (the minimal value across cluster nodes) and 0 if
> > >> >>>>>>>>> blocking has ended (there is no running PME).
> > >> >>>>>>>>>
> > >> >>>>>>>>> WDYT?
> > >> >>>>>>>>>
> > >> >>>>>>>>> Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko < [email protected] >:
> > >> >>>>>>>>>> Hi Nikita,
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Thank you for working on this. What do you think if we change the boolean
> > >> >>>>>>>>>> value of the metric to a long value that represents time in milliseconds
> > >> >>>>>>>>>> when operations were blocked?
> > >> >>>>>>>>>> Since we have not only JMX and metrics are now periodically exported to
> > >> >>>>>>>>>> some backend, it can give a clearer picture of how much time we wait for
> > >> >>>>>>>>>> cache operations to resume than an instant boolean indicator.
> > >> >>>>>>>>>> Fri, 19 Jul 2019 at 14:41, Nikita Amelchev < [email protected] >:
> > >> >>>>>>>>>>> Anton, Nikolay,
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> Thanks for the support.
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric, which does not
> > >> >>>>>>>>>>> correctly show the influence on the cluster. A PME can proceed without
> > >> >>>>>>>>>>> blocking operations, for example, on client node join/leave events.
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> I suggest adding a new metric: isOperationsBlockedByPme(). Together, these
> > >> >>>>>>>>>>> metrics will show the influence of the PME on the cluster and user
> > >> >>>>>>>>>>> operations.
> > >> >>>>>>>>>>> I have prepared a PR for this (Bot visa is green). [1] Can anyone take a
> > >> >>>>>>>>>>> look?
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-11961
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov < [email protected] >:
> > >> >>>>>>>>>>>> I think an administrator of an Ignite cluster should be able to monitor
> > >> >>>>>>>>>>>> all Ignite processes, including non-blocking PME.
> > >> >>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > >> >>>>>>>>>>>>> BTW,
> > >> >>>>>>>>>>>>> Found a PME metric - getCurrentPmeDuration().
> > >> >>>>>>>>>>>>> It seems to show exactly the PME time and is not so useful because of this.
> > >> >>>>>>>>>>>>> The goal is to show exactly the blocking period.
> > >> >>>>>>>>>>>>> When PME causes no blocking, it's a good PME and I see no reason to have
> > >> >>>>>>>>>>>>> monitoring related to it :)
> > >> >>>>>>>>>>>>>
> > >> >>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov < [email protected] > wrote:
> > >> >>>>>>>>>>>>>> Anton.
> > >> >>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>> Why do we need to postpone the implementation of these metrics?
> > >> >>>>>>>>>>>>>> For now, implementing a new metric is very simple.
> > >> >>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>> I think we can implement these metrics as a single contribution.
> > >> >>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > >> >>>>>>>>>>>>>>> Nikita,
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>> Looks like all we need now is one simple metric: are operations blocked?
> > >> >>>>>>>>>>>>>>> Just a true or false.
> > >> >>>>>>>>>>>>>>> Let's start from this.
> > >> >>>>>>>>>>>>>>> All other metrics can be extracted from logs now and can be implemented
> > >> >>>>>>>>>>>>>>> later.
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov < [email protected] > wrote:
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> +1.
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> Nikita, please, go ahead.
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> Tue, 16 Jul 2019, 11:45 Nikita Amelchev < [email protected] >:
> > >> >>>>>>>>>>>>>>>>> Hello, Igniters.
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> I suggest adding some useful metrics about the partition map exchange
> > >> >>>>>>>>>>>>>>>>> (PME). For now, the duration of PME stages is available only in log files
> > >> >>>>>>>>>>>>>>>>> and cannot be obtained using JMX or other external tools. [1]
> > >> >>>>>>>>>>>>>>>>> I made a list of local node metrics that help to understand the
> > >> >>>>>>>>>>>>>>>>> actual status of the current PME:
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that initiates the exchange.
> > >> >>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
> > >> >>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
> > >> >>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting for all
> > >> >>>>>>>>>>>>>>>>> updates and transactions on a previous topology.
> > >> >>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
> > >> >>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node received a full message.
> > >> >>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> When a new PME starts, all these metrics are reset.
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> These metrics help to understand:
> > >> >>>>>>>>>>>>>>>>> - how long PME was (current or previous).
> > >> >>>>>>>>>>>>>>>>> - how long the wait for all updates to complete took.
> > >> >>>>>>>>>>>>>>>>> - what node blocks PME (didn't send a single message)
> > >> >>>>>>>>>>>>>>>>> - what triggered PME.
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> Thoughts?
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > >> >>>>>>>>>>>>>>>>> --
> > >> >>>>>>>>>>>>>>>>> Best wishes,
> > >> >>>>>>>>>>>>>>>>> Amelchev Nikita
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> --
> > >> >>>>>>>>>>> Best wishes,
> > >> >>>>>>>>>>> Amelchev Nikita
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>> --
> > >> >>>>>>>>> Best wishes,
> > >> >>>>>>>>> Amelchev Nikita
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>> --
> > >> >>>>>>> Best wishes,
> > >> >>>>>>> Amelchev Nikita
> > >> >>>>>>>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> Best wishes,
> > >> >> Amelchev Nikita
> > >> >>
> > >>
> >
> >
> > --
> > Zhenya Stanilovsky
>
>
>
> --
> Best wishes,
> Amelchev Nikita
>
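
For readers of the thread, here is a minimal sketch of how the two metrics agreed on above could differ, assuming the reading used in the discussion: cacheOperationsBlockedDuration is the length of the blocking period currently in progress (0 when nothing is blocked), and totalCacheOperationsBlockedDuration is accumulated when a blocking period ends. The class name BlockedDurationTracker, the onBlockingStarted/onBlockingFinished hooks and the injected clock are hypothetical and are not Ignite APIs; this is an illustration, not the implementation from the PR.

```java
import java.util.function.LongSupplier;

/** Hypothetical tracker for illustration only; not the Ignite implementation. */
final class BlockedDurationTracker {
    /** Millisecond clock, injected so the sketch can be exercised deterministically. */
    private final LongSupplier clockMillis;

    /** Start time of the current blocking period, or -1 when operations are not blocked. */
    private long blockedSince = -1;

    /** Sum of all finished blocking periods, in milliseconds. */
    private long totalBlocked;

    BlockedDurationTracker(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    /** Assumed hook: called when an exchange starts blocking cache operations. */
    synchronized void onBlockingStarted() {
        if (blockedSince < 0)
            blockedSince = clockMillis.getAsLong();
    }

    /** Assumed hook: called when cache operations are resumed. */
    synchronized void onBlockingFinished() {
        if (blockedSince >= 0) {
            // The total is accumulated only at the end of the blocking period.
            totalBlocked += clockMillis.getAsLong() - blockedSince;
            blockedSince = -1;
        }
    }

    /** Duration of the blocking period in progress, or 0 if nothing is blocked. */
    synchronized long cacheOperationsBlockedDuration() {
        return blockedSince < 0 ? 0 : clockMillis.getAsLong() - blockedSince;
    }

    /** Accumulated duration of all finished blocking periods. */
    synchronized long totalCacheOperationsBlockedDuration() {
        return totalBlocked;
    }
}
```

In this sketch a non-zero current value plays the role of the boolean "are we blocked right now" indicator, while the growth of the total over a time window is what a monitoring system would graph for SLA accounting. Constructing the tracker with System::currentTimeMillis would give wall-clock readings.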
