Nikolay,

Okay, sounds reasonable.
I just want to add that currentPmeTime is also useful for alerting systems, not
only for observation by eye. If the time becomes too long and exceeds some
threshold, firing an appropriate alert can help to detect a critical
problem early.
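
To make the alerting idea concrete, here is a minimal sketch in Java. All
names in it (PmeMetricsSketch, record, bucketIndex, shouldAlert) are
hypothetical and are not Ignite's API or the HistogramMetric class. The bucket
bounds are the example values from the quoted discussion below, and the
1000 ms alert threshold is an assumption chosen only for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch; none of these names come from Ignite. */
final class PmeMetricsSketch {
    /** Upper bucket bounds in milliseconds (example values from this thread). */
    static final long[] BOUNDS_MS = {50, 250, 1000};

    /** One counter per bound plus an overflow bucket for 1000 ms and above. */
    private final long[] buckets = new long[BOUNDS_MS.length + 1];

    /** Records the duration of a finished blocking PME. */
    void record(long durationMs) {
        buckets[bucketIndex(durationMs)]++;
    }

    /** Returns the index of the first bound the duration is strictly below. */
    static int bucketIndex(long durationMs) {
        for (int i = 0; i < BOUNDS_MS.length; i++) {
            if (durationMs < BOUNDS_MS[i])
                return i;
        }
        return BOUNDS_MS.length;
    }

    /** Returns a copy of the bucket counters. */
    long[] snapshot() {
        return buckets.clone();
    }

    /** Alert rule on the current duration; 0 means nothing is blocked now. */
    static boolean shouldAlert(long currentDurationMs, long thresholdMs) {
        return currentDurationMs > thresholdMs;
    }

    public static void main(String[] args) {
        PmeMetricsSketch m = new PmeMetricsSketch();
        for (long d : new long[] {10, 120, 300, 4000})
            m.record(d);
        System.out.println(Arrays.toString(m.snapshot())); // [1, 1, 1, 1]
        System.out.println(shouldAlert(4000, 1000));       // true
    }
}
```

The histogram only says how many blocking PMEs fell into each range, so the
exact time of a slow one still has to come from logs, while the current
duration is what an alert rule can poll.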

On Thu, 25 Jul 2019 at 21:12, Nikolay Izhikov <nizhi...@apache.org> wrote:

> I think the exact time should be obtained from logs, shouldn't it?
>
>
> Thu, 25 Jul 2019, 20:00 Pavel Kovalenko <jokse...@gmail.com>:
>
> > Nikolay,
> >
> > Yes, I have had a chance to look at HistogramMetric and, moreover, reviewed it :) My
> > question was mostly about what exactly we will track in the histogram.
> > If we use a histogram, do you know how we can find the exact time, e.g. when a PME
> > with time > 1s happened?
> >
> > Thu, 25 Jul 2019 at 19:24, Nikolay Izhikov <nizhi...@apache.org>:
> >
> > > Pavel
> > >
> > > Have you had a chance to look at the HistogramMetric source?
> > > It is in master now.
> > > Looking at the source would be better than my explanation :)
> > >
> > > We should count PME processes that block operations for some amount of
> > > time. For example, [less than 50, less than 250, less than 1000, more than
> > > 1000] millis.
> > >
> > > Thu, 25 Jul 2019, 18:55 Pavel Kovalenko <jokse...@gmail.com>:
> > >
> > > > Nikolay,
> > > >
> > > > Could you please explain in more detail what the structure of the PME
> > > > histogram will be?
> > > >
> > > > Thu, 25 Jul 2019 at 11:56, Nikolay Izhikov <nizhi...@apache.org>:
> > > >
> > > > > Hello, Nikita.
> > > > >
> > > > > I think
> > > > >
> > > > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > > > all blocking durations that happen after node starts.
> > > > >
> > > > > No, we don't need it.
> > > > >
> > > > > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > > >
> > > > > Yes, we need it.
> > > > >
> > > > > > On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> > > > > > Igniters,
> > > > > >
> > > > > > > Everyone wants to see the cacheOperationsBlockedDuration metric that will
> > > > > > > show the current blocking duration, or 0 if there is no blocking right now.
> > > > > > >
> > > > > > > Do we need the following metrics? It seems one of them will be superfluous.
> > > > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > > > all blocking durations that happen after the node starts.
> > > > > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > > > > > The user will be able to configure bounds.
> > > > > >
> > > > > > > Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <nizhi...@apache.org>:
> > > > > > >
> > > > > > > Guys.
> > > > > > >
> > > > > > > > I think we should go with these 2 metrics:
> > > > > > > >
> > > > > > > >         * current PME duration (resets on finish)
> > > > > > > >                 This metric is required for alerting (or automatic
> > > > > > > >                 actions) on a long PME.
> > > > > > > >
> > > > > > > >         * PME duration histogram (value added to the metric on PME finish)
> > > > > > > >                 This metric is required for:
> > > > > > > >                         * Quick PME trend analysis
> > > > > > > >                         * Quick PME history analysis
> > > > > > >
> > > > > > >
> > > > > > > > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > > > > > > Nikita and Maxim,
> > > > > > > >
> > > > > > > > > > What if we just update the behaviour of the current getCurrentPmeDuration
> > > > > > > > > > metric to show durations only for blocking PMEs?
> > > > > > > > > > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > > > > > > >
> > > > > > > > > > No other changes will be required.
> > > > > > > > >
> > > > > > > > > WDYT?
> > > > > > > >
> > > > > > > > I agree with these two metrics. I also think that current
> > > > > > > > getCurrentPmeDuration will become redundant.
> > > > > > > >
> > > > > > > > Anton,
> > > > > > > >
> > > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > > "monitoring".
> > > > > > > > > > A real admin should not be interested in which phase of PME is in
> > > > > > > > > > progress, and so on.
> > > > > > > >
> > > > > > > > > PME is a mission-critical cluster process. I agree that there's a fine
> > > > > > > > > line between monitoring and debugging here. However, it's not good to add
> > > > > > > > > monitoring capabilities only for the scenario when everything is alright.
> > > > > > > > > If PME really hangs, a *real admin* will be extremely interested in how
> > > > > > > > > to return the cluster to a working state. Metrics about stage completion
> > > > > > > > > times may really help here: e.g. if one specific node hasn't completed
> > > > > > > > > stage X while the rest of the cluster has, it can be a signal that this node
> > > > > > > > > should be killed.
> > > > > > > > >
> > > > > > > > > Of course, it's possible to build a monitoring system that extracts this
> > > > > > > > > information from logs, but:
> > > > > > > > > - It's more resource intensive, as it requires parsing logs all the time
> > > > > > > > > - It's less reliable, as log messages may change
> > > > > > > >
> > > > > > > > Best Regards,
> > > > > > > > Ivan Rakov
> > > > > > > >
> > > > > > > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > > > > > > Folks,
> > > > > > > > >
> > > > > > > > > > +1 to Anton's post.
> > > > > > > > >
> > > > > > > > > > What if we just update the behaviour of the current getCurrentPmeDuration
> > > > > > > > > > metric to show durations only for blocking PMEs?
> > > > > > > > > > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > > > > > > >
> > > > > > > > > > No other changes will be required.
> > > > > > > > >
> > > > > > > > > WDYT?
> > > > > > > > >
> > > > > > > > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelc...@gmail.com> wrote:
> > > > > > > > > > Nikolay,
> > > > > > > > > >
> > > > > > > > > > > The cacheOperationsBlockedDuration metric will show the current blocking
> > > > > > > > > > > duration, or 0 if there is no blocking right now.
> > > > > > > > > > >
> > > > > > > > > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > > > > > > > > blocking durations that happen after the node starts.
> > > > > > > > > >
> > > > > > > > > > > Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <nizhi...@apache.org>:
> > > > > > > > > > > Nikita
> > > > > > > > > > >
> > > > > > > > > > > What is the difference between those two metrics?
> > > > > > > > > > >
> > > > > > > > > > > > Wed, 24 Jul 2019, 12:45 Nikita Amelchev <nsamelc...@gmail.com>:
> > > > > > > > > > >
> > > > > > > > > > > > Igniters, thanks for comments.
> > > > > > > > > > > >
> > > > > > > > > > > >  From the discussion it can be seen that we need only
> > two
> > > > > metrics for now:
> > > > > > > > > > > > - сacheOperationsBlockedDuration (long)
> > > > > > > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > > > > > > >
> > > > > > > > > > > > I will prepare PR at the nearest time.
> > > > > > > > > > > >
> > > > > > > > > > > > > Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas...@mail.ru.invalid>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > +1 to Anton's decisions.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <a...@apache.org>:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > > > > > > > "monitoring".
> > > > > > > > > > > > > > > A real admin should not be interested in which phase of PME is in
> > > > > > > > > > > > > > > progress, and so on.
> > > > > > > > > > > > > > > The metrics of interest are:
> > > > > > > > > > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > > > > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > > > > > > > > > The duration of the current blocking period can easily be presented by any
> > > > > > > > > > > > > > > modern monitoring tool through regular checks.
> > > > > > > > > > > > > > > The initial true will mean "period start"; precision will be a result of
> > > > > > > > > > > > > > > the check frequency.
> > > > > > > > > > > > > > > Anyway, I'm ok with having the current metric presented as a long, where the
> > > > > > > > > > > > > > > long is a duration. I see no reason for it, but ok :)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > All other features you mentioned are useful for improving code or
> > > > > > > > > > > > > > > deployment and can (should) be taken from logs at the analysis
> > > > > > > > > > > > > > > phase.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > > > > > > > > > > > > > > Folks, let me step in.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > > > > > > > > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Every metric from Nikita's list looks useful and simple to implement.
> > > > > > > > > > > > > > > > I think that it would be better to change the format of metrics 4, 5, 6 and
> > > > > > > > > > > > > > > > 7 a bit: we can keep only the difference between the time of the previous
> > > > > > > > > > > > > > > > event and the time of the corresponding event. Such metrics would be easier
> > > > > > > > > > > > > > > > to perceive: they answer specific questions such as "how much time did
> > > > > > > > > > > > > > > > partition release take?" or "how much time did awaiting the end of the
> > > > > > > > > > > > > > > > distributed phase take?".
> > > > > > > > > > > > > > > > Also, if the results of 4, 5, 6, 7 are exported to a monitoring system,
> > > > > > > > > > > > > > > > graphs will show how the times of different stages change from one PME to
> > > > > > > > > > > > > > > > another.
> > > > > > > > > > > > > > > > > When PME causes no blocking, it's a good PME and I see no reason to have
> > > > > > > > > > > > > > > > > monitoring related to it
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Agree with Anton here. These metrics should be measured only for true
> > > > > > > > > > > > > > > > distributed exchange. Saving results for client leave/join PMEs will
> > > > > > > > > > > > > > > > just complicate monitoring.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I agree with the total blocking duration metric, but
> > > > > > > > > > > > > > > > > I still don't understand why the instant value indicating that operations
> > > > > > > > > > > > > > > > > are blocked should be boolean.
> > > > > > > > > > > > > > > > > The duration since blocking started looks more appropriate and useful.
> > > > > > > > > > > > > > > > > It gives more information while the semantics stay the same.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Totally agree with Pavel here. Both the "accumulated block time" and
> > > > > > > > > > > > > > > > "current PME block time" metrics are useful. Growth of the accumulated
> > > > > > > > > > > > > > > > metric over a specific period of time (should be easy to check via a
> > > > > > > > > > > > > > > > monitoring system graph) will show for how long business operations were
> > > > > > > > > > > > > > > > blocked in total, and a non-zero current metric will show that we are
> > > > > > > > > > > > > > > > experiencing issues right now. A boolean "are we blocked right now" metric
> > > > > > > > > > > > > > > > is not needed, as it can obviously be inferred from "current PME block
> > > > > > > > > > > > > > > > time".
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > > > Ivan Rakov
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I agree with the total blocking duration metric, but
> > > > > > > > > > > > > > > > > I still don't understand why the instant value indicating that operations
> > > > > > > > > > > > > > > > > are blocked should be boolean.
> > > > > > > > > > > > > > > > > The duration since blocking started looks more appropriate and useful.
> > > > > > > > > > > > > > > > > It gives more information while the semantics stay the same.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <nsamelc...@gmail.com>:
> > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > All previous suggestions have some disadvantages. There can be several
> > > > > > > > > > > > > > > > > > exchanges between two metric updates, and a fast exchange can overwrite
> > > > > > > > > > > > > > > > > > a previous long exchange.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > We can introduce a metric of total blocking duration that will
> > > > > > > > > > > > > > > > > > accumulate at the end of the exchange. So, users will get actual
> > > > > > > > > > > > > > > > > > information about how long operations were blocked. The cluster metric
> > > > > > > > > > > > > > > > > > will be the maximum of the local node metrics. And we need a boolean
> > > > > > > > > > > > > > > > > > metric that will indicate the realtime status. It is needed because the
> > > > > > > > > > > > > > > > > > duration metric is updated at the end of the exchange.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > So I propose to change the current metric, which is not released yet, to
> > > > > > > > > > > > > > > > > > the totalCacheOperationsBlockingDuration metric and to add the
> > > > > > > > > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <a...@apache.org>:
> > > > > > > > > > > > > > > > > > Nikolay,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Still see no reason to replace boolean with long.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > 1. The value is exported based on SPI settings, not at the moment it changed.
> > > > > > > > > > > > > > > > > > > > > 2. Clock synchronisation - if we export the start time, we should also
> > > > > > > > > > > > > > > > > > > > > export the node-local timestamp.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Mon, 22 Jul 2019, 8:33 Anton Vinogradov <a...@apache.org>:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > > > > > > > > > > AFAIU, counting durations is a monitoring system feature.
> > > > > > > > > > > > > > > > > > > > > > Since the monitoring system checks metrics periodically, it will know the
> > > > > > > > > > > > > > > > > > > > > > duration from its own log.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokse...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Yes, I mean duration, not timestamp. For the metric name, I suggest
> > > > > > > > > > > > > > > > > > > > > > > "cacheOperationsBlockingDuration"; I think it more clearly represents what
> > > > > > > > > > > > > > > > > > > > > > > is blocked during PME.
> > > > > > > > > > > > > > > > > > > > > > > We can also combine both the timestamp "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > > > > > > > > > > the duration to better correlate when cache operations were blocked and
> > > > > > > > > > > > > > > > > > > > > > > how much time it took.
> > > > > > > > > > > > > > > > > > > > > > > For an instant view (like in a JMX bean), a calculated value as you
> > > > > > > > > > > > > > > > > > > > > > > mentioned can be used.
> > > > > > > > > > > > > > > > > > > > > > > For metrics that are exported to some backend (IEP-35), a counter can be
> > > > > > > > > > > > > > > > > > > > > > > used. The counter is incremented by the blocking time after blocking has
> > > > > > > > > > > > > > > > > > > > > > > ended.
> > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 19:10, Nikita
> > > > > Amelchev <
> > > > > > > > > > > >
> > > > > > > > > > > > nsamelc...@gmail.com
> > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > The main purpose of this metric
> is
> > > > > > > > > > > > > > > > > > > > > > > > how much time we wait for
> > > resuming
> > > > > cache operations
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you
> > > mean
> > > > > timestamp or duration
> > > > > > > > > > > >
> > > > > > > > > > > > here?
> > > > > > > > > > > > > > > > > > > > > > > > What do you think if we
> change
> > > the
> > > > > boolean value of metric
> > > > > > > > > > > >
> > > > > > > > > > > > to a
> > > > > > > > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > > > > > > > value that represents time in
> > > > > milliseconds when operations
> > > > > > > > > > > >
> > > > > > > > > > > > were
> > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > This time can be calculated as
> > > > > (currentTime -
> > > > > > > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in
> case
> > > of
> > > > > timestamp.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Duration will be more
> > understandable.
> > > > > It'll be something like
> > > > > > > > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration.
> But
> > I
> > > > > haven't come up with a
> > > > > > > > > > > >
> > > > > > > > > > > > better
> > > > > > > > > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 18:30,
> Pavel
> > > > > Kovalenko <
> > > > > > > > > > > >
> > > > > > > > > > > > jokse...@gmail.com
> > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration
> > > doesn't
> > > > > show useful
> > > > > > > > > > > >
> > > > > > > > > > > > information.
> > > > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > > > > main
> > > > > > > > > > > > > > > > > > > > > > PME side effect for end-users is
> > > > > blocking cache operations.
> > > > > > > > > > > >
> > > > > > > > > > > > Not
> > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > PME
> > > > > > > > > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > > > > > > > > What information gives to an
> > > end-user
> > > > > timestamp of
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For
> > > what
> > > > > analysis it can be
> > > > > > > > > > > >
> > > > > > > > > > > > used and
> > > > > > > > > > > > > > > > > > > how?
> > > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 17:48,
> > Nikita
> > > > > Amelchev <
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >   nsamelc...@gmail.com
> > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > This time already can be
> > obtained
> > > > > from the
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > getCurrentPmeDuration
> > > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme
> > > > metrics.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > As an alternative solution, I
> > can
> > > > > rework recently added
> > > > > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric
> > (not
> > > > > released yet). Seems for
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > users it
> > > > > > > > > > > > > > > > > > > > > > > > useless in case of
> non-blocking
> > > > PME.
> > > > > > > > > > > > > > > > > > > > > > > > Lets name it
> > > > > timeSinceOperationsBlocked. It'll be timestamp
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > > > blocking started (minimal
> value
> > > of
> > > > > cluster nodes) and 0 if
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > > > ends (there is no running
> PME).
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 15:56,
> > > Pavel
> > > > > Kovalenko <
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >   jokse...@gmail.com >:
> > > > > > > > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Thank you for working on
> > this.
> > > > > What do you think if we
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > change the
> > > > > > > > > > > > > > > > > > > > > > boolean
> > > > > > > > > > > > > > > > > > > > > > > > > value of metric to a long
> > value
> > > > > that represents time in
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > milliseconds
> > > > > > > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > Since we have not only JMX
> > and
> > > > now
> > > > > metrics are periodically
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > exported
> > > > > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > > > > > some backend it can give a
> > more
> > > > > clear picture of how much
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > time we
> > > > > > > > > > > > > > > > > > > > > > wait for
> > > > > > > > > > > > > > > > > > > > > > > > > resuming cache operations
> > > instead
> > > > > of instant boolean
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > indicator.
> > > > > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в
> 14:41,
> > > > > Nikita Amelchev <
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >   nsamelc...@gmail.com
> > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > For now, we have the
> > > > > getCurrentPmeDuration() metric that
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > does
> > > > > > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > > > > > > > > > influence on the cluster
> > > > > correctly. PME can be without
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > > > > > operations. For example,
> > > client
> > > > > node join/leave events.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > I suggest add new metric
> -
> > > > > isOperationsBlockedByPme().
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Together,
> > > > > > > > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > > > > > > > metrics will show
> influence
> > > of
> > > > > the PME on cluster and user
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > > > > > > > > > > > > > I have prepared PR for
> this
> > > > (Bot
> > > > > visa is green). [1] Can
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > > > > > take a
> > > > > > > > > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > вт, 16 июл. 2019 г. в
> > 14:58,
> > > > > Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >   nizhi...@apache.org
> > > > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > > I think administator of
> > > > Ignite
> > > > > cluster should be able to
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > monitor
> > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > > Ignite process, including
> > non
> > > > > blocking PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в
> 14:57
> > > > > +0300, Anton Vinogradov пишет:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Found PME metric -
> > > > > getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Seems, it shows
> exactly
> > > PME
> > > > > time and not so useful
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > because
> > > > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > The goal it so show
> > > exactly
> > > > > blocking period.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > When PME cause no
> > > blocking,
> > > > > it's a good PME and I see
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > > > > > reason to have
> > > > > > > > > > > > > > > > > > > > > > > > > > > > monitoring related to
> > it
> > > :)
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019
> at
> > > > 2:50
> > > > > PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >   nizhi...@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Why do we need to
> > > > postpone
> > > > > implementation of this
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > For now,
> > implementation
> > > > of
> > > > > new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think we can
> > > implement
> > > > > this metrics as a single
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в
> > > 13:47
> > > > > +0300, Anton Vinogradov
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > пишет:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we need now is one simple metric: are operations
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Let's start from this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > All other metrics can be extracted from logs now and can be
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > implemented later.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go ahead.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 11:45, Nikita Amelchev <nsamelc...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest adding some useful metrics about the partition map
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > exchange (PME). For now, the duration of PME stages is available
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > only in log files and cannot be obtained using JMX or other
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > external tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I made a list of local node metrics that help to understand the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > actual status of the current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > for all updates and transactions on a previous topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time PME ended.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics help to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long PME took (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long the node waited for all updates to complete.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - which node blocks PME (didn't send a single message).
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Zhenya Stanilovsky
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best wishes,
> > > > > > > > > > Amelchev Nikita
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
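
For reference, the timestamps proposed in the quoted thread (initTime, partitionReleaseTime, sendSingleMessageTime, recieveFullMessageTime, finishTime) could be combined roughly as below to answer the questions listed there, including the threshold check for alerting. This is only a sketch under assumptions: PmeTimings and all of its methods are hypothetical names, not Ignite classes or APIs, times are assumed to be epoch milliseconds, and 0 is assumed to mean "stage not reached yet".

```java
/**
 * Hypothetical holder for the per-exchange timestamps from the proposal.
 * Not an Ignite class. Field names mirror the metric names in the thread
 * (including the original spelling of recieveFullMessageTime).
 * Assumption: times are epoch millis, 0 means "stage not reached yet".
 */
final class PmeTimings {
    final long initTime;
    final long partitionReleaseTime;
    final long sendSingleMessageTime;
    final long recieveFullMessageTime;
    final long finishTime;

    PmeTimings(long initTime, long partitionReleaseTime, long sendSingleMessageTime,
               long recieveFullMessageTime, long finishTime) {
        this.initTime = initTime;
        this.partitionReleaseTime = partitionReleaseTime;
        this.sendSingleMessageTime = sendSingleMessageTime;
        this.recieveFullMessageTime = recieveFullMessageTime;
        this.finishTime = finishTime;
    }

    /** How long PME took; for a PME still in progress, measured up to 'now'. */
    long durationMillis(long now) {
        long end = finishTime > 0 ? finishTime : now;
        return end - initTime;
    }

    /** How long the node waited for updates on the previous topology, or -1 if still waiting. */
    long partitionReleaseWaitMillis() {
        return partitionReleaseTime > 0 ? partitionReleaseTime - initTime : -1;
    }

    /** True if PME has started, has not finished, and this node has not sent its single message. */
    boolean blocksExchange() {
        return initTime > 0 && finishTime == 0 && sendSingleMessageTime == 0;
    }

    /** Alerting check: has this PME run longer than the given threshold? */
    boolean exceeds(long now, long thresholdMillis) {
        return durationMillis(now) > thresholdMillis;
    }
}
```

A monitoring job could poll such values per node and fire an alert when exceeds(...) returns true; the threshold itself would have to be chosen per deployment.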
