Nikita and Maxim,

What if we just update the behaviour of the current getCurrentPmeDuration
metric to show durations only for blocking PMEs?
Keep it as a long value and rename it to getCacheOperationsBlockedDuration.

No other changes will be required.

WDYT?
I agree with these two metrics. I also think that the current getCurrentPmeDuration will become redundant.

Anton,

It looks like we're trying to implement "extended debug" instead of
"monitoring".
A real admin should not care which phase of PME is in
progress and so on.

PME is a mission-critical cluster process. I agree that there's a fine line between monitoring and debug here. However, it's not good to add monitoring capabilities only for the scenario where everything is all right. If PME really hangs, a *real admin* will be extremely interested in how to return the cluster to a working state. Metrics on stage completion times may really help here: e.g. if one specific node hasn't completed stage X while the rest of the cluster has, it can be a signal that this node should be killed.

Of course, it's possible to build a monitoring system that extracts this information from logs, but:
- It's more resource intensive, as it requires parsing logs all the time
- It's less reliable, as log messages may change

Best Regards,
Ivan Rakov

On 24.07.2019 14:57, Maxim Muzafarov wrote:
Folks,

+1 to Anton's post.

What if we just update the behaviour of the current getCurrentPmeDuration
metric to show durations only for blocking PMEs?
Keep it as a long value and rename it to getCacheOperationsBlockedDuration.

No other changes will be required.

WDYT?

On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelc...@gmail.com> wrote:
Nikolay,

The cacheOperationsBlockedDuration metric will show the current blocking
duration, or 0 if there is no blocking right now.

The totalCacheOperationsBlockedDuration metric will accumulate all
blocking durations that have happened since the node started.
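
The semantics of the two metrics described above can be sketched roughly as follows. This is only an illustration, not the code from the PR: the BlockedDurationTracker class, its method names and the explicit nowMs parameter are assumptions made for the example, and it assumes that the total counts only blocking periods that have already finished.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical tracker, for illustration only; not Ignite's implementation. */
class BlockedDurationTracker {
    /** Marker meaning "cache operations are not blocked right now". */
    private static final long NOT_BLOCKED = -1L;

    /** Start of the current blocking period (ms), or NOT_BLOCKED. */
    private final AtomicLong blockedSinceMs = new AtomicLong(NOT_BLOCKED);

    /** Sum of all finished blocking periods since node start (ms). */
    private final AtomicLong totalBlockedMs = new AtomicLong();

    /** Called when a PME starts blocking cache operations. */
    void onBlockingStarted(long nowMs) {
        // Keep the earliest start if blocking is already in progress.
        blockedSinceMs.compareAndSet(NOT_BLOCKED, nowMs);
    }

    /** Called when cache operations are resumed. */
    void onBlockingFinished(long nowMs) {
        long start = blockedSinceMs.getAndSet(NOT_BLOCKED);

        if (start != NOT_BLOCKED)
            totalBlockedMs.addAndGet(nowMs - start);
    }

    /** Current blocking duration, or 0 if there is no blocking right now. */
    long cacheOperationsBlockedDuration(long nowMs) {
        long start = blockedSinceMs.get();

        return start == NOT_BLOCKED ? 0L : nowMs - start;
    }

    /** Accumulated duration of all finished blocking periods. */
    long totalCacheOperationsBlockedDuration() {
        return totalBlockedMs.get();
    }
}
```

In a real node the timestamps would presumably come from the system clock; they are passed in here only to keep the sketch deterministic.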

Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <nizhi...@apache.org>:
Nikita

What is the difference between those two metrics?

Wed, 24 Jul 2019, 12:45 Nikita Amelchev <nsamelc...@gmail.com>:

Igniters, thanks for the comments.

From the discussion it can be seen that we need only two metrics for now:
- cacheOperationsBlockedDuration (long)
- totalCacheOperationsBlockedDuration (long)

I will prepare a PR shortly.

Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas...@mail.ru.invalid>:

+1 to Anton's decisions.


Wednesday, 24 July 2019, 8:44 +03:00, from Anton Vinogradov <a...@apache.org>:

Folks,

It looks like we're trying to implement "extended debug" instead of
"monitoring".
A real admin should not care which phase of PME is in
progress and so on.
The metrics of interest are
- total blocked time (will be used for real SLA counting)
- are we blocked right now (shows we have an SLA degradation right now)
The duration of the current blocking period can easily be derived by any
modern monitoring tool through regular checks.
The initial true will mean "period start"; the precision will be
determined by the check frequency.
Anyway, I'm OK with having the current metric presented as a long, where
the long is a duration. I see no reason for it, but OK :)
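
The "regular checks" approach described above can be sketched on the monitoring side roughly like this. It is only a sketch under assumptions: the BlockedFlagObserver class and its onSample method are hypothetical, and the samples are assumed to arrive in time order from a periodic poll of the boolean metric.

```java
/** Hypothetical monitoring-side observer, for illustration only. */
class BlockedFlagObserver {
    /** Time of the first 'true' sample of the current period (ms), or -1. */
    private long periodStartMs = -1L;

    /**
     * Feeds one periodic sample of the "are we blocked right now" flag.
     *
     * @return Estimated duration of the current blocking period in ms.
     *         The error is bounded by the polling interval.
     */
    long onSample(long nowMs, boolean blocked) {
        if (!blocked) {
            periodStartMs = -1L; // Blocking period is over (or never started).

            return 0L;
        }

        if (periodStartMs < 0)
            periodStartMs = nowMs; // The initial 'true' marks the period start.

        return nowMs - periodStartMs;
    }
}
```

A blocking period that starts and ends between two polls is never observed with this approach.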

All other features you mentioned are useful for improving code or
deployment and can (should) be taken from logs at the analysis
phase.

On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < ivan.glu...@gmail.com >
wrote:
Folks, let me step in.

Nikita, thanks for your suggestions!

1. initialVersion. Topology version that initiates the exchange.
2. initTime. Time PME was started.
3. initEvent. Event that triggered PME.
4. partitionReleaseTime. Time when a node has finished waiting for all
updates and transactions on a previous topology.
5. sendSingleMessageTime. Time when a node sent a single message.
6. recieveFullMessageTime. Time when a node received a full message.
7. finishTime. Time PME was ended.

When a new PME starts, all these metrics are reset.
Every metric from Nikita's list looks useful and simple to implement.
I think that it would be better to change the format of metrics 4, 5, 6
and 7 a bit: we can keep only the difference between the time of the
previous event and the time of the corresponding event. Such metrics would
be easier to perceive: they answer specific questions like "how much time
did partition release take?" or "how much time did awaiting the end of the
distributed phase take?".
Also, if the results of 4, 5, 6 and 7 are exported to a monitoring system,
graphs will show how the stage times change from one PME to another.
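
The conversion from absolute timestamps to per-stage durations described above could look roughly like this. It is a sketch only: the PmeStageDurations class and the duration key names are hypothetical, and each duration is assumed to be measured from the previous event in the list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, for illustration only; names are not from Ignite. */
class PmeStageDurations {
    /**
     * Converts absolute stage timestamps (ms) of one PME into durations (ms),
     * each measured from the previous event.
     */
    static Map<String, Long> fromTimestamps(
        long initTime,
        long partitionReleaseTime,
        long sendSingleMessageTime,
        long receiveFullMessageTime,
        long finishTime
    ) {
        // LinkedHashMap keeps the stages in the order they happen.
        Map<String, Long> durations = new LinkedHashMap<>();

        durations.put("partitionReleaseDuration", partitionReleaseTime - initTime);
        durations.put("sendSingleMessageDuration", sendSingleMessageTime - partitionReleaseTime);
        durations.put("awaitFullMessageDuration", receiveFullMessageTime - sendSingleMessageTime);
        durations.put("finishDuration", finishTime - receiveFullMessageTime);

        return durations;
    }
}
```

With durations instead of timestamps, each value can be plotted directly and compared from one PME to the next without any arithmetic on the monitoring side.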
When PME causes no blocking, it's a good PME and I see no reason to have
monitoring related to it
Agree with Anton here. These metrics should be measured only for true
distributed exchange. Saving results for client leave/join PMEs will
just complicate monitoring.

I agree with the total blocking duration metric, but
I still don't understand why the instant value indicating that operations
are blocked should be a boolean.
The duration since blocking started looks more appropriate and useful.
It gives more information while the semantics stay the same.
Totally agree with Pavel here. Both "accumulated block time" and
"current PME block time" metrics are useful. Growth of the accumulated
metric over a specific period of time (should be easy to check via a
monitoring system graph) will show for how long business operations were
blocked in total, and a non-zero current metric will show that we are
experiencing issues right now. A boolean metric "are we blocked right now"
is not needed, as it can obviously be inferred from "current PME block
time".

Best Regards,
Ivan Rakov

On 23.07.2019 16:02, Pavel Kovalenko wrote:
Nikita,

I agree with the total blocking duration metric, but
I still don't understand why the instant value indicating that operations
are blocked should be a boolean.
The duration since blocking started looks more appropriate and useful.
It gives more information while the semantics stay the same.



Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <nsamelc...@gmail.com>:
Folks,

All previous suggestions have some disadvantages. There can be several
exchanges between two metric updates, and a fast exchange can overwrite
a previous long exchange.

We can introduce a metric of total blocking duration that will
accumulate at the end of the exchange. So, users will get actual
information about how long operations were blocked. The cluster metric
will be the maximum of the local node metrics. And we need a boolean
metric that will indicate the realtime status. It is needed because the
duration metric is updated at the end of the exchange.

So I propose to change the current metric, which has not been released
yet, to the totalCacheOperationsBlockingDuration metric and to add the
isCacheOperationsBlocked metric.

WDYT?

Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <a...@apache.org>:
Nikolay,

Still see no reason to replace boolean with long.

On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
Anton.

1. The value is exported based on SPI settings, not at the moment it
changes.
2. Clock synchronisation - if we export the start time, we should also
export the node-local timestamp.

Mon, 22 Jul 2019, 8:33 Anton Vinogradov <a...@apache.org>:

Folks,

What's the reason for duration counting?
AFAIU, it's a monitoring system feature to count the durations.
Since the monitoring system checks metrics periodically, it will know
the duration from its own log.

On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokse...@gmail.com> wrote:

Nikita,

Yes, I mean duration, not timestamp. For the metric name, I suggest
"cacheOperationsBlockingDuration"; I think it more clearly represents what
is blocked during PME.
We can also combine both the timestamp "cacheOperationsBlockingStartTs"
and the duration to have a better correlation between when cache
operations were blocked and how much time it took.
For an instant view (like in a JMX bean), a calculated value as you
mentioned can be used.
For metrics that are exported to some backend (IEP-35), a counter can be
used.
The counter is incremented by the blocking time after blocking has ended.
Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <nsamelc...@gmail.com>:
Pavel,

The main purpose of this metric is
how much time we wait for resuming cache operations
Seems I misunderstood you. Do you mean timestamp or duration here?
What do you think if we change the boolean value of metric to a long
value that represents time in milliseconds when operations were blocked?
This time can be calculated as (currentTime -
timeSinceOperationsBlocked) in case of timestamp.

Duration will be more understandable. It'll be something like
getCurrentBlockingPmeDuration. But I haven't come up with a better
name yet.

Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <jokse...@gmail.com>:
Nikita,

I think getCurrentPmeDuration doesn't show useful information. The main
PME side effect for end users is blocking cache operations. Not all PME
time blocks them.
What information does the "timeSinceOperationsBlocked" timestamp give to
an end user? For what analysis can it be used, and how?
Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <nsamelc...@gmail.com>:
Hi Pavel,

This time can already be obtained from the getCurrentPmeDuration and
new isOperationsBlockedByPme metrics.

As an alternative solution, I can rework the recently added
getCurrentPmeDuration metric (not released yet). It seems useless for
users in the case of a non-blocking PME.
Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
blocking started (the minimal value among cluster nodes) and 0 if blocking
has ended (there is no running PME).

WDYT?

Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com>:
Hi Nikita,

Thank you for working on this. What do you think if we change the boolean
value of the metric to a long value that represents the time in
milliseconds when operations were blocked?
Since we have not only JMX, and metrics are now periodically exported to
some backend, it can give a clearer picture of how much time we wait for
cache operations to resume than an instant boolean indicator.
Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com>:
Anton, Nikolay,

Thanks for the support.

For now, we have the getCurrentPmeDuration() metric that does not show
the influence on the cluster correctly. PME can happen without blocking
operations, for example on client node join/leave events.

I suggest adding a new metric - isOperationsBlockedByPme(). Together,
these metrics will show the influence of PME on the cluster and on user
operations.
I have prepared a PR for this (Bot visa is green). [1] Can anyone take a
look?

[1]  https://issues.apache.org/jira/browse/IGNITE-11961

Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org>:
I think the administrator of an Ignite cluster should be able to monitor
all Ignite processes, including non-blocking PME.
On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
BTW,
Found a PME metric - getCurrentPmeDuration().
Seems it shows exactly the PME time and is not so useful because of this.
The goal is to show exactly the blocking period.
When PME causes no blocking, it's a good PME and I see no reason to have
monitoring related to it :)

On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
Anton.

Why do we need to postpone implementation of these metrics?
For now, implementation of a new metric is very simple.

I think we can implement these metrics as a single contribution.
On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
Nikita,

Looks like all we need now is one simple metric: are operations
blocked?
Just a true or false.
Let's start with this.
All other metrics can be extracted from logs now and can be implemented
later.

On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org> wrote:

+1.

Nikita, please, go ahead.


Tue, 16 Jul 2019, 11:45 Nikita Amelchev <nsamelc...@gmail.com>:
Hello, Igniters.

I suggest adding some useful metrics about the partition map exchange
(PME). For now, the durations of PME stages are available only in log
files and cannot be obtained using JMX or other external tools. [1]
I made a list of local node metrics that help to understand the actual
status of the current PME:

1. initialVersion. Topology version that initiates the exchange.
2. initTime. Time PME was started.
3. initEvent. Event that triggered PME.
4. partitionReleaseTime. Time when a node has finished waiting for all
updates and transactions on a previous topology.
5. sendSingleMessageTime. Time when a node sent a single message.
6. recieveFullMessageTime. Time when a node received a full message.
7. finishTime. Time PME was ended.

When a new PME starts, all these metrics are reset.

These metrics help to understand:
- how long PME was (current or previous).
- how long waiting for all updates to complete took.
- what node blocks PME (didn't send a single message)
- what triggered PME.

Thoughts?

[1] https://issues.apache.org/jira/browse/IGNITE-11961
--
Best wishes,
Amelchev Nikita

--
Zhenya Stanilovsky