Re: Partition map exchange metrics

Ivan Rakov Tue, 23 Jul 2019 09:22:35 -0700

Folks, let me step in.

Nikita, thanks for your suggestions!

1. initialVersion. Topology version that initiates the exchange.
2. initTime. Time PME was started.
3. initEvent. Event that triggered PME.
4. partitionReleaseTime. Time when a node has finished waiting for all
updates and transactions on a previous topology.
5. sendSingleMessageTime. Time when a node sent a single message.
6. recieveFullMessageTime. Time when a node received a full message.
7. finishTime. Time PME was ended.

When a new PME starts, all these metrics are reset.

Every metric from Nikita's list looks useful and simple to implement.

I think it would be better to change the format of metrics 4, 5, 6 and
7 a bit: we can keep only the difference between the time of the
previous event and the time of the corresponding event. Such metrics
would be easier to perceive: they answer specific questions like "how
much time did partition release take?" or "how much time did awaiting
the end of the distributed phase take?". Also, if the results of 4, 5,
6 and 7 are exported to a monitoring system, graphs will show how the
times of different stages change from one PME to another.
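
To make the idea concrete, here is a rough sketch of how per-stage
durations could be derived from the timestamps in Nikita's list. The
class and field names are made up for illustration and are not existing
Ignite code. It also assumes that all five timestamps come from the same
node-local clock and are in milliseconds.

```java
/**
 * Hypothetical holder for per-stage PME durations (not an existing Ignite class).
 * Each duration is the gap between an event and the event right before it.
 */
final class PmeStageDurations {
    /** Time from PME start until partition release finished, ms. */
    final long partitionReleaseDuration;

    /** Time from partition release until the single message was sent, ms. */
    final long sendSingleMessageDuration;

    /** Time from sending the single message until the full message arrived, ms. */
    final long receiveFullMessageDuration;

    /** Time from receiving the full message until PME finished, ms. */
    final long finishDuration;

    PmeStageDurations(long initTime, long partitionReleaseTime, long sendSingleMessageTime,
        long receiveFullMessageTime, long finishTime) {
        partitionReleaseDuration = partitionReleaseTime - initTime;
        sendSingleMessageDuration = sendSingleMessageTime - partitionReleaseTime;
        receiveFullMessageDuration = receiveFullMessageTime - sendSingleMessageTime;
        finishDuration = finishTime - receiveFullMessageTime;
    }

    /** Sum of all stages, equal to finishTime - initTime. */
    long totalDuration() {
        return partitionReleaseDuration + sendSingleMessageDuration
            + receiveFullMessageDuration + finishDuration;
    }
}
```

With durations like these, each stage becomes its own graph in the
monitoring system, and a slow stage is visible without subtracting
timestamps by hand.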

When PME causes no blocking, it's a good PME and I see no reason to have
monitoring related to it

Agree with Anton here. These metrics should be measured only for true
distributed exchange. Saving results for client leave/join PMEs will
just complicate monitoring.

I agree with the total blocking duration metric, but
I still don't understand why the instant value indicating that operations
are blocked should be boolean.
The duration since blocking started looks more appropriate and useful.
It gives more information while the semantics stay the same.

Totally agree with Pavel here. Both "accumulated block time" and
"current PME block time" metrics are useful. Growth of the accumulated
metric over a specific period of time (should be easy to check via a
monitoring system graph) will show for how long business operations
were blocked in total, and a non-zero current metric will show that we
are experiencing issues right now. A boolean "are we blocked right now"
metric is not needed, as it can obviously be inferred from "current PME
block time".
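
A minimal sketch of the two metrics, assuming the exchange code calls
one hook when cache operations become blocked and another when they
resume. The class name, the method names and the explicit nowMs
parameter are hypothetical and chosen only for this illustration; this
is not how Ignite implements it.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical tracker for cache-operation blocking (not an existing Ignite class).
 * Timestamps are passed in explicitly (ms) so the logic does not depend on a clock.
 * Assumes the start/end hooks are called from a single exchange thread.
 */
final class BlockingDurationTracker {
    /** Sentinel meaning "operations are not blocked right now". */
    private static final long NOT_BLOCKED = -1L;

    /** Total blocked time over all finished blocking periods, ms. */
    private final AtomicLong accumulated = new AtomicLong();

    /** Start of the current blocking period, or NOT_BLOCKED. */
    private volatile long blockStartTime = NOT_BLOCKED;

    void onBlockingStarted(long nowMs) {
        blockStartTime = nowMs;
    }

    void onBlockingEnded(long nowMs) {
        long start = blockStartTime;

        if (start != NOT_BLOCKED) {
            // The accumulated metric only grows when a blocking period ends.
            accumulated.addAndGet(nowMs - start);
            blockStartTime = NOT_BLOCKED;
        }
    }

    /** "Current PME block time": 0 when not blocked, elapsed ms otherwise. */
    long currentBlockTime(long nowMs) {
        long start = blockStartTime;

        return start == NOT_BLOCKED ? 0L : nowMs - start;
    }

    /** "Accumulated block time" over all finished blocking periods. */
    long accumulatedBlockTime() {
        return accumulated.get();
    }
}
```

A monitoring system could graph accumulatedBlockTime() as a counter and
treat currentBlockTime(now) > 0 as the "blocked right now" signal, which
is why the separate boolean looks redundant.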


Best Regards,
Ivan Rakov

On 23.07.2019 16:02, Pavel Kovalenko wrote:

Nikita,

I agree with the total blocking duration metric, but
I still don't understand why the instant value indicating that operations
are blocked should be boolean.
The duration since blocking started looks more appropriate and useful.
It gives more information while the semantics stay the same.



Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <[email protected]>:

Folks,

All previous suggestions have some disadvantages. There can be several
exchanges between two metric updates, and a fast exchange can overwrite
the value of a previous long exchange.

We can introduce a metric of total blocking duration that will
accumulate at the end of the exchange. So, users will get actual
information about how long operations were blocked. The cluster metric
will be the maximum of the local node metrics. And we need a boolean
metric that will indicate realtime status. It is needed because the
duration metric is updated only at the end of the exchange.
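
Roughly, the cluster-wide value could be computed like this. It is only
a sketch: the class name, the method name and the map from node id to
local value are assumptions made for illustration, not an existing
Ignite API.

```java
import java.util.Map;

/** Hypothetical helper (not an existing Ignite class). */
final class ClusterBlockingMetric {
    /**
     * Cluster metric as the maximum of the local node metrics.
     *
     * @param localTotals Local total blocking duration per node id, ms.
     * @return Largest local value, or 0 if no node has reported yet.
     */
    static long clusterTotalBlockingDuration(Map<String, Long> localTotals) {
        return localTotals.values().stream()
            .mapToLong(Long::longValue)
            .max()
            .orElse(0L);
    }
}
```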

So I propose to replace the current, not yet released metric with the
totalCacheOperationsBlockingDuration metric and to add the
isCacheOperationsBlocked metric.

WDYT?

Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <[email protected]>:

Nikolay,

Still see no reason to replace boolean with long.

On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <[email protected]> wrote:

Anton.

1. The value is exported based on SPI settings, not at the moment it
changes.

2. Clock synchronisation - if we export the start time, we should also
export the node-local timestamp.

Mon, 22 Jul 2019, 8:33 Anton Vinogradov <[email protected]>:

Folks,

What's the reason for duration counting?
AFAIU, counting durations is a monitoring system feature.
Since the monitoring system checks metrics periodically, it will know the
duration from its own log.

On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <[email protected]>
wrote:

Nikita,

Yes, I mean duration, not timestamp. For the metric name, I suggest
"cacheOperationsBlockingDuration"; I think it more clearly represents
what is blocked during PME.
We can also combine both the timestamp "cacheOperationsBlockingStartTs"
and the duration to better correlate when cache operations were blocked
and how much time it took.
For an instant view (like in a JMX bean), a calculated value as you
mentioned can be used.
For metrics that are exported to some backend (IEP-35), a counter can be
used. The counter is incremented by the blocking time after blocking has
ended.

Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <[email protected]>:

Pavel,

The main purpose of this metric is how much time we wait for resuming
cache operations

Seems I misunderstood you. Do you mean timestamp or duration here?

What do you think if we change the boolean value of metric to a long
value that represents time in milliseconds when operations were blocked?

This time can be calculated as (currentTime -
timeSinceOperationsBlocked) in case of timestamp.

Duration will be more understandable. It'll be something like
getCurrentBlockingPmeDuration. But I haven't come up with a better
name yet.

Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <[email protected]>:

Nikita,

I think getCurrentPmeDuration doesn't show useful information. The main
PME side effect for end users is blocking cache operations. Not all of
the PME time blocks them.

What information does the "timeSinceOperationsBlocked" timestamp give to
an end user? What analysis can it be used for, and how?

Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <[email protected]>:

Hi Pavel,

This time can already be obtained from the getCurrentPmeDuration and the
new isOperationsBlockedByPme metrics.

As an alternative solution, I can rework the recently added
getCurrentPmeDuration metric (not released yet). It seems useless for
users in the case of a non-blocking PME.
Let's name it timeSinceOperationsBlocked. It will be the timestamp when
blocking started (the minimal value across cluster nodes) and 0 once
blocking ends (there is no running PME).

WDYT?

Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <[email protected]>:

Hi Nikita,

Thank you for working on this. What do you think if we change the boolean
value of metric to a long value that represents time in milliseconds when
operations were blocked?
Since we have not only JMX, and metrics are now periodically exported to
some backend, it can give a clearer picture of how much time we wait for
resuming cache operations than an instant boolean indicator.

Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <[email protected]>:

Anton, Nikolay,

Thanks for the support.

For now, we have the getCurrentPmeDuration() metric, which does not show
the influence on the cluster correctly. A PME can happen without blocking
operations, for example on client node join/leave events.

I suggest adding a new metric - isOperationsBlockedByPme(). Together,
these metrics will show the influence of the PME on the cluster and on
user operations.

I have prepared a PR for this (Bot visa is green). [1] Can anyone take a
look?

[1] https://issues.apache.org/jira/browse/IGNITE-11961

Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <[email protected]>:

I think the administrator of an Ignite cluster should be able to monitor
all Ignite processes, including non-blocking PME.

On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:

BTW,
Found a PME metric - getCurrentPmeDuration().
Seems it shows exactly the PME time and is not so useful because of this.

The goal is to show exactly the blocking period.
When PME causes no blocking, it's a good PME and I see no reason to have
monitoring related to it :)

On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <[email protected]> wrote:

Anton.

Why do we need to postpone the implementation of these metrics?
For now, implementing a new metric is very simple.

I think we can implement these metrics as a single contribution.

On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:

Nikita,

Looks like all we need now is one simple metric: are operations blocked?

Just a true or false.
Let's start with this.
All other metrics can be extracted from logs now and can be implemented
later.

On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <[email protected]> wrote:

+1.

Nikita, please, go ahead.


Tue, 16 Jul 2019, 11:45 Nikita Amelchev <[email protected]>:

Hello, Igniters.

I suggest adding some useful metrics about the partition map exchange
(PME). For now, the duration of PME stages is available only in log files
and cannot be obtained using JMX or other external tools. [1]

I made a list of local node metrics that help to understand the actual
status of the current PME:

1. initialVersion. Topology version that initiates the exchange.
2. initTime. Time PME was started.
3. initEvent. Event that triggered PME.
4. partitionReleaseTime. Time when a node has finished waiting for all
updates and transactions on a previous topology.
5. sendSingleMessageTime. Time when a node sent a single message.
6. recieveFullMessageTime. Time when a node received a full message.
7. finishTime. Time PME was ended.

When a new PME starts, all these metrics are reset.

These metrics help to understand:
- how long PME was (current or previous).
- how long the wait for all updates to complete was.
- what node blocks PME (didn't send a single message).
- what triggered PME.

Thoughts?

[1] https://issues.apache.org/jira/browse/IGNITE-11961

--
Best wishes,
Amelchev Nikita



