Nikolay,

The cacheOperationsBlockedDuration metric will show the duration of the current blocking, or 0 if cache operations are not blocked right now.
The totalCacheOperationsBlockedDuration metric will accumulate all blocking durations that occur after the node starts.

On Wed, Jul 24, 2019 at 13:35, Nikolay Izhikov <nizhi...@apache.org> wrote:
> Nikita,
>
> What is the difference between those two metrics?
>
> On Wed, Jul 24, 2019 at 12:45, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>> Igniters, thanks for the comments.
>>
>> From the discussion it can be seen that we need only two metrics for now:
>> - cacheOperationsBlockedDuration (long)
>> - totalCacheOperationsBlockedDuration (long)
>>
>> I will prepare a PR shortly.
>>
>> On Wed, Jul 24, 2019 at 09:11, Zhenya Stanilovsky <arzamas...@mail.ru.invalid> wrote:
>>> +1 to Anton's conclusions.
>>>
>>> On Wed, Jul 24, 2019 at 8:44 +03:00, Anton Vinogradov <a...@apache.org> wrote:
>>>> Folks,
>>>>
>>>> It looks like we're trying to implement "extended debug" instead of "monitoring".
>>>> A real admin should not care which phase of PME is in progress and so on.
>>>> The interesting metrics are:
>>>> - total blocked time (will be used for real SLA counting)
>>>> - are we blocked right now (shows we have an SLA degradation right now)
>>>> The duration of the current blocking period can easily be derived by any modern monitoring tool from regular checks.
>>>> The first true will mean "period start"; the precision will depend on the check frequency.
>>>> Anyway, I'm OK with the current metric being a long, where the long is a duration. I see no reason for it, but OK :)
>>>>
>>>> All other features you mentioned are useful for improving code or deployment and can (should) be taken from logs at the analysis phase.
>>>>
>>>> On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
>>>>> Folks, let me step in.
>>>>>
>>>>> Nikita, thanks for your suggestions!
>>>>>
>>>>>> 1. initialVersion. Topology version that initiates the exchange.
>>>>>> 2. initTime. Time PME was started.
>>>>>> 3. initEvent. Event that triggered PME.
>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting for all updates and transactions on a previous topology.
>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
>>>>>> 6. recieveFullMessageTime. Time when a node received a full message.
>>>>>> 7. finishTime. Time PME was ended.
>>>>>>
>>>>>> When a new PME starts, all these metrics are reset.
>>>>> Every metric from Nikita's list looks useful and simple to implement.
>>>>> I think it would be better to change the format of metrics 4, 5, 6 and 7 a bit: we can keep only the difference between the time of the previous event and the time of the corresponding event. Such metrics would be easier to interpret: they answer specific questions such as "how much time did partition release take?" or "how much time did waiting for the end of the distributed phase take?".
>>>>> Also, if the results of 4, 5, 6 and 7 are exported to a monitoring system, graphs will show how the times of the different stages change from one PME to another.
>>>>>
>>>>>> When PME causes no blocking, it's a good PME and I see no reason to have
>>>>>> monitoring related to it
>>>>> Agree with Anton here. These metrics should be measured only for a true distributed exchange. Saving results for client leave/join PMEs will just complicate monitoring.
>>>>>
>>>>>> I agree with the total blocking duration metric, but
>>>>>> I still don't understand why the instant value indicating that operations are
>>>>>> blocked should be a boolean.
>>>>>> The duration since blocking started looks more appropriate and useful.
>>>>>> It gives more information while the semantics stay the same.
>>>>> Totally agree with Pavel here. Both the "accumulated block time" and the "current PME block time" metrics are useful. Growth of the accumulated metric over a specific period of time (should be easy to check via a monitoring system graph) will show for how long business operations were blocked in total, and a non-zero current metric will show that we are experiencing issues right now. A boolean "are we blocked right now" metric is not needed, as it can obviously be inferred from "current PME block time".
>>>>>
>>>>> Best Regards,
>>>>> Ivan Rakov
>>>>>
>>>>> On 23.07.2019 16:02, Pavel Kovalenko wrote:
>>>>>> Nikita,
>>>>>>
>>>>>> I agree with the total blocking duration metric, but I still don't understand why the instant value indicating that operations are blocked should be a boolean.
>>>>>> The duration since blocking started looks more appropriate and useful.
>>>>>> It gives more information while the semantics stay the same.
>>>>>>
>>>>>> On Tue, Jul 23, 2019 at 11:42, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>>>>>>> Folks,
>>>>>>>
>>>>>>> All previous suggestions have some disadvantages. There can be several exchanges between two metric updates, and a fast exchange can overwrite a previous long exchange.
>>>>>>>
>>>>>>> We can introduce a metric of total blocking duration that will accumulate at the end of the exchange. So, users will get actual information about how long operations were blocked. The cluster metric will be the maximum of the local node metrics. We also need a boolean metric that indicates the real-time status. It is needed because the duration metric is updated only at the end of the exchange.
>>>>>>>
>>>>>>> So I propose to change the current, not yet released metric to the totalCacheOperationsBlockingDuration metric and to add the isCacheOperationsBlocked metric.
>>>>>>>
>>>>>>> WDYT?
>>>>>>>
>>>>>>> On Mon, Jul 22, 2019 at 09:27, Anton Vinogradov <a...@apache.org> wrote:
>>>>>>>> Nikolay,
>>>>>>>>
>>>>>>>> I still see no reason to replace the boolean with a long.
>>>>>>>>
>>>>>>>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
>>>>>>>>> Anton.
>>>>>>>>>
>>>>>>>>> 1. The value is exported according to the SPI settings, not at the moment it changes.
>>>>>>>>>
>>>>>>>>> 2. Clock synchronisation: if we export the start time, we should also export the node-local timestamp.
>>>>>>>>>
>>>>>>>>> On Mon, Jul 22, 2019 at 8:33, Anton Vinogradov <a...@apache.org> wrote:
>>>>>>>>>> Folks,
>>>>>>>>>>
>>>>>>>>>> What's the reason for duration counting?
>>>>>>>>>> AFAIU, counting durations is a monitoring system feature.
>>>>>>>>>> Since the monitoring system checks metrics periodically, it will know the duration from its own log.
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokse...@gmail.com> wrote:
>>>>>>>>>>> Nikita,
>>>>>>>>>>>
>>>>>>>>>>> Yes, I mean duration, not timestamp. For the metric name, I suggest "cacheOperationsBlockingDuration"; I think it represents more clearly what is blocked during PME.
>>>>>>>>>>> We can also combine both the timestamp "cacheOperationsBlockingStartTs" and the duration to better correlate when cache operations were blocked and how much time it took.
>>>>>>>>>>> For an instant view (like in a JMX bean), a calculated value as you mentioned can be used.
>>>>>>>>>>> For metrics that are exported to some backend (IEP-35), a counter can be used.
>>>>>>>>>>> The counter is incremented by the blocking time after blocking has ended.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 19, 2019 at 19:10, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>>>>>>>>>>>> Pavel,
>>>>>>>>>>>>
>>>>>>>>>>>> The main purpose of this metric is
>>>>>>>>>>>>>> how much time we wait for resuming cache operations
>>>>>>>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
>>>>>>>>>>>>>> What do you think if we change the boolean value of metric to a long
>>>>>>>>>>>>>> value that represents time in milliseconds when operations were blocked?
>>>>>>>>>>>> In the case of a timestamp, this time can be calculated as (currentTime - timeSinceOperationsBlocked).
>>>>>>>>>>>>
>>>>>>>>>>>> A duration will be more understandable. It'll be something like getCurrentBlockingPmeDuration. But I haven't come up with a better name yet.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jul 19, 2019 at 18:30, Pavel Kovalenko <jokse...@gmail.com> wrote:
>>>>>>>>>>>>> Nikita,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think getCurrentPmeDuration doesn't show useful information. The main PME side effect for end users is the blocking of cache operations. Not all of the PME time blocks them.
>>>>>>>>>>>>> What information does a "timeSinceOperationsBlocked" timestamp give to an end user? For what analysis can it be used, and how?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 17:48, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>>>>>>>>>>>>>> Hi Pavel,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This time can already be obtained from the getCurrentPmeDuration and the new isOperationsBlockedByPme metrics.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As an alternative solution, I can rework the recently added getCurrentPmeDuration metric (not released yet). It seems useless for users in the case of a non-blocking PME.
>>>>>>>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when blocking started (the minimal value across cluster nodes) and 0 when blocking ends (there is no running PME).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com> wrote:
>>>>>>>>>>>>>>> Hi Nikita,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for working on this. What do you think if we change the boolean value of the metric to a long value that represents the time in milliseconds when operations were blocked?
>>>>>>>>>>>>>>> Since we have not only JMX, and metrics are now periodically exported to some backend, it can give a clearer picture of how much time we wait for resuming cache operations than an instant boolean indicator.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>>>>>>>>>>>>>>>> Anton, Nikolay,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the support.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric, which does not correctly show the influence on the cluster. A PME can happen without blocking operations, for example on client node join/leave events.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I suggest adding a new metric, isOperationsBlockedByPme(). Together, these metrics will show the influence of the PME on the cluster and on user operations.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have prepared a PR for this (Bot visa is green). [1] Can anyone take a look?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org> wrote:
>>>>>>>>>>>>>>>>> I think an administrator of an Ignite cluster should be able to monitor all Ignite processes, including non-blocking PME.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
>>>>>>>>>>>>>>>>>> BTW,
>>>>>>>>>>>>>>>>>> Found a PME metric: getCurrentPmeDuration().
>>>>>>>>>>>>>>>>>> It seems to show exactly the PME time and is not so useful because of this.
>>>>>>>>>>>>>>>>>> The goal is to show exactly the blocking period.
>>>>>>>>>>>>>>>>>> When PME causes no blocking, it's a good PME and I see no reason to have monitoring related to it :)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>> Anton.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Why do we need to postpone the implementation of these metrics?
>>>>>>>>>>>>>>>>>>> For now, implementing a new metric is very simple.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think we can implement these metrics as a single contribution.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
>>>>>>>>>>>>>>>>>>>> Nikita,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Looks like all we need now is one simple metric: are operations blocked?
>>>>>>>>>>>>>>>>>>>> Just a true or false.
>>>>>>>>>>>>>>>>>>>> Let's start from this.
>>>>>>>>>>>>>>>>>>>> All other metrics can be extracted from logs now and can be implemented later.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>> +1.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Nikita, please, go ahead.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 11:45, Nikita Amelchev <nsamelc...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> Hello, Igniters.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I suggest adding some useful metrics about the partition map exchange (PME). For now, the duration of PME stages is available only in log files and cannot be obtained using JMX or other external tools. [1]
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I made a list of local node metrics that help to understand the actual status of the current PME:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that initiates the exchange.
>>>>>>>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
>>>>>>>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
>>>>>>>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting for all updates and transactions on a previous topology.
>>>>>>>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
>>>>>>>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node received a full message.
>>>>>>>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> When a new PME starts, all these metrics are reset.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> These metrics help to understand:
>>>>>>>>>>>>>>>>>>>>>> - how long the PME took (current or previous)
>>>>>>>>>>>>>>>>>>>>>> - how long the wait for all updates to complete took
>>>>>>>>>>>>>>>>>>>>>> - which node blocks the PME (didn't send a single message)
>>>>>>>>>>>>>>>>>>>>>> - what triggered the PME
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>>>>>>>>> Amelchev Nikita
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>>> Amelchev Nikita
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>> Amelchev Nikita
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>> Amelchev Nikita
>>>>>>>
>>>>>>> --
>>>>>>> Best wishes,
>>>>>>> Amelchev Nikita
>>>
>>> --
>>> Zhenya Stanilovsky
>>
>> --
>> Best wishes,
>> Amelchev Nikita

--
Best wishes,
Amelchev Nikita
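
To illustrate the difference between the two metrics described at the top of this message, here is a minimal sketch. It is not the Ignite implementation: the class `BlockedDurationTracker`, its callback methods and the `NOT_BLOCKED` sentinel are hypothetical names chosen for illustration. It assumes the total is updated when a blocking period ends, as proposed earlier in the thread.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of the two metrics; not taken from the Ignite code base. */
class BlockedDurationTracker {
    /** Sentinel meaning "cache operations are not blocked right now" (assumption of this sketch). */
    private static final long NOT_BLOCKED = -1L;

    /** Time (ms) at which the current blocking started, or NOT_BLOCKED. */
    private final AtomicLong blockedStartMs = new AtomicLong(NOT_BLOCKED);

    /** Sum (ms) of all finished blocking periods since the node started. */
    private final AtomicLong totalBlockedMs = new AtomicLong();

    /** Called when an exchange starts blocking cache operations. */
    void onBlockingStarted(long nowMs) {
        // Keep the earliest start if blocking is already in progress.
        blockedStartMs.compareAndSet(NOT_BLOCKED, nowMs);
    }

    /** Called when cache operations are resumed. */
    void onBlockingFinished(long nowMs) {
        long start = blockedStartMs.getAndSet(NOT_BLOCKED);

        if (start != NOT_BLOCKED)
            totalBlockedMs.addAndGet(nowMs - start);
    }

    /** Duration of the current blocking, or 0 if there is no blocking right now. */
    long cacheOperationsBlockedDuration(long nowMs) {
        long start = blockedStartMs.get();

        return start == NOT_BLOCKED ? 0 : nowMs - start;
    }

    /** Accumulated duration of all finished blocking periods. */
    long totalCacheOperationsBlockedDuration() {
        return totalBlockedMs.get();
    }
}
```

With this sketch, a monitoring system that polls both values sees a non-zero current duration only while operations are blocked, while the total grows monotonically and can be graphed as a rate over any period. Timestamps are passed in explicitly so that the behaviour is easy to check without a real clock.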