Nikita, what is the difference between those two metrics?
Wed, Jul 24, 2019, 12:45 Nikita Amelchev <nsamelc...@gmail.com>:

> Igniters, thanks for comments.
>
> From the discussion it can be seen that we need only two metrics for now:
> - cacheOperationsBlockedDuration (long)
> - totalCacheOperationsBlockedDuration (long)
>
> I will prepare PR at the nearest time.
>
> Wed, Jul 24, 2019 at 09:11, Zhenya Stanilovsky <arzamas...@mail.ru.invalid>:
> >
> > +1 with Anton decisions.
> >
> > > Wednesday, July 24, 2019, 8:44 +03:00 from Anton Vinogradov <a...@apache.org>:
> > >
> > > Folks,
> > >
> > > It looks like we're trying to implement "extended debug" instead of
> > > "monitoring".
> > > It should not be interesting for real admin what phase of PME is in
> > > progress and so on.
> > > Interesting metrics are
> > > - total blocked time (will be used for real SLA counting)
> > > - are we blocked right now (shows we have an SLA degradation right now)
> > > Duration of the current blocking period can be easily presented using any
> > > modern monitoring tool by regular checks.
> > > Initial true will mean "period start", precision will be a result of
> > > checks frequency.
> > > Anyway, I'm ok to have current metric presented with long, where long is a
> > > duration, see no reason, but ok :)
> > >
> > > All other features you mentioned are useful for code or
> > > deployment improving and can (should) be taken from logs at the analysis
> > > phase.
> > >
> > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > >
> > >> Folks, let me step in.
> > >>
> > >> Nikita, thanks for your suggestions!
> > >>
> > >> > 1. initialVersion. Topology version that initiates the exchange.
> > >> > 2. initTime. Time PME was started.
> > >> > 3. initEvent. Event that triggered PME.
> > >> > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > >> > updates and translations on a previous topology.
> > >> > 5. sendSingleMessageTime. Time when a node sent a single message.
> > >> > 6. recieveFullMessageTime. Time when a node received a full message.
> > >> > 7. finishTime. Time PME was ended.
> > >> >
> > >> > When a new PME starts, all these metrics reset.
> > >> Every metric from Nikita's list looks useful and simple to implement.
> > >> I think that it would be better to change format of metrics 4, 5, 6 and
> > >> 7 a bit: we can keep only difference between time of previous event and
> > >> time of corresponding event. Such metrics would be easier to perceive:
> > >> they answer to specific questions "how much time did partition release
> > >> take?" or "how much time did awaiting of distributed phase end take?".
> > >> Also, if results of 4, 5, 6, 7 will be exported to monitoring system,
> > >> graphs will show how different stages times change from one PME to another.
> > >>
> > >> > When PME cause no blocking, it's a good PME and I see no reason to have
> > >> > monitoring related to it
> > >> Agree with Anton here. These metrics should be measured only for true
> > >> distributed exchange. Saving results for client leave/join PMEs will
> > >> just complicate monitoring.
> > >>
> > >> > I agree with total blocking duration metric but
> > >> > I still don't understand why instant value indicating that operations are
> > >> > blocked should be boolean.
> > >> > Duration time since blocking has started looks more appropriate and useful.
> > >> > It gives more information while semantic is left the same.
> > >> Totally agree with Pavel here. Both "accumulated block time" and
> > >> "current PME block time" metrics are useful. Growth of accumulated
> > >> metric for specific period of time (should be easy to check via
> > >> monitoring system graph) will show for how much business operations were
> > >> blocked in total, and non-zero current metric will show that we are
> > >> experiencing issues right now. Boolean metric "are we blocked right now"
> > >> is not needed as it obviously can be inferred from "current PME block
> > >> time".
> > >>
> > >> Best Regards,
> > >> Ivan Rakov
> > >>
> > >> On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > >> > Nikita,
> > >> >
> > >> > I agree with total blocking duration metric but
> > >> > I still don't understand why instant value indicating that operations are
> > >> > blocked should be boolean.
> > >> > Duration time since blocking has started looks more appropriate and useful.
> > >> > It gives more information while semantic is left the same.
> > >> >
> > >> > Tue, Jul 23, 2019 at 11:42, Nikita Amelchev <nsamelc...@gmail.com>:
> > >> >
> > >> >> Folks,
> > >> >>
> > >> >> All previous suggestions have some disadvantages. It can be several
> > >> >> exchanges between two metric updates and fast exchange can rewrite
> > >> >> previous long exchange.
> > >> >>
> > >> >> We can introduce a metric of total blocking duration that will
> > >> >> accumulate at the end of the exchange. So, users will get actual
> > >> >> information about how long operations were blocked. Cluster metric
> > >> >> will be a maximum of local nodes metrics. And we need a boolean metric
> > >> >> that will indicate realtime status. It is needed because the duration
> > >> >> metric updates at the end of the exchange.
> > >> >>
> > >> >> So I propose to change the current metric that is not released to the
> > >> >> totalCacheOperationsBlockingDuration metric and to add the
> > >> >> isCacheOperationsBlocked metric.
> > >> >>
> > >> >> WDYT?
> > >> >>
> > >> >> Mon, Jul 22, 2019 at 09:27, Anton Vinogradov <a...@apache.org>:
> > >> >>> Nikolay,
> > >> >>>
> > >> >>> Still see no reason to replace boolean with long.
> > >> >>>
> > >> >>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > >> >>>> Anton.
> > >> >>>>
> > >> >>>> 1. Value exported based on SPI settings, not in the moment it changed.
> > >> >>>>
> > >> >>>> 2. Clock synchronisation - if we export start time, we should also export
> > >> >>>> node local timestamp.
> > >> >>>>
> > >> >>>> Mon, Jul 22, 2019, 8:33 Anton Vinogradov <a...@apache.org>:
> > >> >>>>
> > >> >>>>> Folks,
> > >> >>>>>
> > >> >>>>> What's the reason for duration counting?
> > >> >>>>> AFAIU, it's a monitoring system feature to count the durations.
> > >> >>>>> Since monitoring system checks metrics periodically it will know the
> > >> >>>>> duration by its own log.
> > >> >>>>>
> > >> >>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokse...@gmail.com> wrote:
> > >> >>>>>
> > >> >>>>>> Nikita,
> > >> >>>>>>
> > >> >>>>>> Yes, I mean duration not timestamp. For the metric name, I suggest
> > >> >>>>>> "cacheOperationsBlockingDuration", I think it more clearly represents
> > >> >>>>>> what is blocked during PME.
> > >> >>>>>> We can also combine both timestamp "cacheOperationsBlockingStartTs" and
> > >> >>>>>> duration to have better correlation when cache operations were blocked
> > >> >>>>>> and how much time it's taken.
> > >> >>>>>> For instant view (like in JMX bean) a calculated value as you mentioned
> > >> >>>>>> can be used.
> > >> >>>>>> For metrics that are exported to some backend (IEP-35) a counter can be
> > >> >>>>>> used.
> > >> >>>>>> The counter is incremented by blocking time after blocking has ended.
> > >> >>>>>>
> > >> >>>>>> Fri, Jul 19, 2019 at 19:10, Nikita Amelchev <nsamelc...@gmail.com>:
> > >> >>>>>>> Pavel,
> > >> >>>>>>>
> > >> >>>>>>> The main purpose of this metric is
> > >> >>>>>>>>> how much time we wait for resuming cache operations
> > >> >>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
> > >> >>>>>>>>> What do you think if we change the boolean value of metric to a
> > >> >>>>>>>>> long value that represents time in milliseconds when operations
> > >> >>>>>>>>> were blocked?
> > >> >>>>>>> This time can be calculated as (currentTime -
> > >> >>>>>>> timeSinceOperationsBlocked) in case of timestamp.
> > >> >>>>>>>
> > >> >>>>>>> Duration will be more understandable. It'll be something like
> > >> >>>>>>> getCurrentBlockingPmeDuration. But I haven't come up with a better
> > >> >>>>>>> name yet.
> > >> >>>>>>>
> > >> >>>>>>> Fri, Jul 19, 2019 at 18:30, Pavel Kovalenko <jokse...@gmail.com>:
> > >> >>>>>>>> Nikita,
> > >> >>>>>>>>
> > >> >>>>>>>> I think getCurrentPmeDuration doesn't show useful information. The
> > >> >>>>>>>> main PME side effect for end-users is blocking cache operations. Not
> > >> >>>>>>>> all PME time blocks it.
> > >> >>>>>>>> What information gives to an end-user timestamp of
> > >> >>>>>>>> "timeSinceOperationsBlocked"? For what analysis it can be used and
> > >> >>>>>>>> how?
> > >> >>>>>>>>
> > >> >>>>>>>> Fri, Jul 19, 2019 at 17:48, Nikita Amelchev <nsamelc...@gmail.com>:
> > >> >>>>>>>>> Hi Pavel,
> > >> >>>>>>>>>
> > >> >>>>>>>>> This time already can be obtained from the getCurrentPmeDuration and
> > >> >>>>>>>>> new isOperationsBlockedByPme metrics.
> > >> >>>>>>>>>
> > >> >>>>>>>>> As an alternative solution, I can rework recently added
> > >> >>>>>>>>> getCurrentPmeDuration metric (not released yet). Seems for users it
> > >> >>>>>>>>> is useless in case of non-blocking PME.
> > >> >>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be timestamp when
> > >> >>>>>>>>> blocking started (minimal value of cluster nodes) and 0 if blocking
> > >> >>>>>>>>> ends (there is no running PME).
> > >> >>>>>>>>>
> > >> >>>>>>>>> WDYT?
> > >> >>>>>>>>>
> > >> >>>>>>>>> Fri, Jul 19, 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com>:
> > >> >>>>>>>>>> Hi Nikita,
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Thank you for working on this. What do you think if we change the
> > >> >>>>>>>>>> boolean value of metric to a long value that represents time in
> > >> >>>>>>>>>> milliseconds when operations were blocked?
> > >> >>>>>>>>>> Since we have not only JMX and now metrics are periodically
> > >> >>>>>>>>>> exported to some backend it can give a more clear picture of how
> > >> >>>>>>>>>> much time we wait for resuming cache operations instead of instant
> > >> >>>>>>>>>> boolean indicator.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Fri, Jul 19, 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com>:
> > >> >>>>>>>>>>> Anton, Nikolay,
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> Thanks for the support.
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric that does not
> > >> >>>>>>>>>>> show influence on the cluster correctly. PME can be without
> > >> >>>>>>>>>>> blocking operations. For example, client node join/leave events.
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> I suggest adding a new metric - isOperationsBlockedByPme().
> > >> >>>>>>>>>>> Together, these metrics will show influence of the PME on cluster
> > >> >>>>>>>>>>> and user operations.
> > >> >>>>>>>>>>> I have prepared PR for this (Bot visa is green). [1] Can anyone
> > >> >>>>>>>>>>> take a look?
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> Tue, Jul 16, 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org>:
> > >> >>>>>>>>>>>> I think administrator of Ignite cluster should be able to monitor
> > >> >>>>>>>>>>>> all Ignite process, including non blocking PME.
> > >> >>>>>>>>>>>>
> > >> >>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > >> >>>>>>>>>>>>> BTW,
> > >> >>>>>>>>>>>>> Found PME metric - getCurrentPmeDuration().
> > >> >>>>>>>>>>>>> Seems, it shows exactly PME time and not so useful because of
> > >> >>>>>>>>>>>>> this.
> > >> >>>>>>>>>>>>> The goal is to show exactly blocking period.
> > >> >>>>>>>>>>>>> When PME cause no blocking, it's a good PME and I see no reason
> > >> >>>>>>>>>>>>> to have monitoring related to it :)
> > >> >>>>>>>>>>>>>
> > >> >>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > >> >>>>>>>>>>>>>> Anton.
> > >> >>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>> Why do we need to postpone implementation of this metrics?
> > >> >>>>>>>>>>>>>> For now, implementation of new metric is very simple.
> > >> >>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>> I think we can implement this metrics as a single contribution.
> > >> >>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > >> >>>>>>>>>>>>>>> Nikita,
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>> Looks like all we need now is a 1 simple metric: are
> > >> >>>>>>>>>>>>>>> operations blocked?
> > >> >>>>>>>>>>>>>>> Just a true or false.
> > >> >>>>>>>>>>>>>>> Let's start from this.
> > >> >>>>>>>>>>>>>>> All other metrics can be extracted from logs now and can be
> > >> >>>>>>>>>>>>>>> implemented later.
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > >> >>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> +1.
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> Nikita, please, go ahead.
> > >> >>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>> Tue, Jul 16, 2019, 11:45 Nikita Amelchev <nsamelc...@gmail.com>:
> > >> >>>>>>>>>>>>>>>>> Hello, Igniters.
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> I suggest to add some useful metrics about the partition map
> > >> >>>>>>>>>>>>>>>>> exchange (PME). For now, the duration of PME stages is
> > >> >>>>>>>>>>>>>>>>> available only in log files and cannot be obtained using JMX
> > >> >>>>>>>>>>>>>>>>> or other external tools. [1]
> > >> >>>>>>>>>>>>>>>>> I made the list of local node metrics that help to understand
> > >> >>>>>>>>>>>>>>>>> the actual status of current PME:
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that initiates the exchange.
> > >> >>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
> > >> >>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
> > >> >>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting
> > >> >>>>>>>>>>>>>>>>> for all updates and translations on a previous topology.
> > >> >>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
> > >> >>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node received a full message.
> > >> >>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> When a new PME starts, all these metrics reset.
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> These metrics help to understand:
> > >> >>>>>>>>>>>>>>>>> - how long PME was (current or previous).
> > >> >>>>>>>>>>>>>>>>> - how long awaited for all updates was completed.
> > >> >>>>>>>>>>>>>>>>> - what node blocks PME (didn't send a single message)
> > >> >>>>>>>>>>>>>>>>> - what triggered PME.
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> Thoughts?
> > >> >>>>>>>>>>>>>>>>>
> > >> >>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > >> >>>>>>>>>>>>>>>>> --
> > >> >>>>>>>>>>>>>>>>> Best wishes,
> > >> >>>>>>>>>>>>>>>>> Amelchev Nikita
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> --
> > >> >>>>>>>>>>> Best wishes,
> > >> >>>>>>>>>>> Amelchev Nikita
> > >> >>>>>>>>>
> > >> >>>>>>>>> --
> > >> >>>>>>>>> Best wishes,
> > >> >>>>>>>>> Amelchev Nikita
> > >> >>>>>>>
> > >> >>>>>>> --
> > >> >>>>>>> Best wishes,
> > >> >>>>>>> Amelchev Nikita
> > >> >>
> > >> >> --
> > >> >> Best wishes,
> > >> >> Amelchev Nikita
> >
> > --
> > Zhenya Stanilovsky
>
> --
> Best wishes,
> Amelchev Nikita
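
For reference, here is a rough sketch of how the two metrics named at the top of the thread (cacheOperationsBlockedDuration and totalCacheOperationsBlockedDuration) might relate to each other. It is only an illustration of the semantics discussed above, not Ignite code: the class BlockedDurationTracker, its callback names, and the explicit nowMillis parameter are hypothetical, and it assumes (as proposed in the thread) that the total is accumulated when a blocking period ends.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of Ignite) showing the two metrics from the thread:
 * a "current" duration that is non-zero only while operations are blocked, and a
 * "total" duration accumulated when each blocking period ends.
 */
final class BlockedDurationTracker {
    /** Sentinel meaning "cache operations are not blocked right now". */
    private static final long NOT_BLOCKED = -1L;

    /** Start time of the current blocking period, or NOT_BLOCKED. */
    private final AtomicLong blockStartMillis = new AtomicLong(NOT_BLOCKED);

    /** Sum of all finished blocking periods, in milliseconds. */
    private final AtomicLong totalBlockedMillis = new AtomicLong();

    /** Assumed callback: a blocking exchange has started blocking cache operations. */
    void onBlockingStarted(long nowMillis) {
        // Keep the earliest start if blocking is already in progress.
        blockStartMillis.compareAndSet(NOT_BLOCKED, nowMillis);
    }

    /** Assumed callback: cache operations are unblocked again. */
    void onBlockingFinished(long nowMillis) {
        long start = blockStartMillis.getAndSet(NOT_BLOCKED);

        if (start != NOT_BLOCKED)
            totalBlockedMillis.addAndGet(Math.max(0L, nowMillis - start));
    }

    /** Current metric: 0 when not blocked, otherwise time since blocking started. */
    long cacheOperationsBlockedDuration(long nowMillis) {
        long start = blockStartMillis.get();

        return start == NOT_BLOCKED ? 0L : Math.max(0L, nowMillis - start);
    }

    /** Total metric: grows only when a blocking period ends. */
    long totalCacheOperationsBlockedDuration() {
        return totalBlockedMillis.get();
    }
}
```

With this shape, a non-zero current value plays the role of the boolean "are we blocked right now" indicator, while the growth of the total over a time window gives the blocked time for SLA accounting. Passing the clock value in explicitly is just a convenience for the example; a real implementation would likely read the time itself.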