Hello, Andrey.

> Where the sense in this value? I explained why this metrics are relatively
> useless.

I don’t agree with you.
I believe they are not useless for a user, and I will try to explain why.

> But user can't distinguish one transaction from another, so his knowledge
> doesn't make sense definitely.

Users don’t need to distinguish them.
If some percentage of the transactions that involve a cache are relatively slow, that is a trigger for a deeper investigation.

> 1. Measure some important internals (WAL operations, checkpoint time, etc)
> because it can talk about real problems.

We have already implemented this.
Which metrics for internal processes are missing?

> 2. Measure business operations in user context, not cache API operations.

Why do you think these approaches exclude one another?
Users definitely should measure the performance of the whole business transaction.
I think we should also provide a way to measure the part of the business transaction that relates to Ignite.

> On Dec 20, 2019, at 13:02, Andrey Gura <ag...@apache.org> wrote:
>
>> The goal of the proposed metrics is to measure whole cache operations
>> behavior.
>> It provides some kind of statistics(histograms) for it.
>
> Nikolay, reformulating doesn't make metrics more meaningful. Seriously :)
>
>> Yes, metrics will evaluate API call performance
>
> And what? Where the sense in this value? I explained why this metrics
> are relatively useless.
>
>> These are metrics of client-side operation performance.
>
> Again. It's just a number without any sense.
>
>> I think a specific user has knowledge - what are his transactions.
>
> May be. But user can't distinguish one transaction from another, so
> his knowledge doesn't make sense definitely.
>
>> From these metrics it can answer on the question «If my transaction includes
>> cacheXXX, how long it usually takes?»
>
> Actually not. The same caches can be involved in a dozen of
> transactions and there are no ways to understand what transactions are
> slow or fast. It is useless.
>
>> I disagree here.
>> If you have a better approach to measure cache operations performance -
>> please, share your vision.
>
> I already wrote about better approach. Two main points:
>
> 1. Measure some important internals (WAL operations, checkpoint time,
> etc) because it can talk about real problems.
> 2. Measure business operations in user context, not cache API operations.
>
> So what we have? We have useless metrics that are doubled by useless
> histograms.
>
> We should reconsider approach to metrics and performance measuring. It
> is hard and long task. There are no need to commit tons of useless
> metrics that just decrease performance.
>
> Sorry for some sarcasm but I really believe in my opinion. Metrics
> problem exists very very long time and existing metrics discussed many
> times. No one can explain this metrics to users because it requires
> too many additional knowledge about internals. And metric value
> itself depends on many aspects of internals. It leads to impossibility
> of interpretation. And it's good time to remove it (in AI 3.0 due to a
> backward compatibility).
>
> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <nizhikov....@gmail.com> wrote:
>>
>> Hello, Andrey.
>>
>> The goal of the proposed metrics is to measure whole cache operations
>> behavior.
>> It provides some kind of statistics(histograms) for it.
>> For more fine-grained analysis one will be use tracing or other «go deeper»
>> tools.
>>
>>>> Measured for API calls on the caller node side
>>> Values will the same only for cases when node is remote relative to data
>>
>> Yes, metrics will evaluate API call performance.
>> I think this is the most valuable information from a user's point of view.
>>
>> Regular user wants to know how fast his cache operation performs.
>> And these metrics provide the answer.
>>
>>> For regular data node (server node) timing will depend on answers for
>>> question:
>>
>> I think these answers are always available.
>> I barely can imagine a scenario when one monitor «black box» cluster and
>> don’t know it.
>> Even so, all answers are provided through system view we brought to the
>> Ignite :)
>>
>>> What is transaction commit or rollback time?
>>
>> These are metrics of client-side operation performance.
>>
>> I think a specific user has knowledge - what are his transactions.
>> From these metrics it can answer on the question «If my transaction includes
>> cacheXXX, how long it usually takes?»
>> I think it’s very valuable knowledge.
>>
>>> It will be implemented for most types of messages.
>>
>> Good, let’s do it?
>>
>>> So, from my point of view, commits for get/put/remove and commit/rollback
>>> should be reverted.
>>
>> I disagree here.
>> If you have a better approach to measure cache operations performance -
>> please, share your vision.
>>
>>> On Dec 19, 2019, at 16:03, Andrey Gura <ag...@apache.org> wrote:
>>>
>>> From my point of view, Ignite should provide meaningful metrics for
>>> internal components that could be useful for monitoring and analysis.
>>> All suggested options are meaningless in a sense. Below I'll try
>>> explain why.
>>>
>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
>>>> caller node side.
>>>> Implemented in [1], commit [2].
>>>
>>> All cache operations in Ignite are distributed. So each value measured
>>> for some cache operation will vary depending on where actually
>>> operation is performed. Values will the same only for cases when node
>>> is remote relative to data (e.g. client node).
>>>
>>> For regular data node (server node) timing will depend on answers for
>>> question:
>>>
>>> - is node primary for particular key or not? (for all operations)
>>> - how many backups configured for the cache? (for put and remove)
>>> - what write synchronization mode is configured for particular cache?
>>> (for put and remove)
>>> - is readFromBackup enabled for the cache? (for get)
>>>
>>> Both Ignite users and Ignite developers can't make any decision based
>>> on this metrics.
>>>
>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
>>>> caller node side [3].
>>>
>>> What is transaction commit or rollback time? How it calculates in
>>> Ignite now? What actions included into transaction? What actions not
>>> related with cache executed during transactions?
>>>
>>> There is no any sense in time of transaction commit or rollback
>>> because there are no any way to understand what transaction was
>>> performed in particular period of time. Usually a lot of transactions
>>> and we can't to distinguish from each other.
>>>
>>> Moreover, transaction usually treats as business operation. So only
>>> way to measure performance properly is measure business operation
>>> time. That is user should create own metrics set for some business
>>> API.
>>>
>>> Further. What about cross cache transactions? At the moment tx
>>> commit/rollback time will be added to corresponding metrics per each
>>> cache involved in the transaction. The *same time* for *each cache*.
>>> Absolutely meaningless.
>>>
>>> Again, both Ignite users and Ignite developers can't make any decision
>>> based on this metrics. But users can create own metrics set.
>>>
>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
>>>> Ticket doesn't exist for it.
>>>
>>> It will be implemented for most types of messages.
>>>
>>> Metrics, application monitoring, performance analysis and measurement
>>> are a little harder than it sounds. Therefore, we must approach this
>>> issue more carefully.
>>> Blindly adding new types of metrics will not only not improve the
>>> situation, but will also worsen the overall performance of the system
>>> because metric calculation always on the hot path.
>>>
>>> So, from my point of view, commits for get/put/remove and
>>> commit/rollback should be reverted.
>>>
>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <nsamelc...@gmail.com>
>>> wrote:
>>>>
>>>> I think these metrics are useful.
>>>>
>>>> I have prepared PR [1] for commit and rollback histograms. [2]
>>>> Nikolay, could you take a look, please?
>>>>
>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
>>>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
>>>>>> Ticket doesn't exist for it.
>>>>
>>>> I have filed a ticket for it. [3]
>>>>
>>>> [1] https://github.com/apache/ignite/pull/7141
>>>> [2] https://issues.apache.org/jira/browse/IGNITE-12450
>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12453
>>>>
>>>> On Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov
>>>> <alexey.scherbak...@gmail.com> wrote:
>>>>>
>>>>> I think they are very useful.
>>>>>
>>>>> On Mon, Dec 16, 2019 at 10:51, Николай Ижиков <nizhi...@apache.org> wrote:
>>>>>
>>>>>> Hello, Alexei.
>>>>>>
>>>>>> Thanks for the link on the ticket, labeled it with the IEP-35 label.
>>>>>> What do you think about proposed metrics set?
>>>>>>
>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
>>>>>>>
>>>>>>> Nikolay,
>>>>>>>
>>>>>>> What about batch operations?
>>>>>>>
>>>>>>> For messages processing the ticket does exist and even has an
>>>>>>> implementation from before new metrics API times [1]
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-10418
>>>>>>>
>>>>>>> On Mon, Dec 16, 2019 at 10:12, Николай Ижиков <nizhi...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hello, Igniters.
>>>>>>>>
>>>>>>>> I want to provide the user answers to the following question: "How cache
>>>>>>>> API operations perform?"
>>>>>>>> It seems, we need to implements metrics for basic cache API operations
>>>>>>>> like get, put, remove for it.
>>>>>>>>
>>>>>>>> I think we should provide the following metrics:
>>>>>>>>
>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
>>>>>>>> caller node side.
>>>>>>>> Implemented in [1], commit [2].
>>>>>>>>
>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
>>>>>>>> caller node side [3].
>>>>>>>>
>>>>>>>> * histograms that measure the time of processing `get`, `put`,
>>>>>>>> `remove`,
>>>>>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
>>>>>>>> Ticket doesn't exist for it.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-12219
>>>>>>>> [2] https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
>>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12450
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Alexei Scherbakov
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Best regards,
>>>>> Alexei Scherbakov
>>>>
>>>>
>>>>
>>>> --
>>>> Best wishes,
>>>> Amelchev Nikita
>>
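
For readers of this thread, the two measurement styles under discussion (time histograms per cache API call, and a user-defined metric around a whole business operation) can be sketched side by side. This is a minimal illustration in Python, not Ignite code: `LatencyHistogram`, `timed`, `transfer`, the bucket bounds, and the dict used as a stand-in cache are all hypothetical names and values chosen for the example.

```python
import time
from bisect import bisect_left
from contextlib import contextmanager


class LatencyHistogram:
    """Counts durations per bucket; bounds are upper limits in milliseconds."""

    def __init__(self, bounds_ms):
        self.bounds_ms = sorted(bounds_ms)
        # One counter per bound plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds_ms) + 1)

    def record(self, duration_ms):
        # bisect_left picks the first bucket whose upper bound is >= duration.
        self.counts[bisect_left(self.bounds_ms, duration_ms)] += 1

    def slow_fraction(self, threshold_ms):
        """Share of samples in buckets lying entirely above threshold_ms."""
        total = sum(self.counts)
        if total == 0:
            return 0.0
        first_slow = bisect_left(self.bounds_ms, threshold_ms) + 1
        return sum(self.counts[first_slow:]) / total


@contextmanager
def timed(histogram, clock=time.perf_counter):
    """Records the wall-clock duration of the enclosed block."""
    start = clock()
    try:
        yield
    finally:
        histogram.record((clock() - start) * 1000.0)


cache = {}  # stand-in for a real cache (assumption, not an Ignite API)
put_hist = LatencyHistogram([1, 10, 100])       # per-API-call metric
transfer_hist = LatencyHistogram([1, 10, 100])  # business-operation metric


def transfer(src, dst, amount):
    with timed(transfer_hist):  # the whole business operation
        with timed(put_hist):   # the cache-related part only
            cache[src] = cache.get(src, 0) - amount
        with timed(put_hist):
            cache[dst] = cache.get(dst, 0) + amount


transfer("a", "b", 5)
```

In this sketch `slow_fraction` plays the role of the "some percentage of operations are relatively slow" trigger: it says nothing about which transaction was slow, only that a closer look (for example, tracing) may be warranted, while the outer histogram measures the business operation as a whole.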