Re: Cache operations performance metrics

Andrey Gura Fri, 20 Dec 2019 06:14:15 -0800

>> And if we have two metrics that are triggered for the same then one of them 
>> is useless.
> I don't understand what is the two metrics you are talking about.


Please don't lose context. In your example they were checkpoint time
and some cache operation time.

> A business transaction includes work with several data sources, sending 
> network requests, executing some remote services.
> If it becomes slower then we should know - what basic API operations become 
> slower.

No, we should know which basic operations become slower. It may be a
problem with the network (net I/O), with the disk (disk I/O), with the
JVM (VM internal metrics), etc. All these operations are the building
blocks of API operations.

> So we have to measure  `PutTime` from Ignite, `InsertTime` from RDBMS and 
> other parts of a transaction.

We can't do it properly because of the specifics of the transaction
implementation. I already wrote about it. It doesn't mean that we must
not fix it. But it also means that we should reconsider the approach to
metrics in Ignite. Bigger doesn't mean better.

> Ignite cache operations obviously becomes 2 times slower.
> *Why* they become slower is the question of an ongoing investigation.

But business operation metrics will show the same. And many other
internals-related metrics will show the same. It is a transitive,
redundant and relatively useless metric if it doesn't bring any new
information. 500 caches with similar configurations (the same nodes,
the same data region, the same affinity, etc.) and 500 metrics like
put time will show the same trend. And a couple of system internal
metrics will show the same trend. A couple vs. hundreds. It doesn't
make sense and is useless.
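
As an illustration of measuring the business operation itself rather
than each cache call, here is a minimal sketch in plain Java. It is not
Ignite API: the class name `BusinessOpTimer`, the bucket bounds and the
`time(...)` helper are all hypothetical, and the lambda only stands in
for real work (cache calls, RDBMS inserts, remote services).

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.Supplier;

/** Hypothetical user-side latency histogram; not part of the Ignite API. */
public class BusinessOpTimer {
    /** Inclusive upper bounds of the buckets in ms; one extra bucket catches the rest. */
    private final long[] boundsMs;

    /** Number of recorded operations per bucket. */
    private final AtomicLongArray counts;

    public BusinessOpTimer(long... boundsMs) {
        this.boundsMs = boundsMs.clone();
        this.counts = new AtomicLongArray(boundsMs.length + 1);
    }

    /** Puts one measured duration into the first bucket whose bound is not exceeded. */
    public void record(long durationMs) {
        int i = 0;
        while (i < boundsMs.length && durationMs > boundsMs[i])
            i++;
        counts.incrementAndGet(i);
    }

    /** Runs the whole business operation and records its wall-clock time. */
    public <T> T time(Supplier<T> op) {
        long start = System.nanoTime();
        try {
            return op.get();
        }
        finally {
            record((System.nanoTime() - start) / 1_000_000);
        }
    }

    /** Returns how many operations fell into the given bucket. */
    public long count(int bucket) {
        return counts.get(bucket);
    }

    public static void main(String[] args) {
        BusinessOpTimer timer = new BusinessOpTimer(10, 100, 1000);

        // Placeholder for a real business operation (assumption).
        String res = timer.time(() -> "order processed");

        System.out.println(res + ", operations under 10 ms: " + timer.count(0));
    }
}
```

One timer per business operation type keeps the number of metrics
proportional to the number of operations the user cares about, not to
the number of caches.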

> I tried to look at other open-source products.
> Here is an example of metrics provided by Apache Kafka [1] [2]

If somebody does something, it doesn't mean that they do it properly.

On Fri, Dec 20, 2019 at 4:28 PM Николай Ижиков <[email protected]> wrote:
>
> > And if we have two metrics that are triggered for the same then one of them 
> > is useless.
>
> I don't understand what is the two metrics you are talking about.
> I wrote about a single metric for a single cache operation.
>
> > Obviously if you want know how fast or slow your business operation you 
> > must measure latency of your business operation. What could be easier?
>
> A business transaction includes work with several data sources, sending 
> network requests, executing some remote services.
> If it becomes slower then we should know - what basic API operations become 
> slower.
> So we have to measure  `PutTime` from Ignite, `InsertTime` from RDBMS and 
> other parts of a transaction.
>
> Ignite will provide this kind of value out of the box.
> I think it’s useful values.
>
> > User saw "cache put time" metric becomes x2 bigger. Does it become slower 
> > or faster? Or we just put into the cache values that 4x bigger in size?
>
> Ignite cache operations obviously becomes 2 times slower.
> *Why* they become slower is the question of an ongoing investigation.
>
> I tried to look at other open-source products.
> Here is an example of metrics provided by Apache Kafka [1] [2]
>
> `request-latency-avg` - The average request latency in ms.
> `records-lag-max` - The maximum lag in terms of number of records for any 
> partition in this window. An increasing value over time is your best 
> indication that the consumer group is not keeping up with the producers.
> `fetch-latency-avg` - The average time taken for a fetch request.
>
> It seems, they implement a similar approach to what I proposed.
>
>
> [1] https://docs.confluent.io/current/kafka/monitoring.html#producer-metrics
> [2] 
> https://docs.confluent.io/current/kafka/monitoring.html#new-consumer-metrics
>
> > On Dec 20, 2019, at 15:53, Andrey Gura <[email protected]> wrote:
> >
> >> For example, the user saw «checkpoint time» metric becomes x2 bigger.
> >
> > I just quote your words: " this is a trigger to make a deeper
> > investigation". And if we have two metrics that are triggered for the
> > same then one of them is useless.
> >
> >> How it relates to business operations?
> >
> > Why it should be related with business operation? It is concrete
> > metrics for concrete process which can slowdown all operations in the
> > system. Obviously if you want know how fast or slow your business
> > operation you must measure latency of your business operation. What
> > could be easier?
> >
> >> Is it become slower or faster?
> >
> > Very correct question! User saw "cache put time" metric becomes x2
> > bigger. Does it become slower or faster? Or we just put into the cache
> > values that 4x bigger in size? Or all time before we put values
> > locally and now we put values on remote nodes. Or our operations
> > execute in transaction and then time will depend on transaction type,
> > actions in transaction and other transaction and actually will nothing
> > talk about real cache operation. We have more questions then answers.
> >
> >> On the other hand - if `PutTime` increased - then we know for sure, all 
> >> operation executing `put` becomes slower.
> >
> > Of course not :) See above.
> >
> > On Fri, Dec 20, 2019 at 3:20 PM Николай Ижиков <[email protected]> wrote:
> >>
> >>> It also will be visible on other metrics
> >>
> >> How will it be visible?
> >>
> >> For example, the user saw «checkpoint time» metric becomes x2 bigger.
> >> How it relates to business operations? Is it become slower or faster?
> >> What does it mean for an application performance?
> >>
> >> On the other hand - if `PutTime` increased - then we know for sure, all 
> >> operation executing `put` becomes slower.
> >>
> >> *Why* it’s become slower - is the essence of «go deeper» investigation.
> >>
> >>> On Dec 20, 2019, at 15:07, Andrey Gura <[email protected]> wrote:
> >>>
> >>>> If a cache has some percent of the relatively slow transaction this is a 
> >>>> trigger to make a deeper investigation.
> >>>
> >>> It also will be visible on other metrics. So cache operations metrics
> >>> still useless because it transitive values.
> >>>
> >>>>> 1. Measure some important internals (WAL operations, checkpoint time, 
> >>>>> etc) because it can talk about real problems.
> >>>
> >>>> We already implement it.
> >>>
> >>> I don't talk that it isn't implemented. It is just example of things
> >>> that should be measured. All other metrics depends on internals.
> >>>
> >>>>> 2. Measure business operations in user context, not cache API 
> >>>>> operations.
> >>>
> >>>> Why do you think these approaches should exclude one another?
> >>>
> >>> Because one of them is useless.
> >>>
> >>> On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков <[email protected]> 
> >>> wrote:
> >>>>
> >>>> Hello, Andrey.
> >>>>
> >>>>> Where the sense in this value? I explained why this metrics are 
> >>>>> relatively useless.
> >>>>
> >>>> I don’t agree with you.
> >>>> I believe they are not useless for a user.
> >>>> And I try to explain why I think so.
> >>>>
> >>>>> But user can't distinguish one transaction from another, so his 
> >>>>> knowledge doesn't make sense definitely.
> >>>>
> >>>> Users shouldn’t distinguish.
> >>>> If a cache has some percent of the relatively slow transaction this is a 
> >>>> trigger to make a deeper investigation.
> >>>>
> >>>>> 1. Measure some important internals (WAL operations, checkpoint time, 
> >>>>> etc) because it can talk about real problems.
> >>>>
> >>>> We already implement it.
> >>>> What metrics are missing for internal processes?
> >>>>
> >>>>> 2. Measure business operations in user context, not cache API 
> >>>>> operations.
> >>>>
> >>>> Why do you think these approaches should exclude one another?
> >>>> Users definitely should measure whole business transaction performance.
> >>>>
> >>>> I think we should provide a way to measure part of the business 
> >>>> transaction that relates to the Ignite.
> >>>>
> >>>>
> >>>>> On Dec 20, 2019, at 13:02, Andrey Gura <[email protected]> wrote:
> >>>>>
> >>>>>> The goal of the proposed metrics is to measure whole cache operations 
> >>>>>> behavior.
> >>>>>> It provides some kind of statistics(histograms) for it.
> >>>>>
> >>>>> Nikolay, reformulating doesn't make metrics more meaningful. Seriously 
> >>>>> :)
> >>>>>
> >>>>>> Yes, metrics will evaluate API call performance
> >>>>>
> >>>>> And what? Where the sense in this value? I explained why this metrics
> >>>>> are relatively useless.
> >>>>>
> >>>>>> These are metrics of client-side operation performance.
> >>>>>
> >>>>> Again. It's just a number without any sense.
> >>>>>
> >>>>>> I think a specific user has knowledge - what are his transactions.
> >>>>>
> >>>>> May be. But user can't distinguish one transaction from another, so
> >>>>> his knowledge doesn't make sense definitely.
> >>>>>
> >>>>>> From these metrics it can answer on the question «If my transaction 
> >>>>>> includes cacheXXX, how long it usually takes?»
> >>>>>
> >>>>> Actually not. The same caches can be involved  in a dozen of
> >>>>> transactions and there are no ways to understand what transactions are
> >>>>> slow or fast. It is useless.
> >>>>>
> >>>>>> I disagree here.
> >>>>>> If you have a better approach to measure cache operations performance 
> >>>>>> - please, share your vision.
> >>>>>
> >>>>> I already wrote about better approach. Two main points:
> >>>>>
> >>>>> 1. Measure some important internals (WAL operations, checkpoint time,
> >>>>> etc) because it can talk about real problems.
> >>>>> 2. Measure business operations in user context, not cache API 
> >>>>> operations.
> >>>>>
> >>>>> So  what we have? We have useless metrics that are doubled by useless
> >>>>> histograms.
> >>>>>
> >>>>> We should reconsider approach to metrics and performance measuring. It
> >>>>> is hard and long task. There are no need to commit tons of useless
> >>>>> metrics that just decrease performance.
> >>>>>
> >>>>> Sorry for some sarcasm but I really believe in my opinion. Metrics
> >>>>> problem exists very very long time and existing metrics discussed many
> >>>>> times. No one can explain this metrics to users because it requires
> >>>>> too many additional knowledge about internals. And metric  value
> >>>>> itself depends on many aspects of internals. It leads to impossibility
> >>>>> of interpretation. And it's good time to remove it (in AI 3.0 due to a
> >>>>> backward compatibility).
> >>>>>
> >>>>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <[email protected]> 
> >>>>> wrote:
> >>>>>>
> >>>>>> Hello, Andrey.
> >>>>>>
> >>>>>> The goal of the proposed metrics is to measure whole cache operations 
> >>>>>> behavior.
> >>>>>> It provides some kind of statistics(histograms) for it.
> >>>>>> For more fine-grained analysis one will be use tracing or other «go 
> >>>>>> deeper» tools.
> >>>>>>
> >>>>>>>> Measured for API calls on the caller node side
> >>>>>>> Values will be the same only for cases when node is remote relative to 
> >>>>>>> data
> >>>>>>
> >>>>>> Yes, metrics will evaluate API call performance.
> >>>>>> I think this is the most valuable information from a user's point of 
> >>>>>> view.
> >>>>>>
> >>>>>> Regular user wants to know how fast his cache operation performs.
> >>>>>> And these metrics provide the answer.
> >>>>>>
> >>>>>>> For regular data node (server node) timing will depend on answers for 
> >>>>>>> question:
> >>>>>>
> >>>>>> I think these answers are always available.
> >>>>>> I barely can imagine a scenario when one monitor «black box» cluster 
> >>>>>> and don’t know it.
> >>>>>> Even so, all answers are provided through system view we brought to 
> >>>>>> the Ignite :)
> >>>>>>
> >>>>>>> What is transaction commit or rollback time?
> >>>>>>
> >>>>>> These are metrics of client-side operation performance.
> >>>>>>
> >>>>>> I think a specific user has knowledge - what are his transactions.
> >>>>>> From these metrics it can answer on the question «If my transaction 
> >>>>>> includes cacheXXX, how long it usually takes?»
> >>>>>> I think it’s very valuable knowledge.
> >>>>>>
> >>>>>>> It will be implemented for most types of messages.
> >>>>>>
> >>>>>> Good, let’s do it?
> >>>>>>
> >>>>>>> So, from my point of view, commits for get/put/remove and 
> >>>>>>> commit/rollback should be reverted.
> >>>>>>
> >>>>>> I disagree here.
> >>>>>> If you have a better approach to measure cache operations performance 
> >>>>>> - please, share your vision.
> >>>>>>
> >>>>>>> On Dec 19, 2019, at 16:03, Andrey Gura <[email protected]> wrote:
> >>>>>>>
> >>>>>>> From my point of view, Ignite should provide meaningful metrics for
> >>>>>>> internal components that could be useful for monitoring and analysis.
> >>>>>>> All suggested options are meaningless in a sense. Below I'll try
> >>>>>>> explain why.
> >>>>>>>
> >>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on 
> >>>>>>>> the caller node side.
> >>>>>>>> Implemented in [1], commit [2].
> >>>>>>>
> >>>>>>> All cache operations in Ignite are distributed. So each value measured
> >>>>>>> for some cache operation will vary depending on where the operation
> >>>>>>> is actually performed. Values will be the same only for cases when the
> >>>>>>> node is remote relative to the data (e.g. a client node).
> >>>>>>>
> >>>>>>> For regular data node (server node) timing will depend on answers for 
> >>>>>>> question:
> >>>>>>>
> >>>>>>> - is node primary for particular key or not? (for all operations)
> >>>>>>> - how many backups configured for the cache? (for put and remove)
> >>>>>>> - what write synchronization mode is configured for particular cache?
> >>>>>>> (for put and remove)
> >>>>>>> - is readFromBackup enabled for the cache? (for get)
> >>>>>>>
> >>>>>>> Both Ignite users and Ignite developers can't make any decision based
> >>>>>>> on this metrics.
> >>>>>>>
> >>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on 
> >>>>>>>> the caller node side [3].
> >>>>>>>
> >>>>>>> What is transaction commit or rollback time? How it calculates in
> >>>>>>> Ignite now? What actions included into transaction? What actions not
> >>>>>>> related with cache executed during transactions?
> >>>>>>>
> >>>>>>> There is no sense in the time of a transaction commit or rollback
> >>>>>>> because there is no way to understand which transaction was
> >>>>>>> performed in a particular period of time. Usually there are a lot of
> >>>>>>> transactions and we can't distinguish them from each other.
> >>>>>>>
> >>>>>>> Moreover, transaction usually treats as business operation. So only
> >>>>>>> way to measure performance properly is measure business operation
> >>>>>>> time. That is user should create own metrics set for some business
> >>>>>>> API.
> >>>>>>>
> >>>>>>> Further. What about cross cache transactions? At the moment tx
> >>>>>>> commit/rollback time will be added to corresponding metrics per each
> >>>>>>> cache involved in the transaction. The *same time* for *each cache*.
> >>>>>>> Absolutely meaningless.
> >>>>>>>
> >>>>>>> Again, both Ignite users and Ignite developers can't make any decision
> >>>>>>> based on this metrics. But users can create own metrics set.
> >>>>>>>
> >>>>>>>> * histograms that measure the time of processing `get`, `put`, 
> >>>>>>>> `remove`, `commit`, `rollback` messages on affinity nodes(primary 
> >>>>>>>> and backups).
> >>>>>>>> Ticket doesn't exist for it.
> >>>>>>>
> >>>>>>> It will be implemented for most types of messages.
> >>>>>>>
> >>>>>>> Metrics, application monitoring, performance analysis and measurement
> >>>>>>> are a little harder than they sound. Therefore, we must approach this
> >>>>>>> issue more carefully.
> >>>>>>> Blindly adding new types of metrics will not only not improve the
> >>>>>>> situation, but will also worsen the overall performance of the system
> >>>>>>> because metric calculation always on the hot path.
> >>>>>>>
> >>>>>>> So, from my point of view, commits for get/put/remove and
> >>>>>>> commit/rollback should be reverted.
> >>>>>>>
> >>>>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev 
> >>>>>>> <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> I think these metrics are useful.
> >>>>>>>>
> >>>>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
> >>>>>>>> Nikolay, could you take a look, please?
> >>>>>>>>
> >>>>>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
> >>>>>>>>>> * histograms that measure the time of processing `get`, `put`, 
> >>>>>>>>>> `remove`, `commit`, `rollback` messages on affinity nodes(primary 
> >>>>>>>>>> and backups). Ticket doesn't exist for it.
> >>>>>>>>
> >>>>>>>> I have filed a ticket for it. [3]
> >>>>>>>>
> >>>>>>>> [1] https://github.com/apache/ignite/pull/7141
> >>>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-12450
> >>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12453
> >>>>>>>>
> >>>>>>>> On Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov 
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> I think they are very useful.
> >>>>>>>>>
> >>>>>>>>> On Mon, Dec 16, 2019 at 10:51, Николай Ижиков <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello, Alexei.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the link to the ticket, labeled it with the IEP-35 
> >>>>>>>>>> label.
> >>>>>>>>>> What do you think about proposed metrics set?
> >>>>>>>>>>
> >>>>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <
> >>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Nikolay,
> >>>>>>>>>>>
> >>>>>>>>>>> What about batch operations?
> >>>>>>>>>>>
> >>>>>>>>>>> For messages processing the ticket does exist and even has an
> >>>>>>>>>>> implementation from before new metrics API times [1]
> >>>>>>>>>>>
> >>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-10418
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Dec 16, 2019 at 10:12, Николай Ижиков <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hello, Igniters.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I want to provide the user answers to the following question: 
> >>>>>>>>>>>> "How cache
> >>>>>>>>>>>> API operations perform?"
> >>>>>>>>>>>> It seems, we need to implements metrics for basic cache API 
> >>>>>>>>>>>> operations
> >>>>>>>>>>>> like get, put, remove for it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think we should provide the following metrics:
> >>>>>>>>>>>>
> >>>>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls 
> >>>>>>>>>>>> on the
> >>>>>>>>>>>> caller node side.
> >>>>>>>>>>>> Implemented in [1], commit [2].
> >>>>>>>>>>>>
> >>>>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls 
> >>>>>>>>>>>> on the
> >>>>>>>>>>>> caller node side [3].
> >>>>>>>>>>>>
> >>>>>>>>>>>> * histograms that measure the time of processing `get`, `put`, 
> >>>>>>>>>>>> `remove`,
> >>>>>>>>>>>> `commit`, `rollback` messages on affinity nodes(primary and 
> >>>>>>>>>>>> backups).
> >>>>>>>>>>>> Ticket doesn't exist for it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What do you think?
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-12219
> >>>>>>>>>>>> [2]
> >>>>>>>>>>>>
> >>>>>>>>>> https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
> >>>>>>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12450
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>>
> >>>>>>>>>>> Best regards,
> >>>>>>>>>>> Alexei Scherbakov
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>> Alexei Scherbakov
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Best wishes,
> >>>>>>>> Amelchev Nikita
> >>>>>>
> >>>>
> >>
>
