Re: Cache operations performance metrics

Николай Ижиков Fri, 20 Dec 2019 02:44:22 -0800

Hello, Andrey.

> What is the sense of this value? I explained why these metrics are relatively
> useless.


I don’t agree with you.
I believe they are not useless for a user.
And I will try to explain why I think so.

> But the user can't distinguish one transaction from another, so his knowledge
> definitely doesn't make sense.

Users shouldn’t have to distinguish them.
If some percentage of the transactions on a cache are relatively slow, this is a
trigger for a deeper investigation.
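
To illustrate that trigger, here is a rough sketch of a bucketed latency histogram and the "share of slow operations" check. It is not the Ignite metrics API: the class name LatencyHistogram, the bucket bounds (1, 10, 100 ms) and the 10% threshold are all made up for the example.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical fixed-bucket latency histogram; not an Ignite class. */
class LatencyHistogram {
    /** Upper bounds of the buckets in milliseconds, ascending. */
    private final long[] bounds;

    /** One counter per bound plus a final overflow bucket. */
    private final AtomicLongArray counts;

    LatencyHistogram(long... bounds) {
        this.bounds = bounds.clone();
        this.counts = new AtomicLongArray(bounds.length + 1);
    }

    /** Records one operation that took the given number of milliseconds. */
    void record(long durationMs) {
        int i = 0;

        while (i < bounds.length && durationMs > bounds[i])
            i++;

        counts.incrementAndGet(i);
    }

    /** Fraction of recorded operations that took longer than the given bucket bound. */
    double shareSlowerThan(long boundMs) {
        long total = 0;
        long slow = 0;

        for (int i = 0; i < counts.length(); i++) {
            long c = counts.get(i);

            total += c;

            // Bucket i holds values in (bounds[i-1], bounds[i]]; it counts as slow
            // when its lower edge is at or above the requested bound.
            if (i > 0 && bounds[i - 1] >= boundMs)
                slow += c;
        }

        return total == 0 ? 0.0 : (double)slow / total;
    }

    public static void main(String[] args) {
        LatencyHistogram h = new LatencyHistogram(1, 10, 100);

        for (long d : new long[] {0, 1, 3, 7, 8, 9, 50, 250})
            h.record(d);

        // Two of the eight recorded operations took longer than 10 ms.
        if (h.shareSlowerThan(10) > 0.10)
            System.out.println("slow share " + h.shareSlowerThan(10) + ": investigate");
    }
}
```

The histogram does not say which transaction was slow. It only shows that a noticeable share of operations on this cache is slow, and that is the signal to start a deeper investigation.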

> 1. Measure some important internals (WAL operations, checkpoint time, etc.)
> because they can point to real problems.

We have already implemented it.
What metrics are missing for internal processes?

> 2. Measure business operations in user context, not cache API operations.

Why do you think these approaches should exclude one another?
Users definitely should measure whole business transaction performance.

I think we should provide a way to measure the part of the business transaction
that relates to Ignite.
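
A minimal sketch of how the two measurements can coexist, assuming a user-side helper that times a named block of code. OperationTimers, the operation names and the ConcurrentHashMap standing in for a cache are all hypothetical; nothing here is an Ignite API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical helper that accumulates elapsed time per named operation; not an Ignite class. */
class OperationTimers {
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

    private final Map<String, LongAdder> calls = new ConcurrentHashMap<>();

    /** Runs the operation and adds its elapsed time to the counters for the given name. */
    <T> T timed(String name, Supplier<T> op) {
        long start = System.nanoTime();

        try {
            return op.get();
        }
        finally {
            totalNanos.computeIfAbsent(name, k -> new LongAdder()).add(System.nanoTime() - start);
            calls.computeIfAbsent(name, k -> new LongAdder()).increment();
        }
    }

    long totalNanos(String name) {
        LongAdder a = totalNanos.get(name);

        return a == null ? 0 : a.sum();
    }

    long calls(String name) {
        LongAdder a = calls.get(name);

        return a == null ? 0 : a.sum();
    }

    public static void main(String[] args) {
        OperationTimers timers = new OperationTimers();

        // Stand-in for a cache; a real application would call its cache API here.
        Map<Integer, String> cache = new ConcurrentHashMap<>();
        cache.put(1, "order-1");

        // The outer timer covers the whole business operation,
        // the inner one only the part that touches the cache.
        String res = timers.timed("placeOrder", () -> {
            String order = timers.timed("placeOrder.cache", () -> cache.get(1));

            return order.toUpperCase();
        });

        System.out.println(res + ": total " + timers.totalNanos("placeOrder")
            + " ns, cache part " + timers.totalNanos("placeOrder.cache") + " ns");
    }
}
```

The outer number is the business metric the user owns. The inner number is the part that cache operation metrics could provide out of the box.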


> On 20 Dec 2019, at 13:02, Andrey Gura <ag...@apache.org> wrote:
> 
>> The goal of the proposed metrics is to measure whole cache operations 
>> behavior.
>> It provides some kind of statistics (histograms) for it.
> 
> Nikolay, reformulating doesn't make metrics more meaningful. Seriously :)
> 
>> Yes, metrics will evaluate API call performance
> 
> And what? What is the sense of this value? I explained why these metrics
> are relatively useless.
> 
>> These are metrics of client-side operation performance.
> 
> Again. It's just a number without any sense.
> 
>> I think a specific user knows what his transactions are.
> 
> Maybe. But the user can't distinguish one transaction from another, so
> his knowledge definitely doesn't make sense.
> 
>> From these metrics he can answer the question «If my transaction includes
>> cacheXXX, how long does it usually take?»
> 
> Actually not. The same caches can be involved in a dozen
> transactions and there is no way to understand which transactions are
> slow or fast. It is useless.
> 
>> I disagree here.
>> If you have a better approach to measure cache operations performance - 
>> please, share your vision.
> 
> I already wrote about a better approach. Two main points:
> 
> 1. Measure some important internals (WAL operations, checkpoint time,
> etc.) because they can point to real problems.
> 2. Measure business operations in user context, not cache API operations.
> 
> So what do we have? We have useless metrics that are doubled by useless
> histograms.
> 
> We should reconsider the approach to metrics and performance measuring. It
> is a hard and long task. There is no need to commit tons of useless
> metrics that just decrease performance.
> 
> Sorry for some sarcasm but I really believe in my opinion. The metrics
> problem has existed for a very long time and the existing metrics have
> been discussed many times. No one can explain these metrics to users
> because that requires too much additional knowledge about internals.
> And a metric value itself depends on many aspects of the internals.
> This makes interpretation impossible. And it's a good time to remove
> them (in AI 3.0 due to backward compatibility).
> 
> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <nizhikov....@gmail.com> wrote:
>> 
>> Hello, Andrey.
>> 
>> The goal of the proposed metrics is to measure whole cache operations 
>> behavior.
>> It provides some kind of statistics (histograms) for it.
>> For more fine-grained analysis one would use tracing or other «go deeper»
>> tools.
>> 
>>>> Measured for API calls on the caller node side
>>> Values will be the same only for cases when the node is remote relative to the data
>> 
>> Yes, metrics will evaluate API call performance.
>> I think this is the most valuable information from a user's point of view.
>> 
>> A regular user wants to know how fast his cache operations perform.
>> And these metrics provide the answer.
>> 
>>> For a regular data node (server node), timing will depend on the answers
>>> to these questions:
>> 
>> I think these answers are always available.
>> I can barely imagine a scenario where one monitors a «black box» cluster and
>> doesn’t know them.
>> Even so, all answers are provided through the system views we brought to
>> Ignite :)
>> 
>>> What is transaction commit or rollback time?
>> 
>> These are metrics of client-side operation performance.
>> 
>> I think a specific user knows what his transactions are.
>> From these metrics he can answer the question «If my transaction includes
>> cacheXXX, how long does it usually take?»
>> I think it’s very valuable knowledge.
>> 
>>> It will be implemented for most types of messages.
>> 
>> Good, let’s do it?
>> 
>>> So, from my point of view, commits for get/put/remove and commit/rollback 
>>> should be reverted.
>> 
>> I disagree here.
>> If you have a better approach to measure cache operations performance - 
>> please, share your vision.
>> 
>>> On 19 Dec 2019, at 16:03, Andrey Gura <ag...@apache.org> wrote:
>>> 
>>> From my point of view, Ignite should provide meaningful metrics for
>>> internal components that could be useful for monitoring and analysis.
>>> All suggested options are meaningless in a sense. Below I'll try
>>> to explain why.
>>> 
>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the 
>>>> caller node side.
>>>>  Implemented in [1], commit [2].
>>> 
>>> All cache operations in Ignite are distributed. So each value measured
>>> for some cache operation will vary depending on where the operation
>>> is actually performed. Values will be the same only for cases when the
>>> node is remote relative to the data (e.g. a client node).
>>> 
>>> For a regular data node (server node), timing will depend on the answers
>>> to these questions:
>>> 
>>> - is node primary for particular key or not? (for all operations)
>>> - how many backups configured for the cache? (for put and remove)
>>> - what write synchronization mode is configured for particular cache?
>>> (for put and remove)
>>> - is readFromBackup enabled for the cache? (for get)
>>> 
>>> Both Ignite users and Ignite developers can't make any decision based
>>> on these metrics.
>>> 
>>>> * `commit`, `rollback` time histograms. Measured for API calls on the 
>>>> caller node side [3].
>>> 
>>> What is transaction commit or rollback time? How is it calculated in
>>> Ignite now? What actions are included in a transaction? What actions
>>> not related to the cache are executed during transactions?
>>> 
>>> There is no sense in the time of a transaction commit or rollback
>>> because there is no way to understand which transaction was
>>> performed in a particular period of time. Usually there are a lot of
>>> transactions and we can't distinguish them from each other.
>>> 
>>> Moreover, a transaction is usually treated as a business operation. So
>>> the only way to measure performance properly is to measure the business
>>> operation time. That is, the user should create his own metrics set for
>>> some business API.
>>> 
>>> Further. What about cross-cache transactions? At the moment the tx
>>> commit/rollback time will be added to the corresponding metrics of each
>>> cache involved in the transaction. The *same time* for *each cache*.
>>> Absolutely meaningless.
>>> 
>>> Again, both Ignite users and Ignite developers can't make any decision
>>> based on these metrics. But users can create their own metrics set.
>>> 
>>>> * histograms that measure the time of processing `get`, `put`, `remove`, 
>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
>>>>  Ticket doesn't exist for it.
>>> 
>>> It will be implemented for most types of messages.
>>> 
>>> Metrics, application monitoring, performance analysis and measurement
>>> are a little harder than they sound. Therefore, we must approach this
>>> issue more carefully.
>>> Blindly adding new types of metrics will not only fail to improve the
>>> situation, but will also worsen the overall performance of the system
>>> because metric calculation is always on the hot path.
>>> 
>>> So, from my point of view, commits for get/put/remove and
>>> commit/rollback should be reverted.
>>> 
>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <nsamelc...@gmail.com> 
>>> wrote:
>>>> 
>>>> I think these metrics are useful.
>>>> 
>>>> I have prepared PR [1] for commit and rollback histograms. [2]
>>>> Nikolay, could you take a look, please?
>>>> 
>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, 
>>>>>> `commit`, `rollback` messages on affinity nodes(primary and backups). 
>>>>>> Ticket doesn't exist for it.
>>>> 
>>>> I have filed a ticket for it. [3]
>>>> 
>>>> [1] https://github.com/apache/ignite/pull/7141
>>>> [2] https://issues.apache.org/jira/browse/IGNITE-12450
>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12453
>>>> 
>>>> On Mon, 16 Dec 2019 at 11:07, Alexei Scherbakov
>>>> <alexey.scherbak...@gmail.com> wrote:
>>>>> 
>>>>> I think they are very useful.
>>>>> 
>>>>> On Mon, 16 Dec 2019 at 10:51, Николай Ижиков <nizhi...@apache.org> wrote:
>>>>> 
>>>>>> Hello, Alexei.
>>>>>> 
>>>>>> Thanks for the link to the ticket, I labeled it with the IEP-35 label.
>>>>>> What do you think about proposed metrics set?
>>>>>> 
>>>>>>> On 16 Dec 2019, at 10:29, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Nikolay,
>>>>>>> 
>>>>>>> What about batch operations?
>>>>>>> 
>>>>>>> For messages processing the ticket does exist and even has an
>>>>>>> implementation from before new metrics API times [1]
>>>>>>> 
>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-10418
>>>>>>> 
>>>>>>> On Mon, 16 Dec 2019 at 10:12, Николай Ижиков <nizhi...@apache.org> wrote:
>>>>>>> 
>>>>>>>> Hello, Igniters.
>>>>>>>> 
>>>>>>>> I want to give the user an answer to the following question: "How do
>>>>>>>> cache API operations perform?"
>>>>>>>> It seems we need to implement metrics for basic cache API operations
>>>>>>>> like get, put, remove for it.
>>>>>>>> 
>>>>>>>> I think we should provide the following metrics:
>>>>>>>> 
>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
>>>>>>>> caller node side.
>>>>>>>>  Implemented in [1], commit [2].
>>>>>>>> 
>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
>>>>>>>> caller node side [3].
>>>>>>>> 
>>>>>>>> * histograms that measure the time of processing `get`, `put`, 
>>>>>>>> `remove`,
>>>>>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
>>>>>>>>  Ticket doesn't exist for it.
>>>>>>>> 
>>>>>>>> What do you think?
>>>>>>>> 
>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-12219
>>>>>>>> [2] https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
>>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12450
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Alexei Scherbakov
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> 
>>>>> Best regards,
>>>>> Alexei Scherbakov
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Best wishes,
>>>> Amelchev Nikita
>>
