Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-25 Thread Alex Plehanov
Hi Nikolay,

Yes, sure. I've left some comments on GitHub.

Mon, 24 Jun 2019 at 19:15, Nikolay Izhikov:

> Hello, Alex.
>
> Based on our private discussion I've additionally migrated
> `totalExecutionTime` and `totalWaitingTime` counters.
> Can you review the PR [1]?
>
> [1] https://github.com/apache/ignite/pull/6622


Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-24 Thread Nikolay Izhikov
Hello, Alex.

Based on our private discussion I've additionally migrated `totalExecutionTime` 
and `totalWaitingTime` counters.
Can you review the PR [1]?

[1] https://github.com/apache/ignite/pull/6622
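
To illustrate why cumulative counters such as `totalExecutionTime` and `totalWaitingTime` are convenient for an external collector, here is a minimal sketch. It is not the code from the PR: the class `JobTimingTotals`, the `finishedJobs` counter and all method names are assumptions made for this example. The collector derives an average for the interval between two probes from the deltas of the totals.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical holder of cumulative job timings; not the Ignite implementation. */
final class JobTimingTotals {
    private final LongAdder totalExecutionTime = new LongAdder();
    private final LongAdder totalWaitingTime = new LongAdder();
    private final LongAdder finishedJobs = new LongAdder(); // assumed companion counter

    /** Called once per finished job with its waiting and execution durations. */
    void onJobFinished(long waitedMs, long executedMs) {
        totalWaitingTime.add(waitedMs);
        totalExecutionTime.add(executedMs);
        finishedJobs.increment();
    }

    long totalExecutionTime() { return totalExecutionTime.sum(); }

    long totalWaitingTime() { return totalWaitingTime.sum(); }

    long finishedJobs() { return finishedJobs.sum(); }

    /** Average per-job time between two probes, computed on the collector side. */
    static double averageBetweenProbes(long prevTotal, long curTotal, long prevJobs, long curJobs) {
        long jobs = curJobs - prevJobs;
        return jobs == 0 ? 0.0 : (double) (curTotal - prevTotal) / jobs;
    }

    public static void main(String[] args) {
        JobTimingTotals m = new JobTimingTotals();
        m.onJobFinished(10, 100);
        m.onJobFinished(30, 300);
        // The first probe is assumed to have seen zero totals.
        System.out.println(averageBetweenProbes(0, m.totalExecutionTime(), 0, m.finishedJobs()));
    }
}
```

The node only adds to counters; any averaging or rate calculation stays in the monitoring system.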





Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-24 Thread Nikolay Izhikov

Hello, Ivan.

> Ignite is a clustered system that almost always
> assumes external monitoring in production use.

+1.

> 1. Are we going to preserve compatibility with the metrics that existed
> before? Or are we going to keep only those that make sense today?

1. Backward compatibility is preserved.
2. Deprecated metrics (and metric APIs) will be removed in Ignite 3.
3. We should decide which numbers make sense and which don't.

> 2. Can we configure which supported metrics are calculated/exposed? Or
> do we calculate/expose everything every time?

1. You can configure a filter for the exposed metrics. Only the required subset
of metrics will be exported.
2. For now, all metrics (not lists!) will be calculated. Please note that
every metric is a simple long (or double) counter.
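
As a rough illustration of filtering at export time, the sketch below keeps every counter calculated and applies a name predicate only when exporting. The class `FilteredExportSketch`, the `export` method and the metric names are hypothetical; this is not the Ignite exporter API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical export step: all metrics are calculated, only a filtered subset is exported. */
final class FilteredExportSketch {
    /** Returns the metrics whose names pass the filter, in their original order. */
    static Map<String, Long> export(Map<String, Long> allMetrics, Predicate<String> nameFilter) {
        Map<String, Long> exported = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : allMetrics.entrySet()) {
            if (nameFilter.test(e.getKey()))
                exported.put(e.getKey(), e.getValue());
        }
        return exported;
    }

    public static void main(String[] args) {
        Map<String, Long> all = new LinkedHashMap<>();
        all.put("jobs.activeJobs", 3L);   // metric names are made up for the example
        all.put("jobs.waitingJobs", 7L);
        all.put("cache.size", 42L);
        System.out.println(export(all, name -> name.startsWith("jobs.")));
    }
}
```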





Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-24 Thread Nikolay Izhikov
Hello, Alex.

Thanks for the answer.

1. Actually, I don't understand your proposal :)
Can you write it down?
Which numbers should additionally be migrated in this PR?
Or is it OK for now?

> I think "idle time" is a useful metric

I think the "usefulness" or "uselessness" of a specific metric depends on the
questions we can answer with it.
What questions about an Ignite instance can we answer with the "idle time"
metric?

> About execution and waiting time, it's not the right way to calculate it
> using a jobs list.

Same question here.

What questions can we answer with the current numbers?

> Will the jobs list contain only active jobs?

All jobs that are scheduled for execution on the node (active + waiting) should
be in the list.
Let me give more details here to explain how I think about metrics and
lists:

If you have issues with jobs on the node, they can be of 2 kinds:
1. You are waiting for the results of some job and want to know why it
doesn't execute.

In this case, you should query the "jobs list" from Ignite.
It can answer:
* Which jobs are currently executing?
* How long has your job been waiting to be executed?

You can also check graphs of the "activeJobs" and "waitingJobs" metrics
to see how the jobs queue changes over time.
It seems you can predict the start of your job from these
numbers.

2. You want to understand the lifecycle of some finished (or failed) job.

In this case, you should analyze the log of the node.
It should contain the times when:
* the node received the job information
* the job was added to the queue
* the job started execution
* the job finished (or failed) execution.

I don't see questions we can't answer from these sources.
Do we have any?
How can the numbers from the current GridJobMetrics help with these questions?
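
The first case above, querying the jobs list, can be sketched roughly as follows. The `JobView` row, its fields and the helper methods are hypothetical and only show the two questions (which jobs are executing, how long a job has been waiting); the real jobs list may expose different columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of a node's jobs list (active + waiting jobs only). */
final class JobsListSketch {
    enum State { WAITING, ACTIVE }

    /** One row of the assumed jobs list. */
    static final class JobView {
        final String id;
        final State state;
        final long queuedAtMs;

        JobView(String id, State state, long queuedAtMs) {
            this.id = id;
            this.state = state;
            this.queuedAtMs = queuedAtMs;
        }
    }

    /** "Which jobs are currently executing?" */
    static List<String> executingJobs(List<JobView> jobs) {
        List<String> ids = new ArrayList<>();
        for (JobView j : jobs) {
            if (j.state == State.ACTIVE)
                ids.add(j.id);
        }
        return ids;
    }

    /** "How long has my job been waiting?" Returns -1 if the job is not in the list or not waiting. */
    static long waitingForMs(List<JobView> jobs, String id, long nowMs) {
        for (JobView j : jobs) {
            if (j.id.equals(id) && j.state == State.WAITING)
                return nowMs - j.queuedAtMs;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<JobView> jobs = Arrays.asList(
            new JobView("a", State.ACTIVE, 1_000L),
            new JobView("b", State.WAITING, 1_500L));
        System.out.println(executingJobs(jobs));
        System.out.println(waitingForMs(jobs, "b", 4_000L));
    }
}
```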


> But, what if a user doesn't use any
> external monitoring system and wants to know the health of an Ignite instance?

It depends on how we define "health".
And it's not a trivial question :)

> Do we have any plans to implement some simple aggregator and ship it with 
> Ignite?

I think NO.
We shouldn't do it.

> Do we have plans to provide some presets for Ignite monitoring for
> popular monitoring systems?

I think we shouldn't do it,
because monitoring presets heavily depend on the usage scenario,
and usage scenarios vary widely for Ignite.






Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-24 Thread Павлухин Иван
Hi Nikolay, Alex,

A couple of humble comments from me:
> Aggregation should be done with the metrics collection system (Prometheus,
> Graphite, etc.).
I like that statement very much!

> But, what if a user doesn't use any external monitoring system and wants to
> know the health of an Ignite instance?
I think that we can add more capabilities if real user demand
appears in the future. Generally, Ignite is a clustered system that almost
always assumes external monitoring in production use.

And a couple of general questions regarding monitoring. If they are
answered in the IEP, you can simply redirect me there.
1. Are we going to preserve compatibility with the metrics that existed
before? Or are we going to keep only those that make sense today?
2. Can we configure which supported metrics are calculated/exposed? Or
do we calculate/expose everything every time?




-- 
Best regards,
Ivan Pavlukhin


Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-24 Thread Alex Plehanov
Hi Nikolay,

I think "idle time" is a useful metric, but it can be calculated outside of
Ignite using an external monitoring system.
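
One way an external system could approximate idle time, sketched under assumptions: the collector samples an `activeJobs` gauge at a fixed interval and counts an interval as idle when the gauge is zero at both of its ends. The class and method names are hypothetical, and the result is only as precise as the probe interval.

```java
/** Hypothetical collector-side approximation of node idle time from gauge samples. */
final class IdleTimeSketch {
    /**
     * @param activeJobsSamples values of the activeJobs gauge, one per probe
     * @param intervalMs        assumed fixed time between probes
     * @return approximate idle time in milliseconds
     */
    static long approxIdleMs(int[] activeJobsSamples, long intervalMs) {
        long idle = 0;
        for (int i = 1; i < activeJobsSamples.length; i++) {
            // An interval counts as idle only if no jobs were seen at both ends.
            if (activeJobsSamples[i - 1] == 0 && activeJobsSamples[i] == 0)
                idle += intervalMs;
        }
        return idle;
    }

    public static void main(String[] args) {
        System.out.println(approxIdleMs(new int[] {0, 0, 2, 1, 0, 0, 0}, 1_000L));
    }
}
```

Jobs that start and finish entirely between two probes are invisible to this approach, so it tends to overestimate idle time.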

As for execution and waiting time, calculating them from a jobs list is not
the right way. Will the jobs list contain only active jobs? In that case,
you can't calculate these metrics at all, since you don't know the times of
finished jobs. If the list contains all jobs (will it be unbounded?),
iterating over it will be resource-consuming. Either way, it's much
simpler (and sometimes the only option) for an external monitoring system to
just read a scalar metric than to iterate over a list with some condition.

About aggregation: yes, in an ideal world aggregation should be done by
the external monitoring system. But what if a user doesn't use any
external monitoring system and wants to know the health of an Ignite instance?
Do we have any plans to implement some simple aggregator and ship it with
Ignite? Do we have plans to provide some presets for Ignite monitoring for
popular monitoring systems? (These questions are not related to this PR, but
to the IEP as a whole.)

Also, some aggregate metrics ("max", for example) can't be calculated
effectively by an external system (you would have to iterate over a jobs list
again, and even then the precision of the calculation would be no better than
the time between probes).
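
The point about "max" can be shown with a small sketch. A maximum tracked inside the node sees every finished job, while a maximum estimated from periodic probes of running jobs is off by up to one probe interval. Everything here (the class, `onJobFinished`, `sampledMax`, the job model as start/end pairs) is an assumption for illustration, not Ignite code.

```java
import java.util.concurrent.atomic.LongAccumulator;

/** Hypothetical comparison of an in-node max metric with an externally sampled one. */
final class MaxExecutionTimeSketch {
    private final LongAccumulator maxMs = new LongAccumulator(Long::max, 0L);

    /** In-node tracking: updated once per finished job, so no job is missed. */
    void onJobFinished(long executedMs) {
        maxMs.accumulate(executedMs);
    }

    long max() {
        return maxMs.get();
    }

    /**
     * External estimate: at each probe time, look at the jobs running at that moment
     * and take the longest elapsed time seen so far.
     *
     * @param jobs       pairs of {startMs, endMs}
     * @param probeTimes times at which the external system polls the jobs list
     */
    static long sampledMax(long[][] jobs, long[] probeTimes) {
        long best = 0;
        for (long t : probeTimes) {
            for (long[] j : jobs) {
                if (j[0] <= t && t < j[1])
                    best = Math.max(best, t - j[0]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MaxExecutionTimeSketch m = new MaxExecutionTimeSketch();
        m.onJobFinished(95);
        m.onJobFinished(20);
        long sampled = sampledMax(new long[][] {{0, 95}, {10, 30}}, new long[] {0, 50, 100});
        System.out.println("in-node max = " + m.max() + ", sampled max = " + sampled);
    }
}
```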