Re: [IEP-35] GridJobProcessorMetrics migration

Nikolay Izhikov Mon, 24 Jun 2019 05:12:05 -0700

Hello, Alex.

Thanks for the answer.


1. I don't actually understand your proposal :)
Can you write it down?
Which numbers should additionally be migrated in this PR?
Or is it OK for now?

> I think "idle time" is a useful metric

I think the "usefulness" or "uselessness" of a specific metric depends on the
questions we can answer with it.
What questions can we ask about an Ignite instance and answer with the
"idle time" metric?

> About execution and waiting time, it's not the right way to calculate it
> using a jobs list.

Same question here.

What questions can we answer with the current numbers?

> Will jobs list contain only active jobs?

All jobs that are scheduled for execution on the node (active + waiting)
should be in the list.
I'll try to give more details here, to show how I think about metrics and
lists:

If you have issues with the jobs on a node, there can be 2 kinds of issues:
        1. You are waiting for the results of some job and want to know why it
           doesn't execute.

                In this case, you should query the "jobs list" from Ignite.
                It can answer:
                        * Which jobs are currently executing?
                        * How long has your job been waiting to be executed?

                You can also check the graphs of the "activeJobs" and
                "waitingJobs" metrics to see how the job queue changes over
                time.
                It seems you can predict the start of your job from these
                numbers.

        2. You want to understand the lifecycle of some finished (failed) job.

                In this case, you should analyze the log of the node.
                It should contain the time when:
                        * the node received the job information
                        * the job was added to the queue
                        * the job started execution
                        * the job finished (failed) execution.
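
To make case 1 concrete, here is a minimal sketch of the two questions asked
against a jobs list. It does not use any Ignite API: the JobRow type, its
fields and the State enum are hypothetical stand-ins for whatever the list
would actually expose.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

final class JobsListSketch {
    /** Hypothetical job state; not an Ignite type. */
    enum State { WAITING, ACTIVE }

    /** Hypothetical row of the "jobs list"; the field names are assumptions. */
    static final class JobRow {
        final String id;
        final State state;
        final Instant queuedAt;

        JobRow(String id, State state, Instant queuedAt) {
            this.id = id;
            this.state = state;
            this.queuedAt = queuedAt;
        }
    }

    /** "Which jobs are currently executing?" */
    static List<String> executing(List<JobRow> jobs) {
        return jobs.stream()
            .filter(j -> j.state == State.ACTIVE)
            .map(j -> j.id)
            .collect(Collectors.toList());
    }

    /** "How long has my job been waiting?" Empty if the job is not waiting. */
    static Optional<Duration> waitingFor(List<JobRow> jobs, String id, Instant now) {
        return jobs.stream()
            .filter(j -> j.id.equals(id) && j.state == State.WAITING)
            .map(j -> Duration.between(j.queuedAt, now))
            .findFirst();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2019-06-24T12:00:00Z");
        List<JobRow> jobs = Arrays.asList(
            new JobRow("job-a", State.ACTIVE, now.minusSeconds(30)),
            new JobRow("job-b", State.WAITING, now.minusSeconds(12)));

        System.out.println("Executing: " + executing(jobs));
        System.out.println("job-b waiting: " + waitingFor(jobs, "job-b", now));
    }
}
```

Both questions are simple filters over the list, which is why I think the list
covers this case.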

I don't see any questions that can't be answered from these sources.
Do we have any?
How can the numbers from the current GridJobMetrics help with these questions?


> But, what if a user doesn't use any
> external monitoring system and wants to know the health of Ignite instance?

It depends on how we define "health".
And it's not a trivial question :)

> Do we have any plans to implement some simple aggregator and ship it with 
> Ignite?

I think NO.
We shouldn't do it.

> Do we have plans to provide some presets for Ignite monitoring for
> popular monitoring systems?

I think we shouldn't do it,
because monitoring presets depend heavily on the usage scenario,
and usage scenarios can vary widely for Ignite.


On Mon, 24/06/2019 at 12:46 +0300, Alex Plehanov wrote:
> Hi Nikolay,
> 
> I think "idle time" is a useful metric, but it can be calculated outside of
> Ignite using external monitoring system.
> 
> About execution and waiting time, it's not the right way to calculate it
> using a jobs list. Will jobs list contain only active jobs? In this case,
> you can't calculate these metrics at all, since you don't know the time of
> finished jobs. If the list will contain all jobs (will it be unlimited?),
> iterating over this list will be resource consuming. In any way, it's much
> simpler (and sometimes only possible) for an external monitoring system to
> just get some scalar metric than iterate over a list with some condition.
> 
> About aggregation, yes, in an ideal world aggregation should be done with
> the external monitoring system. But, what if a user doesn't use any
> external monitoring system and wants to know the health of Ignite instance?
> Do we have any plans to implement some simple aggregator and ship it with
> Ignite? Do we have plans to provide some presets for Ignite monitoring for
> popular monitoring systems? (These questions are not related to this PR,
> but to the IEP as a whole.)
> 
> Also, some aggregation metrics ("max" for example) can't be effectively
> calculated using the external system (you should iterate over a jobs list
> again and still precision of such calculation will be no more than the time
> between probes).
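
The point about "max" in the quoted text can be illustrated with a small
simulation. This is only a sketch under stated assumptions: the Job type, the
timings and the probe period are made up, and no Ignite API is involved. An
external poller that probes the active jobs every periodMs sees only the
elapsed time of jobs running at the probe instant, so it misses short jobs
entirely and underestimates long ones by up to one period.

```java
import java.util.Arrays;
import java.util.List;

final class MaxSamplingSketch {
    /** Hypothetical finished job with start and end timestamps in milliseconds. */
    static final class Job {
        final long startMs;
        final long endMs;

        Job(long startMs, long endMs) {
            this.startMs = startMs;
            this.endMs = endMs;
        }
    }

    /** Exact max execution time, as the node could track it when a job completes. */
    static long exactMax(List<Job> jobs) {
        long max = 0;
        for (Job j : jobs)
            max = Math.max(max, j.endMs - j.startMs);
        return max;
    }

    /** Max elapsed time seen by a poller probing the active jobs every periodMs. */
    static long sampledMax(List<Job> jobs, long periodMs, long horizonMs) {
        long max = 0;
        for (long t = 0; t <= horizonMs; t += periodMs)
            for (Job j : jobs)
                if (j.startMs <= t && t < j.endMs) // job is active at probe time t
                    max = Math.max(max, t - j.startMs);
        return max;
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(
            new Job(100, 900),    // 800 ms, falls between two probes
            new Job(1200, 4700)); // 3500 ms

        System.out.println("exact max   = " + exactMax(jobs));               // 3500
        System.out.println("sampled max = " + sampledMax(jobs, 1000, 5000)); // 2800
    }
}
```

With probes at 0, 1000, ..., 5000 ms the first job is never observed, and the
second one is last seen at 4000 ms with 2800 ms elapsed, so the sampled value
is 700 ms below the exact one.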
