Hello, Alex.

Based on our private discussion, I've additionally migrated the `totalExecutionTime` and `totalWaitingTime` counters. Can you review the PR [1]?
[1] https://github.com/apache/ignite/pull/6622

On Mon, 24/06/2019 at 15:14 +0300, Nikolay Izhikov wrote:
> Hello, Alex.
>
> Thanks for the answer.
>
> 1. I actually don't understand your proposal :)
> Can you write it down?
> What numbers should be additionally migrated in this PR?
> Or is it OK for now?
>
> > I think "idle time" is a useful metric
>
> I think the "usefulness" or "uselessness" of a specific metric depends on the
> questions we can answer with it.
> What questions can we ask about an Ignite instance and answer with the
> "idle time" metric?
>
> > About execution and waiting time, it's not the right way to calculate it
> > using a jobs list.
>
> Same question here.
> What questions can we answer with the current numbers?
>
> > Will jobs list contain only active jobs?
>
> All jobs that are scheduled for execution on the node (active + waiting)
> should be in the list.
> I'll try to give more details here to expose my way of thinking about
> metrics and lists:
>
> If you have some issues with the jobs on the node, they can be of 2 kinds:
>
> 1. You are waiting for the results of some job and want to know why it
> doesn't execute.
>
> In this case, you should query the "jobs list" from Ignite.
> You can get answers to:
> * What jobs are currently executing?
> * How long has your job been waiting to be executed?
>
> You can also check the "activeJobs" and "waitingJobs" metric graphs
> to see how the jobs queue changes over time.
> It seems you can predict the start of your job from these numbers.
>
> 2. You want to understand the lifecycle of some finished (failed) job.
>
> In this case, you should analyze the log of the node.
> It should contain information about the time when:
> * the node received the job information
> * the job was added to the queue
> * the job started execution
> * the job finished (failed) execution.
>
> I don't see questions we can't answer from these sources.
> Do we have any?
> How can the numbers from the current GridJobMetrics help with these
> questions?
>
> > But, what if a user doesn't use any
> > external monitoring system and wants to know the health of Ignite instance?
>
> It depends on how we define "health".
> And it's not a trivial question :)
>
> > Do we have any plans to implement some simple aggregator and ship it with
> > Ignite?
>
> I think NO.
> We shouldn't do it.
>
> > Do we have plans to provide some presets for Ignite monitoring for
> > popular monitoring systems?
>
> I think we shouldn't do it,
> because monitoring presets depend heavily on the usage scenario,
> and that can vary heavily for Ignite.
>
>
> On Mon, 24/06/2019 at 12:46 +0300, Alex Plehanov wrote:
> > Hi Nikolay,
> >
> > I think "idle time" is a useful metric, but it can be calculated outside of
> > Ignite using an external monitoring system.
> >
> > About execution and waiting time, it's not the right way to calculate it
> > using a jobs list. Will jobs list contain only active jobs? In this case,
> > you can't calculate these metrics at all, since you don't know the time of
> > finished jobs. If the list will contain all jobs (will it be unlimited?),
> > iterating over this list will be resource consuming. Either way, it's much
> > simpler (and sometimes the only option) for an external monitoring system
> > to just get some scalar metric than to iterate over a list with some
> > condition.
> >
> > About aggregation, yes, in an ideal world aggregation should be done with
> > the external monitoring system. But, what if a user doesn't use any
> > external monitoring system and wants to know the health of Ignite instance?
> > Do we have any plans to implement some simple aggregator and ship it with
> > Ignite? Do we have plans to provide some presets for Ignite monitoring for
> > popular monitoring systems? (These questions are not related to this PR,
> > but are related to the IEP as a whole.)
> >
> > Also, some aggregation metrics ("max", for example) can't be effectively
> > calculated using the external system (you would have to iterate over a jobs
> > list again, and the precision of such a calculation would still be no
> > better than the time between probes).
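
For reference, the scalar-counter approach discussed in this thread could look roughly like the sketch below. It is only an illustration: `JobCounters` and `onJobFinished` are hypothetical names and are not part of Ignite's API or of the PR. The counters are updated once when a job finishes, so a monitoring system can read a single number instead of iterating over a jobs list, and "max" does not depend on the probe interval.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAccumulator;

/** Hypothetical scalar job counters (illustration only, not Ignite's metrics API). */
final class JobCounters {
    /** Sum of execution times of all finished jobs, in milliseconds. */
    private final AtomicLong totalExecutionTime = new AtomicLong();

    /** Sum of queue waiting times of all finished jobs, in milliseconds. */
    private final AtomicLong totalWaitingTime = new AtomicLong();

    /** Longest execution time seen so far, in milliseconds. */
    private final LongAccumulator maxExecutionTime = new LongAccumulator(Long::max, 0L);

    /** Assumed to be called by the job processor exactly once per finished job. */
    void onJobFinished(long waitedMs, long executedMs) {
        totalWaitingTime.addAndGet(waitedMs);
        totalExecutionTime.addAndGet(executedMs);
        maxExecutionTime.accumulate(executedMs);
    }

    long totalExecutionTime() {
        return totalExecutionTime.get();
    }

    long totalWaitingTime() {
        return totalWaitingTime.get();
    }

    long maxExecutionTime() {
        return maxExecutionTime.get();
    }
}
```

An external system that polls `totalExecutionTime()` at two points in time can derive the execution time spent in that interval by subtraction, without knowing anything about individual jobs.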