[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

Reza Motamedi (JIRA) Mon, 26 Jun 2017 22:43:59 -0700

    [ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064306#comment-16064306
 ]


Reza Motamedi commented on AURORA-1939:
---------------------------------------

The negative CPU values can be caused by a dead child process whose pid is 
reused by a new descendant of the same parent. Let me explain how. First, 
remember that the CPU time reported by psutil is the cumulative CPU time a 
process has consumed since it started.

Suppose that at {{t_0 = 10}} we have the following process tree forked inside 
a thermos process.

{noformat}
__ p0
   \_ p1
{noformat}

The total CPU time of the thermos process is calculated as the sum of the CPU 
time of all its processes, i.e., {{Process(p_0).cpu_time + 
Process(p_1).cpu_time}}. For the sake of argument, let's say 1 second in 
{{p_0}} and 5 seconds in {{p_1}}.
Now imagine that by the time the next sample is collected at {{t_1 = 20}}, 
{{p_1}} has died. However, {{p_0}} has forked a new child and, by coincidence, 
the same **pid** is assigned to that child. Let's call this new child 
{{p_1'}}. Assume an additional 1 second is spent in {{p_0}} and 1 second is 
spent in {{p_1'}}. The current calculation leads to the following reported CPU 
value.

(new_sample - old_sample) / (time difference)
((2 + 1) - (1 + 5)) / 10 = -3/10
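
The arithmetic above can be reproduced with a small simulation. This is only a sketch and not the actual Thermos code: the function name {{cpu_rate_by_pid}}, the dict-based samples, and the pids 100 and 101 are assumptions made for illustration.

```python
def cpu_rate_by_pid(old_sample, new_sample, elapsed):
    """CPU rate when processes are matched across samples by pid alone.

    Hypothetical helper: each sample maps pid -> cumulative CPU seconds.
    """
    # Old processes whose pid is missing from the new sample are discarded,
    # but a reused pid is wrongly treated as the same process.
    old_total = sum(t for pid, t in old_sample.items() if pid in new_sample)
    new_total = sum(new_sample.values())
    return (new_total - old_total) / elapsed


# Sample at t_0 = 10: p_0 (pid 100) has used 1s, p_1 (pid 101) has used 5s.
old = {100: 1.0, 101: 5.0}
# Sample at t_1 = 20: p_0 has used 2s; p_1' reuses pid 101 and has used 1s.
new = {100: 2.0, 101: 1.0}

rate = cpu_rate_by_pid(old, new, elapsed=10.0)
print(rate)  # -0.3
```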

While the current calculation discards old processes that died during the last 
sampling interval, it matches processes across samples by pid alone, so 
{{p_1'}} is mistaken for {{p_1}}. If we correctly distinguish {{p_1}} from 
{{p_1'}}, the math leads to:

(new_sample - old_sample) / (time difference)
((2 + 1) - 1) / 10 = 2/10
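
One possible way to tell {{p_1}} and {{p_1'}} apart is to key each process by its pid together with its creation time, which psutil exposes as {{Process.create_time()}}. The sketch below simulates that idea with plain dicts; the function name {{cpu_rate_by_identity}}, the sample layout, and the pids and timestamps are assumptions for illustration, not the fix that was actually applied.

```python
def cpu_rate_by_identity(old_sample, new_sample, elapsed):
    """CPU rate when processes are matched by (pid, create_time).

    Hypothetical helper: each sample maps (pid, create_time) to
    cumulative CPU seconds.
    """
    # A reused pid has a different create_time, so the dead process is
    # dropped from the old total instead of being confused with the new one.
    old_total = sum(t for key, t in old_sample.items() if key in new_sample)
    new_total = sum(new_sample.values())
    return (new_total - old_total) / elapsed


# Sample at t_0 = 10: p_0 has used 1s, p_1 has used 5s.
old = {(100, 1000.0): 1.0, (101, 1001.0): 5.0}
# Sample at t_1 = 20: p_1' reuses pid 101 but was created later.
new = {(100, 1000.0): 2.0, (101, 1015.0): 1.0}

rate = cpu_rate_by_identity(old, new, elapsed=10.0)
print(rate)  # 0.2
```

With this keying, the 5 seconds accumulated by the dead {{p_1}} no longer enter the difference, which matches the 2/10 result above.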

---
I'm going to confirm that reused pids are indeed the reason behind the 
negative CPU times.




> Thermos landing (host) page reports incorrect CPU rates when it is busy
> -----------------------------------------------------------------------
>
>                 Key: AURORA-1939
>                 URL: https://issues.apache.org/jira/browse/AURORA-1939
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Reza Motamedi
>            Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short-lived children. This indicates that between the time 
> `process_collector_psutil` looks up the children of a Process and the time 
> it calculates the CPU time, the pid of a child is reused by another, much 
> younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
