[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064286#comment-16064286
 ] 

Reza Motamedi commented on AURORA-1939:
---------------------------------------

On second thought, the negative CPU values can simply be caused by a dead child 
process. Let me explain how. First, remember that the CPU time reported by psutil 
is the total (cumulative) CPU time spent running a process.

Suppose that at {{t_0 = 10}} we have the following processes forked inside a 
thermos process.

{noformat}
__ p0
   \_ p1
{noformat}

The total CPU time of the thermos process is calculated as the sum of the CPU 
time of all the processes, i.e., {{Process(p_0).cpu_time + Process(p_1).cpu_time}}. 
For the sake of argument, let's say 1 second in {{p_0}} and 5 seconds in {{p_1}}.
Now imagine that by the time we collect the next sample at {{t_1 = 20}}, 
another 5 seconds were spent in {{p_1}}, and {{p_1}} finishes (dies) before the 
collection. Also, only an extra 1 second was spent by {{p_0}}. The current 
calculation leads to the following reported CPU value.

(sum(new_samples) - sum(old_samples)) / (time difference)
((2) - (1 + 5)) / (20 - 10) = -4/10
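
To make the arithmetic concrete, here is a minimal Python sketch of the calculation as described in this comment. It is not the actual Thermos code: the function name {{naive_cpu_rate}} and the dict-of-samples layout are assumptions made for illustration, and the numbers are the ones from the example (sampling interval {{t_1 - t_0 = 10}}).

```python
def naive_cpu_rate(old_samples, new_samples, elapsed):
    """Difference of the summed CPU times divided by elapsed wall time.

    Hypothetical helper: samples map a process label to its cumulative
    CPU time in seconds.
    """
    return (sum(new_samples.values()) - sum(old_samples.values())) / elapsed


old = {"p_0": 1.0, "p_1": 5.0}  # sample taken at t_0 = 10
new = {"p_0": 2.0}              # sample taken at t_1 = 20; p_1 has already died

rate = naive_cpu_rate(old, new, elapsed=20 - 10)
print(rate)  # -0.4
```

The CPU time accumulated by {{p_1}} is still subtracted as part of the old sum but no longer appears in the new sum, which is what pushes the result below zero.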

A perfect calculation would include, in the new sample, the time spent in the 
dead processes up to the time of their death. What makes sense in practice is to 
discard the old samples of processes that have died during the last time interval.




> Thermos landing (host) page reports incorrect CPU rates when it is busy
> -----------------------------------------------------------------------
>
>                 Key: AURORA-1939
>                 URL: https://issues.apache.org/jira/browse/AURORA-1939
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Reza Motamedi
>            Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos 
> Processes. On a busy machine, I have noticed negative CPU values when 
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly 
> create short-lived children. This suggests that, between the time 
> `process_collector_psutil` looks up the Process children and the time it 
> calculates the CPU time, the pid of the child is actually reused by another, 
> much younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
