[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064306#comment-16064306 ]
Reza Motamedi commented on AURORA-1939:
---------------------------------------

The negative CPU values can be caused by a dead child process whose pid is reused by a new descendant of the same parent. Let me explain how.

First, remember that the CPU time reported by psutil is the total CPU time spent running a process. Suppose that at {{t_0 = 10}} we have the following process tree forked inside a thermos process.

{noformat}
__ p0
   \_ p1
{noformat}

The total CPU time of the thermos process is calculated as the sum of the CPU time of all its processes, i.e., {{Process(p_0).cpu_time + Process(p_1).cpu_time}}. For the sake of argument, let's say 1 second in {{p_0}} and 5 seconds in {{p_1}}.

Now imagine that by the time the next sample is collected at {{t_1 = 20}}, {{p_1}} has died. However, {{p_0}} has forked a new child and, by chance, the same **pid** is assigned to that child. Let's call this new child {{p_1'}}. Assume an additional 1 second is spent in {{p_0}} and 1 second is spent in {{p_1'}}. The current calculation leads to the following reported CPU value:

(new_sample - old_sample) / (time difference)
((2 + 1) - (1 + 5)) / 10 = -3/10

While the current calculation discards old processes that died during the last time interval, it does the lookup only by pid. If we correctly distinguish {{p_1}} from {{p_1'}}, the math leads to:

(new_sample - old_sample) / (time difference)
((2 + 1) - 1) / 10 = 2/10

---
I'm going to confirm that the reason behind the negative CPU times is reused pids.

> Thermos landing (host) page reports incorrect CPU rates when it is busy
> -----------------------------------------------------------------------
>
>                 Key: AURORA-1939
>                 URL: https://issues.apache.org/jira/browse/AURORA-1939
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Reza Motamedi
>            Priority: Minor
>
> Thermos Observer uses `psutil` to monitor resource consumption of Thermos
> Processes. On a busy machine, I have noticed negative CPU values when
> visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly
> create short-lived children. This indicates that between the time
> `process_collector_psutil` looks up the Process children and the time it
> calculates the CPU time, the pid of a child is reused by another, much
> younger process, which leads to negative CPU times.
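
For illustration, the arithmetic in the comment above can be reproduced with a small sketch. This is not the actual `process_collector_psutil` code: the function `cpu_rate`, the pids 100 and 101, and the creation timestamps are all hypothetical. It assumes each sample is a mapping from a process identity to cumulative CPU seconds, and that old entries whose identity is missing from the new sample are discarded. Keying by pid alone gives the negative rate; keying by (pid, creation time), which psutil exposes through `Process.create_time()`, is one possible way to tell {{p_1}} and {{p_1'}} apart.

```python
from typing import Dict, Hashable


def cpu_rate(old: Dict[Hashable, float],
             new: Dict[Hashable, float],
             elapsed: float) -> float:
    """CPU rate over one sampling interval.

    Old entries whose key is absent from the new sample are treated as
    dead and dropped from the baseline, as described in the comment.
    """
    old_total = sum(cpu for key, cpu in old.items() if key in new)
    new_total = sum(new.values())
    return (new_total - old_total) / elapsed


# Hypothetical samples keyed by pid only: pid 101 is p_1 at t_0 and
# the unrelated p_1' at t_1, so the stale 5 seconds stay in the baseline.
old_by_pid = {100: 1.0, 101: 5.0}
new_by_pid = {100: 2.0, 101: 1.0}
print(cpu_rate(old_by_pid, new_by_pid, 10.0))  # -0.3

# Same samples keyed by (pid, creation time): p_1 and p_1' now differ,
# so p_1 is dropped from the baseline as a dead process.
old_by_identity = {(100, 0.0): 1.0, (101, 2.0): 5.0}
new_by_identity = {(100, 0.0): 2.0, (101, 15.0): 1.0}
print(cpu_rate(old_by_identity, new_by_identity, 10.0))  # 0.2
```

With the identity-based key the rate matches the 2/10 worked out by hand; whether creation time is the right discriminator for the Thermos Observer is an assumption still to be confirmed.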