[ https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093763#comment-16093763 ]
Reza Motamedi edited comment on AURORA-1939 at 7/19/17 8:56 PM: ---------------------------------------------------------------- I added more debug logs to trace what is kept in the process tree: {noformat} D0719 20:23:05.758181 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 0.000365496747142 (old: 885.570000, new: 885.670000) {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=315.39, system=570.16, rss=36704256, vms=1601019904, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=314.96, system=570.69, rss=36134912, vms=1601019904, nice=0, status='sleeping', threads=20)}] D0719 20:27:44.454570 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 2.10562464635 (old: 885.670000, new: 1472.500000) {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=314.96, system=570.69, rss=36134912, vms=1601019904, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=520.93, system=951.55, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] D0719 20:32:29.451004 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 0.00294740555899 (old: 1472.500000, new: 1473.340000) {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=520.93, system=951.55, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.23, system=952.09, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] D0719 20:37:10.152525 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 0.00299250250927 (old: 1473.340000, new: 1474.180000) {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.23, system=952.09, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.55, system=952.61, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] D0719 20:42:58.449567 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: -1.64913818086 (old: 1474.180000, new: 899.790000) {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.55, system=952.61, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2732032, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=322.65, system=577.12, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] {noformat} Note: {noformat} {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} {noformat} means what processes where in the old and new sample, as in old -> new. The map also includes the pid creation time. Therefor, I can role out the problem with pid reuse. The line after that is the ProcessSamples in the old pstree, kept in the object as "pid => ProcessSample", and the line after is the new sample. was (Author: rezam): I added more debug logs to trace what is kept in the process tree: {noformat} D0719 20:23:05.758181 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 0.000365496747142 (old: 885.570000, new: 885.670000) {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=315.39, system=570.16, rss=36704256, vms=1601019904, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=314.96, system=570.69, rss=36134912, vms=1601019904, nice=0, status='sleeping', threads=20)}] D0719 20:27:44.454570 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 2.10562464635 (old: 885.670000, new: 1472.500000) {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=314.96, system=570.69, rss=36134912, vms=1601019904, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=520.93, system=951.55, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] D0719 20:32:29.451004 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 0.00294740555899 (old: 1472.500000, new: 1473.340000) {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=520.93, system=951.55, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.23, system=952.09, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] D0719 20:37:10.152525 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: 0.00299250250927 (old: 1473.340000, new: 1474.180000) {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.23, system=952.09, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.55, system=952.61, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] D0719 20:42:58.449567 24474 process_collector_psutil.py:105] Calculated rate for pid=62328 and children: -1.64913818086 (old: 1474.180000, new: 899.790000) {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2609152, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=521.55, system=952.61, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] [{62328: ProcessSample(rate=0.0, user=0.0, system=0.02, rss=2732032, vms=11919360, nice=0, status='sleeping', threads=1), 866: ProcessSample(rate=0.0, user=322.65, system=577.12, rss=35430400, vms=1600204800, nice=0, status='sleeping', threads=20)}] {noformat} Note: {62328: 1500007014.22, 866: 1500007038.32} -> {62328: 1500007014.22, 866: 1500007038.32} what processes where in the old and new sample, as in old -> new. The line after that is the ProcessSamples in the old pstree, kept in the object as "pid => ProcessSample", and the line after is the new sample. > Thermos landing (host) page reports incorrect CPU rates when it is busy > ----------------------------------------------------------------------- > > Key: AURORA-1939 > URL: https://issues.apache.org/jira/browse/AURORA-1939 > Project: Aurora > Issue Type: Bug > Reporter: Reza Motamedi > Priority: Minor > > Thermos Observer uses `psutil` to monitor resource consumption of Thermos > Processes. On a busy machine, I have noticed negative CPU values when > visiting the Thermos landing page. > In my test I reproduced this by starting many processes that constantly > create short lived children. This indicates that in time between > `process_collector_psutil` looks up the Process children and the time it > calculates the CPU time the pid of the child is actually reused by another > much younger process, which leads to negative CPU times. -- This message was sent by Atlassian JIRA (v6.4.14#64029)