If we are in an API server that uses AnyEvent, we currently postpone the
task cleanup because the `active_workers` call may lead to a problem
described in [1]. This, however, is only relevant to the
`log_task_result` part of the `worker_reaper` sub, since that calls
`active_workers`.

The problem with postponing `waitpid` is that our task status endpoint
checks the proc file for the PID, and without the `waitpid` call, the
process sticks around as a zombie (`Z`). This leads to the status being
reported as 'running' even though it is not.

If the API is called rapidly, this postponement is pushed back
potentially indefinitely. This is generally problematic, but
specifically for a use-case where a task is started and the `/status`
endpoint is polled to check for task completion. In that case, if the
polling is fast enough, it prevents the cleanup and therefore the
correct status reporting.

I could reproduce this pretty consistently with:

```perl
my $upid = self->client()->put('/some/task/starting/endpoint', {});
do {
    $res = $self->client()->get("/nodes/localhost/tasks/$upid/status");
    # sleep 1
} while ($res->{status} eq "running");
```

Adding a (long enough) sleep prevented the problem. With `0.5`, the task
was correctly reported as done about 30-60s after it actually finished;
with `0.1`, it took a couple of minutes. Without a `sleep`, tasks would
never (>5-10 mins) be reported as done.

[1] 6870afa4 ("RESTEnvironment: better SIGCHLD handling in AnyEvent event loop")

Signed-off-by: Hannes Laimer <[email protected]>
---
 src/PVE/RESTEnvironment.pm | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/src/PVE/RESTEnvironment.pm b/src/PVE/RESTEnvironment.pm
index 81d7e29..4ed5c05 100644
--- a/src/PVE/RESTEnvironment.pm
+++ b/src/PVE/RESTEnvironment.pm
@@ -78,7 +78,16 @@ my $worker_reaper = sub {
         if (defined($waitpid) && ($waitpid == $pid)) {
             my $info = $WORKER_PIDS->{$pid};
             if ($info && $info->{upid} && $info->{user}) {
-                &$log_task_result($info->{upid}, $info->{user}, $?);
+                # when we're using AnyEvent, we have to postpone the call to 
log_task_result, otherwise it
+                # might interfere with running api calls
+                my $exit_code = $?;
+                if (defined($AnyEvent::MODEL)) {
+                    AnyEvent::postpone {
+                        $log_task_result->($info->{upid}, $info->{user}, 
$exit_code)
+                    };
+                } else {
+                    $log_task_result->($info->{upid}, $info->{user}, 
$exit_code);
+                }
             }
             delete($WORKER_PIDS->{$pid});
         }
@@ -115,13 +124,7 @@ sub init {
         if !$type || $type !~ m/^(cli|pub|priv|ha)$/;
 
     $SIG{CHLD} = sub {
-        # when we're using AnyEvent, we have to postpone the call to 
worker_reaper, otherwise it
-        # might interfere with running api calls
-        if (defined($AnyEvent::MODEL)) {
-            AnyEvent::postpone { $worker_reaper->() };
-        } else {
-            $worker_reaper->();
-        }
+        $worker_reaper->();
     };
 
     # environment types
-- 
2.47.3




Reply via email to