andreahlert commented on PR #61627:
URL: https://github.com/apache/airflow/pull/61627#issuecomment-3910446088
> I would really like @ashb @kaxil and @amoghrajesh to review this and figure out what the desired signal handling in this case is.
>
> It might have quite a profound impact on how our signal handling is implemented in Airflow.
>
> So far, for any component except `celery worker`, we have relied on signal propagation to all processes in the process group via the `dumb-init` entrypoint process - all our images have their entrypoint (and thus init process) set to `dumb-init`.
>
> Our images have `dumb-init` set as the entrypoint, and the default behaviour is that signals received by dumb-init are forwarded to all processes in the main process group. This means that if a process does not create a new process group when it forks its children, all such children will receive the signal.
>
> This was a deliberate decision, because a lot of potential tasks and parsing-process libraries might not properly propagate signals. A typical example is bash tasks: bash will silently swallow signals and will not propagate them to its child processes unless specific handlers are configured in the bash script. A lot of people do not know this happens and start bash subprocesses without realising that signals will not be propagated - and since Airflow is pretty versatile and you can generally write any code, we decided to send signals to all processes in the group (which is the default dumb-init behaviour).
>
> The only exception is the celery worker, because the celery worker handles signals deliberately - the first signal it receives puts the worker into offline mode, where it does not accept any new tasks and waits for all running tasks to complete. This is used, for example, by the KEDA autoscaler to automatically scale celery workers down. Celery handles propagating signals to its subprocesses, including escalation with timeouts in case a signal is not acted upon - so even if a bash subprocess swallows the signal, celery will eventually escalate to SIGKILL and kill such tasks (but if tasks handle signals, they will be stopped nicely and cleanly before escalation).
>
> So instead of sending signals to the whole process group, celery is configured so that only the master worker receives the signal and takes care of propagation - `DUMB_INIT_SETSID` should be set to 0 for celery pods. This is how it is implemented in our helm chart, and the documentation of the Airflow Docker image describes this behaviour:
>
> https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
>
> I am not sure what the thinking about the supervisor's behaviour was when it was implemented - whether it handles propagation of signals to its child processes itself, or whether it relies on the process group receiving the signals - but I guess the latter. It would however be great to document the behaviour and expectations for signal handling in this case, because I see two assumptions here that are wrong:
>
> 1. The supervisor is assumed to be an init process - which I think is pretty wrong. A supervisor in Python should never be an init process, unless it implements everything an init process should do. An init process has to do more than signal propagation: it should also reap orphan processes left behind by other subprocesses (so-called zombies), and it should wait for all the processes it sent signals to before exiting - otherwise they will not be terminated cleanly by the kernel, which might mean filesystem corruption, for example.
> This is why all docker containers (including the Airflow reference image) should have a "real" init process as the entrypoint - in our case this is `dumb-init` - and it is absolutely not recommended for any production image to have an `init` process that is not a "real" init process (some people set the bash interpreter as the entrypoint, and that is absolutely wrong).
> 2. It is assumed that the init process does not forward the signal to the whole process group of the forked command - which is exactly what happens in the Airflow reference image.
>
> I think before we continue with this PR we should at the very least get these assumptions settled and agreed on.

Hey @potiuk, thanks for adding this here. It makes sense, and I agree we should align on assumptions first.

My intent with this PR is only to guarantee that `on_kill()` is reached when the supervisor receives SIGTERM in Kubernetes worker scenarios. I did not intend to make the supervisor an init replacement.

I propose we explicitly document the expected model for this path:

1. `dumb-init` remains responsible for process-group signal behavior.
2. The supervisor's expectations for forwarding signals to task subprocesses are documented.
3. We confirm whether this PR should rely only on process-group propagation, or keep explicit forwarding as a safety mechanism (rough sketch at the end of this comment).

If you all agree, I can update this PR with docs/tests aligned to the chosen model.
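For reference, here is a small, purely illustrative repro of the process-group behaviour described above (nothing here is part of the PR; the command and the timings are arbitrary):

```python
import os
import signal
import subprocess
import time

# Without an explicit `trap`, bash does not forward SIGTERM to the `sleep` it
# is waiting on ("; true" stops bash from simply exec-ing the single command).
proc = subprocess.Popen(
    ["bash", "-c", "sleep 300; true"],
    start_new_session=True,  # bash and its children get their own process group
)
time.sleep(1)

# Signalling only bash leaves the inner `sleep` running as an orphan:
# os.kill(proc.pid, signal.SIGTERM)

# Signalling the whole process group (dumb-init's default behaviour) reaches
# both bash and `sleep`:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)

proc.wait()
```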
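And roughly the kind of explicit forwarding I mean in point 3 - a minimal sketch only, not the actual supervisor code; the grace period and the dummy task are placeholders:

```python
import signal
import subprocess
import sys
import time

ESCALATION_DELAY = 30  # seconds; placeholder value

# Placeholder for the task subprocess the supervisor launched.
task = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(3600)"])

terminate_deadline = None

def handle_sigterm(signum, frame):
    global terminate_deadline
    if terminate_deadline is None:
        # Forward the signal explicitly instead of relying only on
        # process-group delivery, so the task can run its cleanup path.
        task.send_signal(signal.SIGTERM)
        terminate_deadline = time.monotonic() + ESCALATION_DELAY

signal.signal(signal.SIGTERM, handle_sigterm)

# Simplified supervision loop: wait for the task and escalate to SIGKILL if it
# is still alive once the grace period after SIGTERM has expired.
while task.poll() is None:
    if terminate_deadline is not None and time.monotonic() > terminate_deadline:
        task.kill()
    time.sleep(0.5)

sys.exit(0 if task.returncode == 0 else 1)
```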
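Finally, for completeness: the orphan/zombie reaping that a real init process such as `dumb-init` takes care of - and that I agree the supervisor should not try to reimplement - looks roughly like this:

```python
import os
import signal

def reap_children(signum, frame):
    # A real init process reaps any child that exits (including orphans that
    # were re-parented to it) so they do not linger as zombies.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left
        if pid == 0:
            return  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap_children)
```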
