I can confirm that wrapping airflow with tini/dumb-init works pretty well. We've been relying on it at Astronomer for the past year. We run exclusively on k8s and are restarting airflow pods very frequently. Here's our production 1.10.2 image using tini on alpine, for example https://github.com/astronomer/astronomer/blob/master/docker/airflow/1.10.2/Dockerfile#L102 .
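In case it helps, here is a minimal sketch of the pattern (simplified, not the actual Dockerfile linked above; the alpine version, paths and commands are illustrative only) - the important part is the exec form of ENTRYPOINT so tini runs as PID 1 and forwards signals to the airflow process:

    FROM alpine:3.9

    # install tini, a minimal init that reaps children and forwards signals
    RUN apk add --no-cache tini

    # ...install airflow and its dependencies here (omitted in this sketch)...

    # exec form: tini becomes PID 1, CMD is passed to it as arguments,
    # so SIGTERM from k8s actually reaches the airflow process
    ENTRYPOINT ["/sbin/tini", "--"]
    CMD ["airflow", "webserver"]

    # a shell form like `ENTRYPOINT tini -- airflow webserver` would instead
    # make /bin/sh PID 1 and the TERM signal would stop at the shell

With that in place, the SIGTERM k8s sends on pod deletion is delivered to airflow instead of being swallowed by a shell process.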
On Sun, Mar 10, 2019 at 3:49 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> NOTE! I changed the subject to not pollute the AIP-12 thread.
>
> Fokko,
>
> I think I know why TERM signals do not work in the current POD
> implementation. I have already experienced this several times - a
> dockerized app not receiving the TERM signal. The reason was always the
> same. It is not actually a bug - it is expected behaviour when your
> ENTRYPOINT is in the SHELL form ("/binary arg1 arg2") or when you use a
> shell script as the first ENTRYPOINT argument in the [ "shell script",
> "arg" ] form.
>
> In those cases the "shell" process becomes the PID 1 init process.
> Unfortunately, a shell process is not prepared to do all the things a
> proper init process should do:
>
> * Adopt orphaned child processes
> * Reap them
> * Handle and propagate signals properly
> * Wait until all subprocesses are terminated before terminating itself
>
> What happens is that the TERM signal just kills the init "shell" process,
> but then the signal does not reach any of its children and the container
> continues to run. It's a well-known problem in the docker world and there
> are a number of solutions, including exec-ing from the shell or - better -
> using dumb-init/tini in your ENTRYPOINT - very small "proper" init
> implementations that do what an init process should do.
>
> You can read more, for example, here:
> https://hynek.me/articles/docker-signals
>
> J.
>
> On Sun, Mar 10, 2019 at 7:29 PM Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
> > Ps. Jarek, interesting idea. It shouldn't be too hard to make Airflow
> > more k8s native. You could package your dags within your container and
> > do a rolling update. Add the DAGs as the last layer, and then point the
> > DAGs folder to the same location. The hard part here is that you need to
> > gracefully restart the workers. Currently, AFAIK, the signals given to
> > the pod aren't respected. So when the scheduler/webserver/worker
> > receives a SIGTERM, it should stop the jobs nicely and then exit the
> > container before k8s kills the container with a SIGKILL. This will be
> > challenging with the workers, which are potentially long-running. Maybe
> > not kicking off new jobs and letting the old ones finish would be good
> > enough, but then we would need to increase the standard kill timeout
> > substantially. Having this would also enable autoscaling of the workers.
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> E: jarek.pot...@polidea.com

--
*Greg Neiheisel* / CTO Astronomer.io