[DISCUSS] Improving diagnostic of resource problems with K8S

Jarek Potiuk Fri, 28 Jan 2022 07:45:50 -0800

Hello everyone,

One thing has bugged me for quite some time, and I wanted to start a
discussion on a problem I keep seeing raised, both in GitHub
Issues/Discussions and on Stack Overflow: Airflow's dependency on the
resources available in K8S, and the difficulty our users have diagnosing
it.


TL;DR: I wonder if there is anything we can do to provide better
diagnostics of resource problems to some of our less experienced users.

I could provide quite a few links to detailed issues where I am either
quite sure, or think it very likely, that the problem is with resources
(memory seems to be the most likely culprit), but I also have my own past
experiences to share (not only with Airflow).

This is - obviously - not an Airflow problem, and we might say so and
ignore it.

The root cause is the relative complexity of the K8S environment, our
users' lack of understanding of how to monitor and check for resource
failures in K8S, and also the fact that a task like "here, get Airflow up
and running on our K8S cluster" is likely to be given to relatively
inexperienced employees who might not yet have much understanding of what
they are doing.

But I also have not forgotten the times when I was running stuff on K8S
for the first time with Helm, where things kept on failing in mysterious
ways when not enough memory was given to the K8S cluster. If you knew
where to look - yeah, you could find it out, but if you are relatively new
to the K8S ecosystem, more often than not you are left in the dark.

Many people might expect **something** to be printed in the Airflow logs
to tell them what's wrong (this is what I have heard quite a few times in
our users' comments), especially when they use "helm" to install Airflow:
they often treat it as a "package manager" (and they are pretty much right
about this).

Of course - this mostly concerns users who are new and trying Airflow
out, and those are not the "biggest" users of Airflow. We might choose not
to care about it too much and leave it to the users to learn. But I think
we might miss a lot of opportunities for growing our user base here, and
it might also hurt the impression people have of Airflow (for those users
it's really "Airflow" that does not work, no matter what the root cause
is).

I wonder what others think about it - is it really as big a problem as I
think it is? Maybe someone has past experience with this, or knows of ways
we could mitigate it, give better, more immediate and more precise
feedback to such users, and handle those errors nicely (even if it might
mean some extra monitoring components etc.). Can anyone suggest something?

J.
