Hello everyone,

One thing has bugged me for quite some time, and I wanted to start a discussion about a problem I keep seeing in the issues and discussions raised, both in GitHub Issues/Discussions and on Stack Overflow: the dependency of Airflow on the resources available in K8S, and the difficulty our users have diagnosing it.

TL;DR: I wonder if there is anything we can do to provide better diagnostics of resource problems to some of our less experienced users.

I could provide quite a few links to detailed issues where I am either quite sure, or it is very likely, that the problem is with resources (memory seems to be the most likely reason), but I also have my own past experiences to share (not only with Airflow).

This is - obviously - not an Airflow problem, and we might say so and ignore it. The root cause is the relative complexity of the K8S environment, our users' lack of understanding of how to monitor and check for resource failures in K8S, and also the fact that a task like "here, get Airflow up and running on our K8S cluster" is likely to be given to relatively inexperienced employees who might not yet have much understanding of what they are doing.

But I also have not forgotten the times when I was running stuff on K8S for the first time with Helm, and things kept failing in mysterious ways when not enough memory was given to the K8S cluster. If you knew where to look, yes, you could find it out, but if you are relatively new to the K8S ecosystem, more often than not you are left in the dark. Many people expect **something** to be printed in the Airflow logs to tell them what is wrong (this is what I have heard quite a few times in our users' comments), especially when they use Helm to install Airflow: they often treat it as a "package manager" (and they are pretty much right about this).

Of course, this mostly concerns users who are new and trying Airflow out, and those are not the "biggest" users of Airflow. We might choose not to care about it too much and leave it to the users to learn. But I think we might miss a lot of opportunities to grow our user base here, and it might also hurt the impression of Airflow (for those users it is really "Airflow" that does not work, no matter what the root cause is).

I wonder what others think about it: is it really as big a problem as I think it is? Maybe someone has past experience or knows of ways we could mitigate this, give such users better, more immediate and more precise feedback, and handle those errors nicely (even if it means some extra monitoring components etc.). Does anyone have experience with that and can suggest something?

J.