To keep the record: following the meeting, we had some further discussions with Tomek and Kamil, and I think we should not pursue the "Kombu Executor" any more (at least not now, and possibly not at all).
Instead, we should focus on making the Celery executor better and fully supporting auto-scalability - including a "scale-workers-to-zero" feature (!) - utilising Kubernetes auto-scaling under the hood to scale Kubernetes nodes up/down as needed. We think this is fully achievable with the Celery executor running in an auto-scaling Kubernetes cluster, and we even have a working Proof of Concept for it. Tomek tested it for auto-scalability over the past weeks and achieved rock-solid behaviour. I wonder what others think?

Also, Ahmed - maybe you can share your thoughts/feelings about the Fargate Executor and its applicability now (see my thoughts below for context). Same with KNative - I think we went a bit back and forth recently on whether KNative should be implemented at all and whether it is worth the effort and our focus. I'd love others to chime in on that.

There were some super-valid points made by Ash about the rather small difference between the Celery and Kombu executors. Those made me think differently (it's really interesting to me how my thinking about Airflow changed over the last year from "let's go Kubernetes" to "let's go Python"). After the last month or two of investigations around running the Celery executor on top of Kubernetes, we started to kinda like Celery, and we can make a bit more effort to get it to use the scaling capabilities of Kubernetes. So after a bit of thinking and weighing the pros and cons, here are my thoughts:

- *Using just some features of Celery* - While we are not using all the features of Celery yet, we might as well do so for future versions of Celery. There are various things the Celery community is already working on for the upcoming Celery 5 that we are also discussing - asyncio and becoming Kubernetes-native are at the top of the list, I believe. They are also dropping Python 2 support (yay!) and breaking backwards compatibility in Celery 5 (similar to what we are doing in Airflow 2.0).
That might make them move faster and bring new features and optimisations at a much higher rate. We can continue "standing on the shoulders of giants": let them do all the heavy lifting of running tasks while we focus our work on what is really important and differentiating for Airflow. I doubt we could do both better than they do at the same time.

- *Being cloud-native vs. Python-native* - This follows our earlier discussions about Airflow being Python-centric rather than cloud-native or container-centric. Celery seems like the perfect (and default) task queue for us. It focuses on Python, which opens up various optimisation options (starting with asyncio, or sharing a loaded Python interpreter in memory by using forks). Those might be more difficult to achieve with a "generic" task runner - be it KNative, Kubernetes, or anything else with "container"-level granularity.

- *Optimisation capabilities* - We all know that containers are great as a deployment and isolation model, but we also know they have certain limitations. I think (shared) memory usage is especially problematic. If you have a lot of shared/heavy libraries loaded in memory, the behaviour depends on the container drivers you use, and you might need some tricks (separate containers with shared libraries, etc.) to share them. This makes deployment models much more complex and kind of defeats the isolation property of containers. We have many running tasks that are very, very similar in terms of required dependencies, Python interpreter, and loaded libraries, and "process" isolation is what we really need, not "container" isolation. I think the benefits we get from container isolation are few compared to the benefits we can get from better optimisation of resource usage between a number of tasks running in the same container and from leveraging new Python features (such as asyncio).
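To make the fork-based sharing point concrete, here is a tiny, hedged illustration (plain stdlib Python, not Airflow code; the "heavy library" is simulated with a list). With the "fork" start method, worker processes inherit the parent's already-loaded modules and data copy-on-write, so N workers do not pay N times the import and memory cost - this is the "process isolation" we get essentially for free:

```python
# Illustration only: fork-based workers share the parent's loaded data
# via copy-on-write (Linux/macOS; "fork" is not available on Windows).
import multiprocessing as mp

# Imagine this is a heavy library or model loaded once in the parent.
SHARED = {"model": list(range(100_000))}

def worker(_):
    # The child reads the parent's data directly - nothing was pickled
    # or re-imported; memory pages are shared until written to.
    return len(SHARED["model"])

ctx = mp.get_context("fork")
with ctx.Pool(4) as pool:
    results = pool.map(worker, range(4))
print(results)  # every forked worker sees the same preloaded data
```

With container-level granularity, each of those four workers would typically carry its own copy of the interpreter and libraries instead.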
Side comment - after discussions with Kamil and Tomek on asyncio, I think we are not ready for it yet and should not pursue it for Airflow 2.0, as it would require changing a number of operators. But it is definitely something we could consider past Airflow 2.0, and one thing that might bring huge improvements in Airflow's worker performance. If we can get Celery 5 with AsyncIO support built in, then combining it with AsyncIO support in Airflow might be really interesting. Still, I don't think it's something we should actively work on yet. It would bring a lot of disruption in the current "operators" area, and I am afraid we have no capacity for that as a project.

- *Easy deployment models for our users* - We can implement a number of best practices and some incremental improvements in the way the Celery executor is implemented. We have already started some work there; it revolves around signal handling, message batching, and possibly merging the Celery and Airflow databases into one. And we have a working implementation of auto-scaling Celery Airflow on top of an auto-scaling Kubernetes cluster. In the near future we can have an official, production-ready Helm chart based on the future production-ready Apache Airflow Docker image. This is all doable in the Airflow 2.0 timeframe, and I think it is the best use of our time in the "scalability" area - rather than developing a new executor. Adding a new executor adds significant complexity and requires a lot of documentation, best practices, etc., so it is much more effort than improving the Celery one. Also, it is nice that the "Kombu executor" is generic from a developer's point of view, but that is not a benefit at all from our users' point of view. Being a little more opinionated about the available deployment models actually makes it easier for users to build stable and reliable deployments. There is a host of new problems users might hit when using "any" queue as the queue backend, for example.
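For the record, the core of the "scale-workers-to-zero" idea is simple: map queue depth to a desired worker replica count and let Kubernetes do the rest. A minimal sketch - the function name, tasks-per-worker ratio, and cap here are illustrative assumptions, not the actual PoC code:

```python
def desired_workers(queued_tasks: int,
                    tasks_per_worker: int = 16,
                    max_workers: int = 10) -> int:
    """Desired number of Celery worker pods for the current queue depth."""
    if queued_tasks <= 0:
        return 0  # scale-workers-to-zero: no queued work -> no worker pods
    needed = -(-queued_tasks // tasks_per_worker)  # ceiling division
    return min(needed, max_workers)  # never exceed the configured budget

print(desired_workers(0), desired_workers(5), desired_workers(40))  # 0 1 3
```

In the PoC, a value like this would drive the worker Deployment's replica count, and the Kubernetes cluster autoscaler then adds or removes nodes as worker pods appear and disappear.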
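On the asyncio side comment above: the reason it could bring big worker-performance wins is that "wait-and-poke" style tasks mostly block on I/O, so a single event loop can interleave many of them instead of dedicating a worker slot to each. A toy, hedged sketch - the `poke` function is invented for illustration, not an Airflow API:

```python
import asyncio
import time

async def poke(task_id: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a non-blocking external check
    return f"{task_id}: done"

async def main():
    # Ten 0.1s "sensors" share one event loop, so they all wait concurrently
    # and finish in roughly one delay's worth of wall-clock time, not ten.
    return await asyncio.gather(*(poke(f"t{i}", 0.1) for i in range(10)))

start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start
print(len(results), elapsed < 0.5)
```

Getting existing blocking operators to this shape is exactly the disruptive rewrite mentioned above, which is why it feels like post-2.0 work.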
- *Thinking about current users* - Last but not least: from the last survey, we already have a significant user base using the Celery executor - without any plans to move to Kubernetes or other deployment models. I think as a community we are obliged to support those users and bring improvements in a way that will be easiest for them to adopt. Clearly, completely changing deployment models is not an easy thing for existing users to swallow. Also, focusing our efforts on new areas would drag our focus away from improving the Celery executor for existing users. That would not be nice for the users relying on it, and it might undermine their trust in Airflow and the stability of the project. Keeping our focus where most of our users have a vested interest is really good for them - especially since there are no really big problems with Celery, except slightly more complexity and one more "entity" to look after.

Let me know what you think!

J.

On Thu, Jan 9, 2020 at 2:36 AM Daniel Imberman <[email protected]> wrote:

> Hello all!
>
> We will be meeting tomorrow for the #sig-autoscaling monthly meetup.
> In this chat we'll be discussing progress in using Knative, Kombu, and
> Kubernetes task-pooling to improve the story of airflow
> autoscaling/efficiency!
>
> We will be meeting at 8:00AM PST
>
> Zoom link here
>
> https://astronomer.zoom.us/j/275854216
>
> See you there!

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129
