For the record: following the meeting, we had some further discussions with
Tomek and Kamil, and I think we should not pursue the "Kombu Executor" any
further (at least not now, and possibly not at all).

Instead, we should focus on making the Celery executor better and fully
supporting auto-scalability (including a "scale-workers-to-zero" feature!),
utilising Kubernetes auto-scaling under the hood to scale Kubernetes nodes
up/down as needed. We think this is fully achievable with the Celery
executor running in an auto-scaling Kubernetes cluster, and we even have a
working Proof of Concept. Tomek tested it for auto-scalability over the
past weeks and achieved rock-solid behaviour.
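For illustration, here is a minimal, hypothetical sketch of the kind of
scaling decision involved (not the actual PoC code - the function name and
the tasks-per-worker heuristic are my assumptions):

```python
import math

def desired_worker_count(queued_tasks: int, tasks_per_worker: int,
                         max_workers: int) -> int:
    # Hypothetical scaling rule: run just enough workers to drain the
    # queue, capped at max_workers, and scale to zero workers when the
    # queue is empty (the "scale-workers-to-zero" case). A Kubernetes
    # cluster autoscaler can then remove the now-idle nodes.
    if queued_tasks <= 0:
        return 0
    return min(max_workers, math.ceil(queued_tasks / tasks_per_worker))
```

The point being that once worker count can legitimately reach zero, node
count can follow it down.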

I wonder what others think? Ahmed - maybe you can share your thoughts and
feelings about the Fargate Executor and its applicability now (see my notes
below for context). The same goes for KNative - I think we went back and
forth recently on whether KNative should be implemented at all, and whether
it is worth the effort and our focus. I'd love others to chime in with
their thoughts on that.

Ash made some super-valid points about the rather small difference between
the Celery and Kombu executors, and those made me think differently. (It's
really interesting to me how my thinking about Airflow has changed over the
last year, from "let's go Kubernetes" to "let's go Python".) After the last
month or two of investigations into running the Celery executor on top of
Kubernetes, we have started to quite like Celery, and with a bit more
effort we can make it use the scaling capabilities of Kubernetes.

So after a bit of thinking and weighing pros/cons, here are my thoughts:

   - *Using just some features of Celery*: While we are not using all the
   features of Celery yet, we might as well do so in future versions. There
   are various things the Celery community is already working on for the
   upcoming Celery 5 that we are also discussing - asyncio and becoming
   Kubernetes-native are at the top of the list, I believe. They are also
   dropping Python 2 support (yay!) and breaking backwards compatibility in
   Celery 5 (similar to what we are doing in Airflow 2.0). That might let
   them move faster and bring new features and optimisations at a much
   higher rate. We can continue "standing on the shoulders of giants" - let
   them do all the heavy lifting of running tasks while we focus our work
   on what is really important and differentiating for Airflow. I doubt we
   can beat them at their own game while doing our job at the same time.


   - *Being cloud-native vs. Python-native*: This follows our earlier
   discussions about Airflow being Python-centric rather than cloud-native
   or container-centric. Celery seems like the perfect (and default) task
   queue for us. It focuses on Python, which opens up various optimisation
   options (starting with asyncio, or sharing a loaded Python interpreter
   in memory by using forks). Those might be more difficult to achieve with
   a "generic" task runner - be it KNative, Kubernetes, or anything else
   that has "container"-level granularity.
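To illustrate the fork-based interpreter sharing mentioned above - a
minimal sketch, assuming a Linux-style fork(); the "heavy table" is a
stand-in for large imported libraries:

```python
import os

# Hypothetical "heavy" state loaded once in the parent process, standing
# in for large imported libraries. After os.fork() the child sees it via
# copy-on-write memory pages - no re-import, no re-initialisation.
HEAVY_TABLE = {i: i * i for i in range(100_000)}

def run_task_in_fork(task_id: int) -> bool:
    """Run a trivial 'task' in a forked child; return True on success."""
    pid = os.fork()
    if pid == 0:
        # Child: the loaded interpreter and HEAVY_TABLE are already here.
        ok = HEAVY_TABLE[task_id] == task_id * task_id
        os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0
```

A container-granular runner would instead pay the full import cost in
every task container.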


   - *Optimisation capabilities*: We all know containers are great as a
   deployment and isolation model, but we also know they have certain
   limitations. I think (shared) memory usage is especially problematic:
   when you have a lot of shared, heavy libraries loaded in memory, the
   behaviour depends on the container drivers you use, and sharing that
   memory may require tricks (separate containers with shared libraries,
   etc.). This makes deployment models much more complex and partly defeats
   the isolation property of containers. Our running tasks are very, very
   similar in terms of required dependencies, Python interpreter, and
   loaded libraries, so "process" isolation is what we really need, not
   "container" isolation. I think the benefits of container isolation are
   few and far between compared to the benefits of better resource-usage
   optimisation across tasks running in the same container, plus leveraging
   new Python features (such as asyncio). Side comment: after discussions
   with Kamil and Tomek on asyncio, I think we are not ready for it yet and
   should not pursue it for Airflow 2.0, as it would require changing a
   number of operators; but it is definitely something we could consider
   past Airflow 2.0, and one thing that might bring huge improvements in
   Airflow's worker performance. If Celery 5 ships with built-in AsyncIO
   support, combining it with AsyncIO support in Airflow could be really
   interesting. Still, it's not something we should actively work on yet:
   it would bring a lot of disruption in the current "operators" area, and
   I am afraid we have no capacity for that as a project.
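To make the asyncio idea concrete - a toy sketch (hypothetical "operators",
not real Airflow ones) of how one worker process could interleave many
I/O-bound tasks:

```python
import asyncio

# Hypothetical async "operators": each awaits I/O instead of blocking,
# so a single worker process can interleave many tasks concurrently.
async def fake_operator(name: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return f"{name}: done"

async def run_tasks():
    # All three sleeps overlap: total wall time is roughly the longest
    # delay, not the sum of the delays.
    return list(await asyncio.gather(
        fake_operator("extract", 0.02),
        fake_operator("transform", 0.01),
        fake_operator("load", 0.01),
    ))
```

This is exactly the part that would require reworking many existing
operators, which is why I don't think it fits the 2.0 timeframe.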


   - *Easy deployment models for our users*: We can implement a number of
   best practices and incremental improvements in the way the Celery
   executor is implemented. We have already started some work there; it
   revolves around signal handling, message batching, and possibly merging
   the Celery and Airflow databases into one. We also have a working
   implementation of auto-scaling Celery Airflow on top of an auto-scaling
   Kubernetes cluster. In the near future we can have an official,
   production-ready Helm chart based on the future production-ready Apache
   Airflow Docker image. This is all doable in the Airflow 2.0 timeframe,
   and I think it is a better use of our time in the "scalability" area
   than developing a new executor. Adding a new executor adds significant
   complexity and requires a lot of documentation, best practices, etc., so
   it is far more effort than improving the Celery one. Also, while it is
   nice that the "Kombu executor" is generic from a developer's point of
   view, that is not a benefit at all from our users' point of view. Being
   a little more opinionated about the available deployment models actually
   makes it easier for users to build stable and reliable deployments.
   There is a host of new problems users might hit when using "any" queue
   as the queue backend, for example.
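As a small illustration of the signal-handling direction - a hypothetical
sketch (not the actual Airflow implementation) of a worker that drains
gracefully on SIGTERM, which matters a lot when Kubernetes scales pods
down:

```python
import signal

class GracefulWorker:
    # Hypothetical sketch: on SIGTERM the worker finishes the task it is
    # currently running and stops pulling new ones, instead of dying
    # mid-task and leaving the task in a half-done state.
    def __init__(self):
        self.should_stop = False
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        self.should_stop = True

    def run(self, tasks):
        completed = []
        for task in tasks:
            if self.should_stop:
                break  # drain: no new tasks after SIGTERM
            completed.append(task())
        return completed
```

Getting this kind of behaviour right (together with sensible termination
grace periods) is much cheaper than building a whole new executor.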


   - *Thinking about current users*: Last but not least, the last survey
   shows we already have a significant user base using the Celery executor,
   many without any plans to move to Kubernetes or other deployment models.
   I think as a community we are obliged to support those users and bring
   improvements in a way that is easiest for them to incorporate. Clearly,
   completely changing deployment models is not an easy thing for existing
   users to swallow. Focusing our efforts on new areas would also drag our
   focus away from improving the Celery executor for those users; that
   would not be nice for the people relying on it, and might undermine
   their trust in Airflow and the stability of the project. Keeping our
   focus where most of our users have a vested interest is really good for
   them - especially since there are no really big problems with Celery,
   apart from slightly more complexity and one more "entity" to manage.

Let me know what you think!

J.



On Thu, Jan 9, 2020 at 2:36 AM Daniel Imberman <[email protected]>
wrote:

> Hello all!
>
> We will be meeting tomorrow for the #sig-autoscaling monthly meetup.
> In this chat we’ll be discussing progress in using Knative, Kombu, and
> Kubernetes task-pooling to improve the story of airflow
> autoscaling/efficiency!
>
> We will be meeting at 8:00AM PST
>
> Zoom link here
>
> https://astronomer.zoom.us/j/275854216
>
> See you there!



-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129