On the topic of using Docker, I highly recommend looking at Argo Workflows
and some of their sample code: https://github.com/argoproj/argo

tl;dr: it's a workflow management tool where DAGs are expressed as YAML
manifests and tasks are just containers run on Kubernetes. I think there's
a lot of value in Airflow's use of Python rather than a YAML-based DSL,
but I do think containers are the future, and I'm hopeful that Airflow
develops toward being a principled Python framework for managing tasks and
data executed in containers, along with the resulting execution state.
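For a concrete picture of that direction, Airflow's contrib
KubernetesPodOperator (as of the 1.10 line) already gets most of the way
to "tasks are just containers". A minimal sketch, with the namespace and
image name purely illustrative:

    # Each task is an image plus a command; Airflow manages scheduling,
    # retries, and state, while the business logic lives in the image.
    # The namespace and image below are illustrative, not real defaults.
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import (
        KubernetesPodOperator,
    )

    dag = DAG(
        dag_id="containerized_etl",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    extract = KubernetesPodOperator(
        task_id="extract",
        name="extract",
        namespace="airflow",            # illustrative
        image="example.com/etl:1.2.3",  # illustrative
        cmds=["python", "-m", "etl.extract"],
        dag=dag,
    )

The orchestration layer stays Python while the work itself is an opaque,
versioned image - which is essentially the Argo model behind Airflow's DSL.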
On Tue, Feb 26, 2019 at 8:55 PM Maxime Beauchemin
<maximebeauche...@gmail.com> wrote:

> Related thoughts:
>
> * On the topic of serialization, let's be clear whether we're talking
> about unidirectional serialization and *not* deserialization back to the
> object. This works for making the web server stateless, but isn't a
> solution for how DAG definitions get shipped around the cluster (which
> would be nice to have from a system standpoint, but we'd have to break
> lots of dynamic features: things like callbacks and attaching complex
> objects to DAGs, ...).
>
> * Docker as "serialization" is interesting; I looked into the "pex"
> format in the past. It's pretty cool to think of DAGs as micro Docker
> applications that get shipped around and executed. The challenge is that
> it makes it hard to control Airflow's core: upgrading Airflow becomes
> [also] about upgrading the DAG Docker images. We had similar concerns
> with "pex". The data platform team loses their handle on the core, or
> has to get into the Docker-building business, which is atypical. For an
> upgrade, you'd have to ask/force the people who own the DAG Docker
> images to upgrade them, or else they won't run, or something. The
> contract could be something like "we'll only run your Airflow-docker-dag
> container if it's in a certain version range". I think it's a cool idea.
> It gets intricate for the stateless web server though; it's a bit of a
> mind bender :) You could ask the container to render the page (isn't
> that crazy?!), or ask the container for a serialized version of the DAG
> that allows you to render the page (similar to point 1).
>
> * About storing in the db: for efficiency, the primary key should be the
> SHA of the deterministically serialized DAG. Only store a new entry if
> the DAG has changed, and stamp the DagRun with a foreign key into that
> serialized-DAG table. If people have shape-shifting DAGs within DagRuns
> we just do best effort and show them the last one, or something like
> that.
>
> * Everyone hates pickles (including me), but it really almost works; it
> might be worth revisiting, or at least I think it's good for me to list
> out the blockers:
>   * Jinja Template objects are not serializable for some odd, obscure
> reason. I think the community can solve that easily; if someone wants a
> full brain dump on this I can share what I know.
>   * Size: as you pickle something, someone might have attached things
> that recurse into a hundreds-of-GB-sized pickle, e.g. some
> on_failure_callback may bring in the whole Slack API library. That can
> be solved or mitigated in different ways. At some point I thought I'd
> add a DAG.validate() method that makes sure the DAG can be pickled and
> serializes to a reasonably sized pickle. I also think we'd have to make
> sure operators are defined as more "abstract", otherwise the pickle
> includes things like the whole pyhive lib and all sorts of other deps.
> It could be possible to limit what gets attached to the pickle
> (whitelist classes), and dehydrate objects during serialization /
> rehydrate them on the other side (assuming the classes are on the worker
> too). If that sounds crazy to you, it's because it is.
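The DAG.validate() method floated in that last bullet is hypothetical (it
doesn't exist in Airflow); a minimal sketch of the pickle-size check it
describes could look like this, with the size budget and helper name made
up for illustration:

    # Hypothetical sketch of the proposed DAG.validate(): check that a DAG
    # survives pickling and stays under a size budget. MAX_PICKLE_BYTES and
    # validate_dag are illustrative names, not Airflow APIs.
    import pickle

    MAX_PICKLE_BYTES = 1 * 1024 * 1024  # arbitrary 1 MiB budget

    def validate_dag(dag):
        try:
            blob = pickle.dumps(dag, protocol=pickle.HIGHEST_PROTOCOL)
        except Exception as exc:  # e.g. an unpicklable Jinja Template
            raise ValueError(f"DAG {dag.dag_id} is not picklable: {exc}")
        if len(blob) > MAX_PICKLE_BYTES:
            raise ValueError(
                f"DAG {dag.dag_id} pickles to {len(blob)} bytes; something "
                f"large (a callback's closure, a client library, ...) is "
                f"probably attached"
            )
        return blob

Run at DAG-definition time, a check like this would surface both blockers
listed above (unpicklable attachments and runaway size) before anything
ships to the workers.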
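Likewise, the store-by-SHA bullet above can be sketched in a few lines,
assuming JSON as the deterministic unidirectional serialization and a
made-up serialized_dag table (SQLite syntax for brevity):

    # Sketch of "primary key = SHA of the deterministic serialization":
    # serialize with sorted keys so identical DAGs hash identically, and
    # only write a row when the hash is new. Table/column names are made up.
    import hashlib
    import json

    def serialize_dag(dag_dict):
        # sort_keys + fixed separators make the output deterministic
        blob = json.dumps(dag_dict, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(blob.encode("utf-8")).hexdigest(), blob

    def store_serialized_dag(conn, dag_dict):
        sha, blob = serialize_dag(dag_dict)
        # No-op if this exact version is already stored; a DagRun would
        # then carry `sha` as the foreign key to the version it ran with.
        conn.execute(
            "INSERT OR IGNORE INTO serialized_dag (sha, blob) VALUES (?, ?)",
            (sha, blob),
        )
        return sha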
> * The other crazy idea is thinking of the git repo (the code itself) as
> the serialized DAG. There are git filesystems in userspace (FUSE) that
> allow dynamically accessing the git history as if it were just a folder,
> as in `REPO/{ANY_GIT_REF}/dags/mydag.py`. Beautifully hacky. A company
> with a blue logo with a big F on it that I used to work at did that.
> Talk about embracing config-as-code! The DagRun can just stamp the git
> SHA it's running with. [A sketch of this idea follows after the thread.]
>
> Sorry about the confusion; config as code gets tricky around the
> corners. But it's all worth it, right? Right!? :)
>
> On Tue, Feb 26, 2019 at 3:09 AM Kevin Yang <yrql...@gmail.com> wrote:
>
> > My bad, I was misunderstanding a bit and mixing up two issues. I was
> > thinking about the multiple-runs-for-one-DagRun issue (e.g. after we
> > clear the DagRun).
> >
> > This is an orthogonal issue, so the current implementation can work
> > with the long-term plan.
> >
> > Cheers,
> > Kevin Y
> >
> > On Tue, Feb 26, 2019 at 2:34 AM Ash Berlin-Taylor <a...@apache.org>
> > wrote:
> >
> > > > On 26 Feb 2019, at 09:37, Kevin Yang <yrql...@gmail.com> wrote:
> > > >
> > > > Now since we're already trying to have multiple graphs for one
> > > > execution_date, maybe we should just have multiple DagRuns.
> > >
> > > I thought that there is exactly one graph per DAG run - dag_run has
> > > a "graph_id" column.
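A minimal sketch of the git-ref idea from Maxime's last bullet, using
plain `git show` instead of a FUSE mount (the repo path, ref, and helper
name are illustrative):

    # "The git repo is the serialized DAG": fetch dags/mydag.py exactly as
    # it existed at the SHA stamped on a DagRun. `git show REF:path` prints
    # the file at that commit, i.e. what REPO/{ANY_GIT_REF}/dags/mydag.py
    # would expose on a git FUSE filesystem.
    import subprocess

    def dag_source_at_ref(repo_dir, git_ref, rel_path="dags/mydag.py"):
        return subprocess.check_output(
            ["git", "-C", repo_dir, "show", f"{git_ref}:{rel_path}"],
            text=True,
        )

    # e.g. reconstruct the code a given run executed with:
    # source = dag_source_at_ref("/srv/dag-repo", dag_run_git_sha)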