Re: [DISCUSS] AIP-12 Persist DAG into DB

Maxime Beauchemin Tue, 26 Feb 2019 17:56:13 -0800

Related thoughts:

* on the topic of serialization, let's be clear about whether we're talking
about unidirectional serialization, meaning *not* deserializing back to the
object. That works for making the web server stateless, but it doesn't solve
how DAG definitions get shipped around the cluster (which would be nice
to have from a system standpoint, but we'd have to break lots of dynamic
features, things like callbacks and attaching complex objects to DAGs, ...)

* docker as "serialization" is interesting, I looked into the "pex" format in
the past. It's pretty cool to think of DAGs as micro docker applications
that get shipped around and executed. The challenge with this is that it
makes it hard to control Airflow's core. Upgrading Airflow becomes [also]
about upgrading the DAG docker images. We had similar concerns with "pex".
The data platform team loses their handle on the core, or has to get into
the docker building business, which is atypical. For an upgrade, you'd have
to ask/force the people who own the DAG dockers to upgrade their images,
else they won't run or something. Contract could be like "we'll only run
your Airflow-docker-dag container if it's in a certain version range" or
something like that. I think it's a cool idea. It gets intricate for the
stateless web server though, it's a bit of a mind bender :) You could ask
the docker to render the page (isn't that crazy?!) or ask the docker for a
serialized version of the DAG that allows you to render the page (similar
to point 1).

* About storing in the db, for efficiency, the pk should be the SHA of the
deterministically serialized DAG. Only store a new entry if the DAG has
changed, and stamp the DagRun with a FK to that serialized DAG table. If
people have shapeshifting DAGs within DagRuns we just do best effort, show
them the last one or something like that.
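
To make the SHA-as-pk idea concrete, here is a rough sketch, not a proposal
for the actual schema. It assumes the DAG has already been reduced to a plain
dict, and uses an in-memory dict as a stand-in for the table. The names
`serialize_dag`, `dag_sha` and `SerializedDagStore` are made up for
illustration.

```python
import hashlib
import json


def serialize_dag(dag_dict):
    """One-way, deterministic serialization: sorted keys, fixed separators."""
    return json.dumps(dag_dict, sort_keys=True, separators=(",", ":"))


def dag_sha(dag_dict):
    """SHA-256 of the deterministic serialization, used as the primary key."""
    return hashlib.sha256(serialize_dag(dag_dict).encode("utf-8")).hexdigest()


class SerializedDagStore:
    """In-memory stand-in for a serialized DAG table keyed by SHA (hypothetical)."""

    def __init__(self):
        self.rows = {}  # sha -> serialized JSON

    def upsert(self, dag_dict):
        """Store a new row only if this exact DAG has not been seen; return its SHA."""
        sha = dag_sha(dag_dict)
        if sha not in self.rows:
            self.rows[sha] = serialize_dag(dag_dict)
        return sha


store = SerializedDagStore()
v1 = {"dag_id": "example", "tasks": ["extract", "load"]}
same_v1 = {"tasks": ["extract", "load"], "dag_id": "example"}  # same DAG, other key order
v2 = {"dag_id": "example", "tasks": ["extract", "transform", "load"]}

sha_v1 = store.upsert(v1)
sha_again = store.upsert(same_v1)  # unchanged DAG: no new row
sha_v2 = store.upsert(v2)          # changed DAG: new row

# The DagRun is stamped with the SHA, which acts as the FK into the table.
dag_run = {"run_id": "run_1", "serialized_dag_sha": sha_v1}
```

The serialization has to be deterministic (here, sorted keys) or the same DAG
would hash differently on every parse and the dedup would never kick in.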

* everyone hates pickles (including me), but it really almost works, might
be worth revisiting, or at least I think it's good for me to list out the
blockers:
    * JinjaTemplate objects are not serializable for some odd obscure
reason, I think the community can solve that easily, if someone wants a
full brain dump on this I can share what I know
    * Size: as you pickle something, someone might have attached things
that recurse into a pickle hundreds of GBs in size. For example, some
on_failure_callback may bring in the whole Slack api library. That can be
solved or mitigated in different ways. At some point I thought I'd have a
DAG.validate() method that makes sure that the DAG can be pickled, and
serialized to a reasonably sized pickle. I also think we'd have to make sure
operators are defined as more "abstract", otherwise the pickle includes
things like the whole pyhive lib and all sorts of other deps. It could be
possible to limit what gets attached to the pickle (whitelist classes), and
dehydrate objects during serialization and rehydrate them on the other
side (assuming classes are on the worker too). If that sounds crazy to you,
it's because it is.
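
For what it's worth, the validate idea could look roughly like the sketch
below. This is not an existing Airflow method: `validate_picklable`,
`DagValidationError` and the 1 MB limit are all assumptions picked for
illustration, and it checks any Python object rather than a real DAG.

```python
import pickle

MAX_PICKLE_BYTES = 1_000_000  # arbitrary limit, chosen only for this sketch


class DagValidationError(Exception):
    """Raised when an object cannot be pickled or its pickle is too large."""


def validate_picklable(obj, max_bytes=MAX_PICKLE_BYTES):
    """Return the pickle size in bytes, or raise DagValidationError."""
    try:
        payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    except Exception as exc:  # pickle can raise PicklingError, TypeError, AttributeError
        raise DagValidationError(f"object cannot be pickled: {exc}") from exc
    if len(payload) > max_bytes:
        raise DagValidationError(
            f"pickle is {len(payload)} bytes, over the limit of {max_bytes}"
        )
    return len(payload)


# A small plain structure passes and reports its size.
size = validate_picklable({"dag_id": "example", "tasks": ["extract", "load"]})
```

Something like a lambda used as a callback fails the first check, and a big
attached object fails the size check, which is about as far as a check like
this can go without whitelisting classes.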

* the other crazy idea is thinking of the git repo (the code itself) as the
serialized DAG. There are git filesystems in userspace [FUSE] that allow
dynamically accessing the git history like it's just a folder, as in
`REPO/{ANY_GIT_REF}/dags/mydag.py` . Beautifully hacky. A company with a
blue logo with a big F on it that I used to work at did that. Talk about
embracing config-as-code! The DagRun can just stamp the git SHA it's
running with.
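
A tiny sketch of how the stamped SHA would resolve to a file, assuming such a
FUSE mount exists. The mount point `/mnt/repo`, the placeholder SHA and the
function name `dag_file_for_ref` are hypothetical, and nothing here touches
git itself.

```python
from pathlib import PurePosixPath


def dag_file_for_ref(mount_root, git_ref, relative_path="dags/mydag.py"):
    """Build REPO/{ANY_GIT_REF}/dags/mydag.py for a given ref under the mount."""
    return PurePosixPath(mount_root) / git_ref / relative_path


# The DagRun stamps the git SHA it ran with (placeholder value for illustration).
dag_run = {"run_id": "run_1", "git_sha": "abc123"}
dag_path = dag_file_for_ref("/mnt/repo", dag_run["git_sha"])
```

Stamping a SHA rather than a branch name matters here, since a branch can
move between the time the DagRun starts and the time someone looks at it.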

Sorry about the confusion, config as code gets tricky around the corners.
But it's all worth it, right? Right!? :)

On Tue, Feb 26, 2019 at 3:09 AM Kevin Yang <yrql...@gmail.com> wrote:

> My bad, I was misunderstanding a bit and mixing up two issues. I was
> thinking about the multiple runs for one DagRun issue (e.g. after we clear
> the DagRun).
>
> This is an orthogonal issue. So the current implementation can work in the
> long term plan.
>
> Cheers,
> Kevin Y
>
> On Tue, Feb 26, 2019 at 2:34 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> >
> > > On 26 Feb 2019, at 09:37, Kevin Yang <yrql...@gmail.com> wrote:
> > >
> > > Now since we're already trying to have multiple graphs for one
> > > execution_date, maybe we should just have multiple DagRun.
> >
> > I thought that there is exactly 1 graph for a DAG run - dag_run has a
> > "graph_id" column
>
