On the topic of using Docker, I highly recommend looking at Argo Workflows
and some of their sample code: https://github.com/argoproj/argo

tl;dr: it's a workflow management tool where DAGs are expressed as YAML
manifests and tasks are just containers run on Kubernetes. I think there's
a lot of value in Airflow's use of Python rather than a YAML-based DSL,
but I do think containers are the future, and I'm hopeful that Airflow
develops toward being a principled Python framework for managing tasks and
data executed in containers, along with the resulting execution state.
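For a concrete picture of that direction, Airflow's contrib
KubernetesPodOperator (as of the 1.10 line) already gets most of the way
to "tasks are just containers". A minimal sketch, with the namespace and
image name purely illustrative:

    # Each task is an image plus a command; Airflow manages scheduling,
    # retries, and state, while the business logic lives in the image.
    # The namespace and image below are illustrative, not real defaults.
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import (
        KubernetesPodOperator,
    )

    dag = DAG(
        dag_id="containerized_etl",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
    )

    extract = KubernetesPodOperator(
        task_id="extract",
        name="extract",
        namespace="airflow",            # illustrative
        image="example.com/etl:1.2.3",  # illustrative
        cmds=["python", "-m", "etl.extract"],
        dag=dag,
    )

The orchestration layer stays Python while the work itself is an opaque,
versioned image - which is essentially the Argo model behind Airflow's DSL.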
On Tue, Feb 26, 2019 at 8:55 PM Maxime Beauchemin
<maximebeauche...@gmail.com> wrote:

> Related thoughts:
>
> * On the topic of serialization, let's be clear whether we're talking
> about unidirectional serialization and *not* deserialization back to the
> object. This works for making the web server stateless, but isn't a
> solution for how DAG definitions get shipped around the cluster (which
> would be nice to have from a system standpoint, but we'd have to break
> lots of dynamic features: things like callbacks and attaching complex
> objects to DAGs, ...).
>
> * Docker as "serialization" is interesting; I looked into the "pex"
> format in the past. It's pretty cool to think of DAGs as micro Docker
> applications that get shipped around and executed. The challenge is that
> it makes it hard to control Airflow's core: upgrading Airflow becomes
> [also] about upgrading the DAG Docker images. We had similar concerns
> with "pex". The data platform team loses their handle on the core, or
> has to get into the Docker-building business, which is atypical. For an
> upgrade, you'd have to ask/force the people who own the DAG Docker
> images to upgrade them, or else they won't run, or something. The
> contract could be something like "we'll only run your Airflow-docker-dag
> container if it's in a certain version range". I think it's a cool idea.
> It gets intricate for the stateless web server though; it's a bit of a
> mind bender :) You could ask the container to render the page (isn't
> that crazy?!), or ask the container for a serialized version of the DAG
> that allows you to render the page (similar to point 1).
>
> * About storing in the db: for efficiency, the primary key should be the
> SHA of the deterministically serialized DAG. Only store a new entry if
> the DAG has changed, and stamp the DagRun with a foreign key into that
> serialized-DAG table. If people have shape-shifting DAGs within DagRuns
> we just do best effort and show them the last one, or something like
> that.
>
> * Everyone hates pickles (including me), but it really almost works; it
> might be worth revisiting, or at least I think it's good for me to list
> out the blockers:
>   * Jinja Template objects are not serializable for some odd, obscure
> reason. I think the community can solve that easily; if someone wants a
> full brain dump on this I can share what I know.
>   * Size: as you pickle something, someone might have attached things
> that recurse into a hundreds-of-GB-sized pickle, e.g. some
> on_failure_callback may bring in the whole Slack API library. That can
> be solved or mitigated in different ways. At some point I thought I'd
> add a DAG.validate() method that makes sure the DAG can be pickled and
> serializes to a reasonably sized pickle. I also think we'd have to make
> sure operators are defined as more "abstract", otherwise the pickle
> includes things like the whole pyhive lib and all sorts of other deps.
> It could be possible to limit what gets attached to the pickle
> (whitelist classes), and dehydrate objects during serialization /
> rehydrate them on the other side (assuming the classes are on the worker
> too). If that sounds crazy to you, it's because it is.
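The DAG.validate() method floated in that last bullet is hypothetical (it
doesn't exist in Airflow); a minimal sketch of the pickle-size check it
describes could look like this, with the size budget and helper name made
up for illustration:

    # Hypothetical sketch of the proposed DAG.validate(): check that a DAG
    # survives pickling and stays under a size budget. MAX_PICKLE_BYTES and
    # validate_dag are illustrative names, not Airflow APIs.
    import pickle

    MAX_PICKLE_BYTES = 1 * 1024 * 1024  # arbitrary 1 MiB budget

    def validate_dag(dag):
        try:
            blob = pickle.dumps(dag, protocol=pickle.HIGHEST_PROTOCOL)
        except Exception as exc:  # e.g. an unpicklable Jinja Template
            raise ValueError(f"DAG {dag.dag_id} is not picklable: {exc}")
        if len(blob) > MAX_PICKLE_BYTES:
            raise ValueError(
                f"DAG {dag.dag_id} pickles to {len(blob)} bytes; something "
                f"large (a callback's closure, a client library, ...) is "
                f"probably attached"
            )
        return blob

Run at DAG-definition time, a check like this would surface both blockers
listed above (unpicklable attachments and runaway size) before anything
ships to the workers.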
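Likewise, the store-by-SHA bullet above can be sketched in a few lines,
assuming JSON as the deterministic unidirectional serialization and a
made-up serialized_dag table (SQLite syntax for brevity):

    # Sketch of "primary key = SHA of the deterministic serialization":
    # serialize with sorted keys so identical DAGs hash identically, and
    # only write a row when the hash is new. Table/column names are made up.
    import hashlib
    import json

    def serialize_dag(dag_dict):
        # sort_keys + fixed separators make the output deterministic
        blob = json.dumps(dag_dict, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(blob.encode("utf-8")).hexdigest(), blob

    def store_serialized_dag(conn, dag_dict):
        sha, blob = serialize_dag(dag_dict)
        # No-op if this exact version is already stored; a DagRun would
        # then carry `sha` as the foreign key to the version it ran with.
        conn.execute(
            "INSERT OR IGNORE INTO serialized_dag (sha, blob) VALUES (?, ?)",
            (sha, blob),
        )
        return sha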
> * The other crazy idea is thinking of the git repo (the code itself) as
> the serialized DAG. There are git filesystems in userspace (FUSE) that
> allow dynamically accessing the git history as if it were just a folder,
> as in `REPO/{ANY_GIT_REF}/dags/mydag.py`. Beautifully hacky. A company
> with a blue logo with a big F on it that I used to work at did that.
> Talk about embracing config-as-code! The DagRun can just stamp the git
> SHA it's running with. [A sketch of this idea follows after the thread.]
>
> Sorry about the confusion; config as code gets tricky around the
> corners. But it's all worth it, right? Right!? :)
>
> On Tue, Feb 26, 2019 at 3:09 AM Kevin Yang <yrql...@gmail.com> wrote:
>
> > My bad, I was misunderstanding a bit and mixing up two issues. I was
> > thinking about the multiple-runs-for-one-DagRun issue (e.g. after we
> > clear the DagRun).
> >
> > This is an orthogonal issue, so the current implementation can work
> > with the long-term plan.
> >
> > Cheers,
> > Kevin Y
> >
> > On Tue, Feb 26, 2019 at 2:34 AM Ash Berlin-Taylor <a...@apache.org>
> > wrote:
> >
> > > > On 26 Feb 2019, at 09:37, Kevin Yang <yrql...@gmail.com> wrote:
> > > >
> > > > Now since we're already trying to have multiple graphs for one
> > > > execution_date, maybe we should just have multiple DagRuns.
> > >
> > > I thought that there is exactly one graph per DAG run - dag_run has
> > > a "graph_id" column.
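A minimal sketch of the git-ref idea from Maxime's last bullet, using
plain `git show` instead of a FUSE mount (the repo path, ref, and helper
name are illustrative):

    # "The git repo is the serialized DAG": fetch dags/mydag.py exactly as
    # it existed at the SHA stamped on a DagRun. `git show REF:path` prints
    # the file at that commit, i.e. what REPO/{ANY_GIT_REF}/dags/mydag.py
    # would expose on a git FUSE filesystem.
    import subprocess

    def dag_source_at_ref(repo_dir, git_ref, rel_path="dags/mydag.py"):
        return subprocess.check_output(
            ["git", "-C", repo_dir, "show", f"{git_ref}:{rel_path}"],
            text=True,
        )

    # e.g. reconstruct the code a given run executed with:
    # source = dag_source_at_ref("/srv/dag-repo", dag_run_git_sha)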