Agreed, I wasn't trying to inflate the scope of the AIP, just raising related topics to see how it all fits together.
Max

On Wed, Feb 27, 2019 at 1:17 PM Bolke de Bruin <bdbr...@gmail.com> wrote:

> I agree with Fokko here. We have been discussing serialisation for a veerrryyy long time and nothing has come of it :-). Probably because we are making it too big.
>
> As Fokko states, there are several goals and each of them probably warrants an AIP:
>
> 1) Make the webserver stateless: needs the graph of the *current* dag
> 2) Version dags: for consistency mainly, and not requiring parsing of the dag on every loop
> 3) Make the scheduler not require DAG files. This could be done if the edges contain all the information needed to trigger the next task. We can then have event-driven dag parsing outside of the scheduler loop, i.e. by the CLI. Storage can also be somewhere else (git, artifactory, filesystem, whatever).
> 4) Fully serialise the dag so it becomes transferable to workers
>
> 1-3 are related, 4 isn't.
>
> It would be awesome if we could tackle 1 first, which I think the PR is doing as a first iteration. It would take us a long way towards the others.
>
> B.
>
> On 27 February 2019 at 22:00:59, Driesprong, Fokko (fo...@driesprong.frl) wrote:
>
> I feel we're going a bit off topic here. Although it is good to discuss the possibilities.
>
> From my perspective the AIP tries to kill two birds with one stone:
>
> 1. Decoupling the web-server from the actual Python DAG files
> 2. Versioning the DAGs so we can have a historical view of the dags as they are executed. For example, if you now deploy a new dag and you rename an operator, the old name will disappear from the tree view, and you will get a new row which has no status (until you trigger a run).
>
> They are related but need different solutions. Personally, I'm most interested in the first so we can make the webserver (finally) stateless. The PR tries to remove the need for having the actual Python for the different views. Some of them are trivial, such as the graph and tree view. Some of them are more tricky, such as the task instance details and the rendered view, because they pull a lot of information from the DAG object and involve Jinja templating.
>
> The main goal of the PR is to store this kind of metadata when the scheduler kicks off the dag. So when a new DagRun is created, the latest version of the dag is loaded from the actual Python file on disk by the scheduler. The file is executed and the DAG is persisted into the database in a structured way. By moving this into the database, we can eventually decouple the webserver from the actual dag files, which greatly simplifies deployment and removes the state, since this is now in the (ACID-compliant) database. Furthermore, instead of whitelisting certain classes, we control the serialization ourselves by knowing what we push to the database, instead of having to push a pickled blob. Besides that, from a performance perspective, as Max already pointed out, pickling can potentially be expensive in terms of memory and CPU.
>
> > Since it's pretty clear we need SimpleDAG serialization, and we can see
> > through the requirements, people can pretty much get started on this.
>
> I'm afraid if we go this route, then the SimpleDag will be the actual Dag in the end (but then a slightly smaller and serializable version of it). My preference would be to simplify the DAG object and get rid of the BaseDag and SimpleDag to simplify the object hierarchy.
>
> Cheers, Fokko
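To make Fokko's "persisted into the database in a structured way" concrete, here is a minimal, hypothetical sketch (not the actual PR's schema), assuming only the usual DAG/operator attributes (dag_id, schedule_interval, tasks, task_id, downstream_task_ids):

```python
# Hypothetical sketch only: reduce a DAG to plain, JSON-serializable data so the
# webserver can render graph/tree views without importing the DAG file.
import json

def serialize_dag(dag):
    """Return a JSON-friendly dict describing the DAG's topology and metadata."""
    return {
        "dag_id": dag.dag_id,
        "schedule_interval": str(dag.schedule_interval),
        "tasks": [
            {
                "task_id": task.task_id,
                "operator": type(task).__name__,
                "downstream": sorted(task.downstream_task_ids),
            }
            for task in dag.tasks
        ],
    }

# The scheduler would call json.dumps(serialize_dag(dag)) when it creates a
# DagRun and store the result in a database column; the views then read that
# row instead of executing the Python DAG file.
```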
> On Wed, 27 Feb 2019 at 21:23, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
>
> > I fully agree on all your points Dan.
> >
> > First, about PEX vs Docker: Docker is a clear choice here. It's a superset of what PEX can do (PEX is limited to the Python env) and a great standard that has awesome tooling around it, and it works natively in k8s, which is becoming the preferred executor. PEX is a bit of a hack and has failed to become a standard. I don't think anyone would argue for PEX over Docker in almost any context where this question would show up nowadays (beyond Airflow too).
> >
> > And pickles have a lot of disadvantages over docker:
> > * some objects are not picklable (JinjaTemplates!)
> > * lack of visibility / tooling on building them; you might make one call and get a 500mb pickle, and have no clue why it's so big
> > * unlike docker, they are impossible to introspect (as far as I know), you have to intercept the __deepcopy__ method, good luck!
> > * pickles require a very similar environment to be rehydrated (same GCC version, same modules/versions available?)
> > * weird side effects and unknown boundaries. If I pickled a DAG on an older version of Airflow and code logic got included, and restore it on a newer one, could the new host be oddly downgraded as a result? When restored, did it affect the whole environment / other DAGs?
> >
> > My vote goes towards something like SimpleDAG serialization (for the web server and similar use cases) + docker images built on top of a lightweight SDK as the way to go.
> >
> > Since it's pretty clear we need SimpleDAG serialization, and we can see through the requirements, people can pretty much get started on this.
> >
> > The question of how to bring native Docker support to Airflow is a bit more complex. I think the k8s executor has support for custom DAG containers but idk how it works as I haven't looked deeply into this recently. Docker support in Airflow is a tricky and important question, and it offers an opportunity to provide solutions to long-standing issues around versioning, test/dev user workflows, and more.
> >
> > Related docker questions:
> > * what's the docker interface contract? what entry points have to exist in the image?
> > * does airflow provide tooling to help bake images that enforce that contract?
> > * do we need docker tag semantics in the DSL? does it look something like `DAG(id='mydag', docker_tag='hub.docker.com://org/repo/tag', ...)`
> > * is docker optional or required? (probably optional at first)
> > * should docker support be k8s-executor centric? work the same across executors? are we running docker-in-docker as a result / penalty in k8s?
> >
> > Max
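Max's second pickle bullet above, the pickle that balloons for no obvious reason, is easy to reproduce outside Airflow. A small self-contained sketch (all class names invented for illustration): attaching a callback object that happens to hold a lot of state drags that state into the pickle.

```python
# Standalone illustration of the "500mb pickle and no clue why" problem: the
# callback object attached to the task carries ~10 MB of incidental state, and
# pickling the task pulls all of it in.
import pickle


class SlackAlerter:
    """Stand-in for an alerting helper someone attaches as a failure callback."""
    def __init__(self):
        self.cache = "x" * 10_000_000  # ~10 MB of state the author never meant to ship
    def __call__(self, context):
        pass  # would post to Slack in real life


class Task:
    """Toy stand-in for an Airflow operator."""
    def __init__(self, task_id, on_failure_callback=None):
        self.task_id = task_id
        self.on_failure_callback = on_failure_callback


plain = Task("plain")
heavy = Task("alerts_on_failure", on_failure_callback=SlackAlerter())

print(len(pickle.dumps(plain)))  # a few hundred bytes
print(len(pickle.dumps(heavy)))  # ~10 MB, with nothing in the DAG file hinting at why
```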
> > On Wed, Feb 27, 2019 at 11:38 AM Dan Davydov <ddavy...@twitter.com.invalid> wrote:
> >
> > > > * on the topic of serialization, let's be clear whether we're talking about unidirectional serialization and *not* deserialization back to the object. This works for making the web server stateless, but isn't a solution around how DAG definitions get shipped around on the cluster (which would be nice to have from a system standpoint, but we'd have to break lots of dynamic features, things like callbacks and attaching complex objects to DAGs, ...)
> > >
> > > I feel these dynamic features are not worth the tradeoffs, and in most cases have alternatives, e.g. on_failure_callback can be replaced by a task with a ONE_FAILURE trigger rule, which gives the additional advantages that first-class Airflow tasks have, like retries. That being said, we should definitely do our due diligence weighing the trade-offs and coming up with alternatives for any feature we disable (jinja templating related to webserver rendering, callbacks, etc). I remember speaking to Alex about this and he agreed that the consistency/auditing/isolation guarantees were worth losing some features; I think Paul did as well. Certainly we will need to have a discussion/vote with the rest of the committers.
> > >
> > > My initial thinking is that both the DAG topology serialization (i.e. generating and storing a SimpleDag in the DB for each DAG) and linking each DAG with a pex/docker image/etc as well as authentication tokens should happen in the same place: probably the client runs some command that generates the SimpleDag as well as a container, and then sends it to some Airflow Service that stores all of this information appropriately. Then Scheduler/Webserver/Worker consume the stored SimpleDags, and Workers consume the containers in addition.
> > >
> > > > * docker as "serialization" is interesting, I looked into "pex" format in the past. It's pretty cool to think of DAGs as micro docker applications that get shipped around and executed. The challenge with this is that it makes it hard to control Airflow's core. Upgrading Airflow becomes [also] about upgrading the DAG docker images. We had similar concerns with "pex". The data platform team loses their handle on the core, or has to get in the docker building business, which is atypical. For an upgrade, you'd have to ask/force the people who own the DAG dockers to upgrade their images,
> > >
> > > The container vs Airflow versioning problem I believe is just an API versioning problem (i.e. you don't necessarily have to rebuild all containers when you bump the version of Airflow, as long as the API is backwards compatible). I think this is reasonable for a platform like Airflow, and I'm not sure there is a great way to avoid it if we want other nice system guarantees (e.g. reproducibility).
> > >
> > > > Contract could be like "we'll only run your Airflow-docker-dag container if it's in a certain version range" or something like that. I think it's a cool idea. It gets intricate for the stateless web server though, it's a bit of a mind bender :) You could ask the docker to render the page (isn't that crazy?!) or ask the docker for a serialized version of the DAG that allows you to render the page (similar to point 1).
> > >
> > > If the webserver uses the SimpleDag representation that is generated at the time of DAG creation, then you can avoid having Docker needing to provide this serialized version, i.e. you push the responsibility to the client to have the right dependencies in order to build the DAG, which I feel is good. One tricky thing I can think of is if you have special UI elements related to the operator type of a task (I saw a PR out for this recently), you would need to solve the API versioning problem separately for this as well (i.e. make sure the serialized DAG representation works with the version of the newest operator UI).
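As a rough sketch of the callback-replacement alternative Dan describes at the top of his reply (what he calls a ONE_FAILURE trigger rule corresponds to Airflow's TriggerRule.ONE_FAILED), the alerting logic becomes an ordinary task with retries, logs, and UI visibility. The dag id, task ids, and notify function below are made up, and the import paths assume the 1.10-era layout:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule


def notify(**context):
    # Placeholder for whatever the on_failure_callback used to do (Slack, email, ...)
    print("Upstream failure in dag %s" % context["dag"].dag_id)


with DAG(dag_id="example_dag",
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    work = PythonOperator(task_id="work", python_callable=lambda: None)

    alert = PythonOperator(
        task_id="alert_on_failure",
        python_callable=notify,
        provide_context=True,
        trigger_rule=TriggerRule.ONE_FAILED,  # fires as soon as any upstream task fails
        retries=2,                            # a first-class task gets retries, logs, UI state
    )

    work >> alert
```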
> > > > * About storing in the db, for efficiency, the pk should be the SHA of the deterministic serialized DAG. Only store a new entry if the DAG has changed, and stamp the DagRun with a FK to that serialized DAG table. If people have shapeshifting DAGs within DagRuns we just do best effort, show them the last one or something like that
> > >
> > > If we link each dagrun to its "container" and "serialized representation", then the web UI can actually iterate through each dagrun and even render changes in topology. I think at least for v1 we can just use the current solution as you mentioned (best effort using the latest version).
> > >
> > > > * everyone hates pickles (including me), but it really almost works, might be worth revisiting, or at least I think it's good for me to list out the blockers:
> > > >   * JinjaTemplate objects are not serializable for some odd obscure reason, I think the community can solve that easily, if someone wants a full brain dump on this I can share what I know
> > >
> > > What was the preference for using Pickle over Docker/PEX for serialization? I think we discussed this a long time ago with Paul but I forget the rationale, and it would be good to have the information shared publicly too. One big problem is you don't get isolation at the binary dependency level, i.e. .so/.dll dependencies, along with all of the other problems you listed.
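Max's "SHA as primary key" point quoted above is essentially content-addressed storage of the serialized DAG. A small self-contained sketch of that idea (the in-memory dict stands in for a real table; none of these names come from the PR):

```python
# Illustrative only: store each distinct serialized DAG once, keyed by the SHA
# of its deterministic serialization; every DagRun then points at the version
# it ran with.
import hashlib
import json

serialized_dag_table = {}  # stand-in for a DB table keyed by the SHA


def dag_fingerprint(serialized_dag):
    # Deterministic serialization: sorted keys + fixed separators, so the same
    # topology always produces the same hash / primary key.
    canonical = json.dumps(serialized_dag, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_dag_version(serialized_dag):
    """Store a new row only when the DAG actually changed; return the pk."""
    sha = dag_fingerprint(serialized_dag)
    serialized_dag_table.setdefault(sha, serialized_dag)
    return sha  # a DagRun row would carry this value as its foreign key
```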
> > > On Tue, Feb 26, 2019 at 8:55 PM Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
> > >
> > > > Related thoughts:
> > > >
> > > > * on the topic of serialization, let's be clear whether we're talking about unidirectional serialization and *not* deserialization back to the object. This works for making the web server stateless, but isn't a solution around how DAG definitions get shipped around on the cluster (which would be nice to have from a system standpoint, but we'd have to break lots of dynamic features, things like callbacks and attaching complex objects to DAGs, ...)
> > > >
> > > > * docker as "serialization" is interesting, I looked into "pex" format in the past. It's pretty cool to think of DAGs as micro docker applications that get shipped around and executed. The challenge with this is that it makes it hard to control Airflow's core. Upgrading Airflow becomes [also] about upgrading the DAG docker images. We had similar concerns with "pex". The data platform team loses their handle on the core, or has to get in the docker building business, which is atypical. For an upgrade, you'd have to ask/force the people who own the DAG dockers to upgrade their images, else they won't run or something. Contract could be like "we'll only run your Airflow-docker-dag container if it's in a certain version range" or something like that. I think it's a cool idea. It gets intricate for the stateless web server though, it's a bit of a mind bender :) You could ask the docker to render the page (isn't that crazy?!) or ask the docker for a serialized version of the DAG that allows you to render the page (similar to point 1).
> > > >
> > > > * About storing in the db, for efficiency, the pk should be the SHA of the deterministic serialized DAG. Only store a new entry if the DAG has changed, and stamp the DagRun with a FK to that serialized DAG table. If people have shapeshifting DAGs within DagRuns we just do best effort, show them the last one or something like that
> > > >
> > > > * everyone hates pickles (including me), but it really almost works, might be worth revisiting, or at least I think it's good for me to list out the blockers:
> > > >   * JinjaTemplate objects are not serializable for some odd obscure reason, I think the community can solve that easily, if someone wants a full brain dump on this I can share what I know
> > > >   * Size: as you pickle something, someone might have attached things that recurse into a hundreds-of-GBs-size pickle. Like some on_failure_callback may bring in the whole Slack API library. That can be solved or mitigated in different ways. At some point I thought I'd have a DAG.validate() method that makes sure that the DAG can be pickled, and serialized to a reasonable-size pickle. I also think we'd have to make sure operators are defined as more "abstract", otherwise the pickle includes things like the whole pyhive lib and all sorts of other deps. It could be possible to limit what gets attached to the pickle (whitelist classes), and dehydrate objects during serialization / rehydrate them on the other side (assuming classes are on the worker too). If that sounds crazy to you, it's because it is.
> > > >
> > > > * the other crazy idea is thinking of the git repo (the code itself) as the serialized DAG. There are git filesystems in userspace [FUSE] that allow dynamically accessing the git history like it's just a folder, as in `REPO/{ANY_GIT_REF}/dags/mydag.py`. Beautifully hacky. A company with a blue logo with a big F on it that I used to work at did that. Talking about embracing config-as-code! The DagRun can just stamp the git SHA it's running with.
> > > >
> > > > Sorry about the confusion, config as code gets tricky around the corners. But it's all worth it, right? Right!? :)
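The DAG.validate() idea in the "Size" bullet above could be as simple as the following sketch. The method name comes from the email; the size threshold and error handling are invented for illustration:

```python
# Hedged sketch of a pre-flight check in the spirit of the proposed
# DAG.validate(): refuse DAGs that cannot be pickled at all, or whose pickle is
# suspiciously large because of attached callbacks/objects.
import pickle

MAX_PICKLE_BYTES = 5 * 1024 * 1024  # arbitrary limit for this sketch


def validate_dag_pickle(dag, max_bytes=MAX_PICKLE_BYTES):
    try:
        blob = pickle.dumps(dag)
    except Exception as exc:  # e.g. attached Jinja Template objects blow up here
        raise ValueError("DAG %s is not picklable: %s" % (dag.dag_id, exc))
    if len(blob) > max_bytes:
        raise ValueError(
            "Pickled DAG %s is %d bytes; something attached to it (callbacks, "
            "clients, big params) is dragging in far more than the topology"
            % (dag.dag_id, len(blob)))
    return len(blob)
```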
> > > > On Tue, Feb 26, 2019 at 3:09 AM Kevin Yang <yrql...@gmail.com> wrote:
> > > >
> > > > > My bad, I was misunderstanding a bit and mixing up two issues. I was thinking about the multiple runs for one DagRun issue (e.g. after we clear the DagRun).
> > > > >
> > > > > This is an orthogonal issue. So the current implementation can work in the long-term plan.
> > > > >
> > > > > Cheers,
> > > > > Kevin Y
> > > > >
> > > > > On Tue, Feb 26, 2019 at 2:34 AM Ash Berlin-Taylor <a...@apache.org> wrote:
> > > > >
> > > > > > On 26 Feb 2019, at 09:37, Kevin Yang <yrql...@gmail.com> wrote:
> > > > > >
> > > > > > > Now since we're already trying to have multiple graphs for one execution_date, maybe we should just have multiple DagRun.
> > > > > >
> > > > > > I thought that there is exactly 1 graph for a DAG run - dag_run has a "graph_id" column