Re: [DISCUSS] AIP-12 Persist DAG into DB

Kevin Yang Fri, 08 Mar 2019 03:39:05 -0800

Hi all,
When I was preparing some work related to this AIP I found something very
concerning. I noticed that this JIRA ticket
<https://issues.apache.org/jira/browse/AIRFLOW-3562> is trying to remove
the webserver's dependency on dagbag, which is awesome: it is something we
badly wanted but never got to start work on. However, when I looked at some
of its subtasks, which try to remove the dagbag dependency from each
endpoint, I found that the way the dependency is being removed is not ideal.
For example, this PR <https://github.com/apache/airflow/pull/4867/files>
will require us to parse the DAG file each time we hit the endpoint.

If we go down this path, we can indeed get rid of the dagbag dependency
easily, but we will have to 1. increase the DB load (not too concerning at
the moment), and 2. wait for the DAG file to be parsed before getting the
page back, potentially multiple times. A DAG file can sometimes take quite a
while to parse, e.g. we have some framework DAG files that generate a large
number of DAGs from static config files or even Jupyter notebooks, and they
can take 30+ seconds to parse. Yes, we don't like large DAG files, but people
do see the beauty of code as config and sometimes heavily abuse/leverage it.
Even assuming all users have the same nice small Python file that can be
parsed fast, I'm still a bit worried about this approach. Continuing on this
path means we've chosen DagModel to be the serialized representation of a
DAG, with DB columns holding its different properties. That can be one
candidate, but I don't know if we should settle on it now. I would
personally prefer a more compact representation, e.g. JSON5, that is easy to
scale (such that serializing new fields != DB upgrade).
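
To make the trade-off concrete, here is a minimal sketch (not Airflow's
actual schema) of storing a DAG as one serialized blob instead of one DB
column per property. The table name `serialized_dag`, the helper names, and
the field names are all hypothetical, and plain JSON from the standard
library stands in for JSON5.

```python
import json
import sqlite3


def create_store(conn):
    # One row per DAG; every DAG property lives inside the JSON blob.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS serialized_dag ("
        "dag_id TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )


def save_dag(conn, dag_id, properties):
    # Serialize once, at parse time, outside the webserver request path.
    conn.execute(
        "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
        (dag_id, json.dumps(properties, sort_keys=True)),
    )


def load_dag(conn, dag_id):
    # A webserver endpoint would read the blob instead of parsing the file.
    row = conn.execute(
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")
create_store(conn)

# Hypothetical DAG properties.
first = {"task_ids": ["a", "b"], "edges": [["a", "b"]]}
save_dag(conn, "example", first)

# Serializing a new field later needs no schema change (no DB upgrade).
second = dict(first, owner="kevin")
save_dag(conn, "example", second)
```

With the column-per-property approach, the added `owner` field would have
required a migration; here only the blob contents change.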

As I imagine it, we would have to collect the list of dynamic features that
depend on unserializable fields of a DAG and start a discussion/vote on
dropping support for them (I'm working on this, but if anyone has already
done so please take over), decide on the serialized representation of a DAG,
and then replace dagbag with it in the webserver. Based on the previous
discussion and some offline discussions with Dan, one future of DAG
serialization that I like would look similar to this:
[image: airflow_new_arch.jpg]
We can still discuss/vote on which approach we want to take, but I don't want
the door to the above design to be shut right now, or we will have to spend
a lot of effort switching paths later.
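
As a rough illustration of the first step above (collecting the fields that
cannot be serialized), the sketch below checks which entries of a dict of
DAG properties fail JSON encoding. The function name and the example
properties are assumptions made for illustration, not an existing Airflow
API.

```python
import json


def unserializable_fields(properties):
    """Return the sorted names of fields that json.dumps cannot encode."""
    bad = []
    for name, value in properties.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            bad.append(name)
    return sorted(bad)


# Hypothetical DAG properties: plain data serializes, a callable does not.
example = {
    "dag_id": "example",
    "schedule": "@daily",
    "on_failure_callback": lambda context: None,
}
print(unserializable_fields(example))
```

Running a check like this over real DAGs would give a starting list of the
features to bring to a vote.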

Bas and Peter, I'm very sorry to extend the discussion, but I do think this
is tightly related to the AIP and the PRs behind it. My sincere apologies for
bringing this up so late (I only pull the open PR list occasionally; if
there's a way to subscribe to new PR events I'd love to know how).

Cheers,
Kevin Y

On Thu, Feb 28, 2019 at 1:36 PM Peter van t Hof <[email protected]>
wrote:

> Hi all,
>
> Just some comments on the points Bolke gave in relation to my PR.
>
> First, the main focus is making the webserver stateless.
>
> > 1) Make the webserver stateless: needs the graph of the *current* dag
>
> This is the main goal, but for this a lot more PRs will be coming once my
> current one is merged. The edges and graph view are already covered in my
> PR.
>
> > 2) Version dags: for consistency mainly and not requiring parsing of the
> > dag on every loop
>
> In my PR the historical graphs will be stored for each DagRun. This means
> that you can see whether an older DagRun had the same graph structure, even
> if some tasks no longer exist in the current graph. This is especially
> useful for dynamic DAGs.
>
> > 3) Make the scheduler not require DAG files. This could be done if the
> > edges contain all information when to trigger the next task. We can then
> > have event driven dag parsing outside of the scheduler loop, ie. by the
> > cli. Storage can also be somewhere else (git, artifactory, filesystem,
> > whatever).
>
> The scheduler is almost untouched in this PR. The only thing added is that
> these edges are saved to the database, but the scheduling itself didn't
> change. The scheduler still depends on the DAG object.
>
> > 4) Fully serialise the dag so it becomes transferable to workers
>
> It is nice to see that people have a lot of ideas about this. But as Fokko
> already mentioned, this is out of scope for the issue we are trying to
> solve. I also have some ideas about this, but I would like to limit this
> PR/AIP to the webserver.
>
> For now my PR solves 1 and 2, and the rest of the behaviour (like
> scheduling) is untouched.
>
> Gr,
> Peter
>
>
