Hi all,

Personally, I don't understand why people are pushing for a JSON-based DAG representation. For serializing DAGs between processes it may make some sense, if it allows us to decouple the different components from the main database; however, that still requires some (large!) architectural changes to the status quo. With respect to the database representation of DAGs, I would much rather have a table structure with well-defined fields than a serialized JSON field with a potential minefield of different fields. Switching to JSON simply moves the problem elsewhere.
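As a rough sketch of the trade-off being debated (the table and column names here are illustrative only, not Airflow's actual DagModel schema):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Option A: well-defined columns. The database knows and enforces the schema,
# and properties can be queried/filtered directly in SQL.
conn.execute(
    "CREATE TABLE dag_typed (dag_id TEXT PRIMARY KEY, "
    "schedule_interval TEXT, is_paused INTEGER)"
)
conn.execute(
    "INSERT INTO dag_typed VALUES (?, ?, ?)",
    ("example_dag", "@daily", 0),
)

# Option B: one serialized JSON column. Adding a new field needs no schema
# migration, but every reader must guess which keys may (or may not) exist.
conn.execute("CREATE TABLE dag_json (dag_id TEXT PRIMARY KEY, data TEXT)")
conn.execute(
    "INSERT INTO dag_json VALUES (?, ?)",
    ("example_dag", json.dumps({"schedule_interval": "@daily", "is_paused": False})),
)

# Option A: filter in the database itself.
unpaused = conn.execute(
    "SELECT dag_id FROM dag_typed WHERE is_paused = 0"
).fetchall()

# Option B: fetch the blob and inspect it in Python (or rely on
# DB-specific JSON functions).
row = conn.execute("SELECT data FROM dag_json").fetchone()
props = json.loads(row[0])

print(unpaused, props["is_paused"])
```

Neither approach is free: option A pays with migrations on every new field, option B pays with the "minefield of different fields" at read time.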
In the proposed PRs we (Peter, Bas and I) aim to avoid re-parsing DAG files by querying all the required information from the database. In one or two cases this may not be possible, in which case we may either have to fall back on the DAG file or add the missing information to the database. We can tackle these problems as we encounter them. The end goal is to make the webserver independent of DAG parsing.

Best regards / met vriendelijke groet,
Julian de Ruiter

From: Kevin Yang <yrql...@gmail.com>
Reply-To: "dev@airflow.apache.org" <dev@airflow.apache.org>
Date: Friday, 8 March 2019 at 12:38
To: "dev@airflow.apache.org" <dev@airflow.apache.org>
Subject: Re: [DISCUSS] AIP-12 Persist DAG into DB

Hi all,

While preparing some work related to this AIP I found something very concerning. I noticed this JIRA ticket<https://issues.apache.org/jira/browse/AIRFLOW-3562> is trying to remove the webserver's dependency on the dagbag, which is awesome: something we badly wanted but never got to start work on. However, when I looked at some of its subtasks, which remove the dagbag dependency from individual endpoints, I found the way the dependency is removed is not ideal. For example, this PR<https://github.com/apache/airflow/pull/4867/files> would require us to parse the DAG file each time we hit the endpoint. If we go down this path, we can indeed get rid of the dagbag dependency easily, but we will have to 1) increase the DB load (not too concerning at the moment), and 2) wait for the DAG file to be parsed, potentially multiple times, before getting the page back. A DAG file can sometimes take quite a while to parse; e.g. we have some framework DAG files that generate a large number of DAGs from static config files or even Jupyter notebooks, and they can take 30+ seconds to parse. Yes, we don't like large DAG files, but people do see the beauty of code as config and sometimes heavily abuse/leverage it.
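The two endpoint strategies under discussion could be sketched roughly as follows (the function names here are hypothetical stand-ins, not actual Airflow code):

```python
import time

def parse_dag_file(path):
    """Stand-in for executing a DAG file on every request.

    Real framework DAG files can take 30+ seconds; the sleep below only
    simulates that per-request cost.
    """
    time.sleep(0.01)  # simulated parse cost, paid on *every* page hit
    return {"dag_id": "example_dag", "tasks": ["extract", "load"]}

# Stand-in for metadata that the scheduler/parser already persisted.
DAG_TABLE = {
    "example_dag": {"dag_id": "example_dag", "tasks": ["extract", "load"]},
}

def fetch_dag_from_db(dag_id):
    """Stand-in for a single indexed lookup of pre-parsed DAG metadata."""
    return DAG_TABLE[dag_id]

# Endpoint strategy A: re-parse the file on each hit (the concern above).
dag_a = parse_dag_file("dags/example_dag.py")

# Endpoint strategy B: read what was parsed once and stored.
dag_b = fetch_dag_from_db("example_dag")

print(dag_a == dag_b)
```

Both return the same data; the difference is who pays the parse cost and when.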
Assuming all users have the same nice small Python files that can be parsed fast, I'm still a bit worried about this approach. Continuing down this path means we've chosen DagModel to be the serialized representation of a DAG, with DB columns holding its different properties. It can be one candidate, but I don't know if we should settle on it now. I would personally prefer a more compact and easy-to-scale representation, e.g. JSON5, such that serializing new fields doesn't require a DB upgrade. In my imagination we would collect the list of dynamic features that depend on unserializable fields of a DAG and start a discussion/vote on dropping support for them (I'm working on this, but if anyone has already done so please take over), decide on the serialized representation of a DAG, and then replace the dagbag with it in the webserver.

Per previous discussion and some offline discussions with Dan, one future of DAG serialization that I like would look similar to this:

[airflow_new_arch.jpg]

We can still discuss/vote on which approach we want to take, but I don't want the door to the above design to be shut right now, or we'll have to spend a lot of effort switching paths later. Bas and Peter, I'm very sorry to extend the discussion, but I do think this is tightly related to the AIP and the PRs behind it. And my sincere apologies for bringing this up so late (I only poll the open PR list occasionally; if there's a way to subscribe to new PR events I'd love to know how).

Cheers,
Kevin Y

On Thu, Feb 28, 2019 at 1:36 PM Peter van t Hof <pjrvant...@gmail.com<mailto:pjrvant...@gmail.com>> wrote:

Hi all,

Just some comments on the points Bolke gave in relation to my PR. First of all, the main focus is: making the webserver stateless.

> 1) Make the webserver stateless: needs the graph of the *current* dag

This is the main goal, but a lot more PRs will be coming for this once my current one is merged. For the edges and the graph view, this is already covered in my PR.
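A minimal sketch of that idea, persisting a DAG's edges so a stateless graph-view endpoint can query them instead of parsing the file (table and column names are illustrative, not the PR's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical edge table written by whichever process parses the DAG file.
conn.execute(
    "CREATE TABLE dag_edge (dag_id TEXT, upstream TEXT, downstream TEXT)"
)
conn.executemany(
    "INSERT INTO dag_edge VALUES (?, ?, ?)",
    [
        ("example", "extract", "transform"),
        ("example", "transform", "load"),
    ],
)

def graph_view_edges(dag_id):
    """What a stateless graph-view endpoint could run: pure DB query,
    no DAG file and no dagbag involved."""
    return conn.execute(
        "SELECT upstream, downstream FROM dag_edge "
        "WHERE dag_id = ? ORDER BY upstream",
        (dag_id,),
    ).fetchall()

edges = graph_view_edges("example")
print(edges)
```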
> 2) Version dags: for consistency mainly and not requiring parsing of the
> dag on every loop

In my PR the historical graphs are stored for each DagRun. This means that you can see whether an older DagRun had the same graph structure, even if some tasks no longer exist in the current graph. This is especially useful for dynamic DAGs.

> 3) Make the scheduler not require DAG files. This could be done if the
> edges contain all information when to trigger the next task. We can then
> have event driven dag parsing outside of the scheduler loop, ie. by the
> cli. Storage can also be somewhere else (git, artifactory, filesystem,
> whatever).

The scheduler is almost untouched in this PR. The only thing added is that these edges are saved to the database; the scheduling itself didn't change. The scheduler still depends on the DAG object.

> 4) Fully serialise the dag so it becomes transferable to workers

It's nice to see that people have a lot of ideas about this. But as Fokko already mentioned, this is out of scope for the issue we are trying to solve. I also have some ideas about this, but I'd like to limit this PR/AIP to the webserver.

For now my PR solves 1 and 2, and the rest of the behaviour (like scheduling) is untouched.

Gr,
Peter
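The per-DagRun versioning described under point 2 could be sketched like this (an in-memory stand-in with made-up dag/task names, not the PR's actual storage):

```python
# Hypothetical store: one edge snapshot per (dag_id, execution_date),
# so each DagRun keeps the graph structure it actually ran with.
edges_by_run = {}

def record_dag_run(dag_id, execution_date, edges):
    """Snapshot the DAG's edges at the moment the run is created."""
    edges_by_run[(dag_id, execution_date)] = set(edges)

# Run 1: the DAG still contained a "cleanup" task.
record_dag_run("example", "2019-02-27",
               [("extract", "load"), ("load", "cleanup")])

# Run 2: "cleanup" has since been removed from the DAG file.
record_dag_run("example", "2019-02-28",
               [("extract", "load")])

# The webserver can render each run's historical graph without re-parsing,
# and can see that the structure changed between the two runs.
old = edges_by_run[("example", "2019-02-27")]
new = edges_by_run[("example", "2019-02-28")]
print(old != new)
```

For dynamic DAGs this is exactly the property Peter describes: the old run's graph survives even though the current DAG no longer has that task.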