Re: [DISCUSS] AIP-12 Persist DAG into DB

Deng Xiaodong Fri, 08 Mar 2019 03:42:40 -0800

Hi Kevin,

The image you attached is not displayed properly. Could you upload it
somewhere and provide a link instead?


Thanks!

XD

On Fri, Mar 8, 2019 at 19:38 Kevin Yang <[email protected]> wrote:

> Hi all,
> When I was preparing some work related to this AIP I found something very
> concerning. I noticed this JIRA ticket
> <https://issues.apache.org/jira/browse/AIRFLOW-3562> is trying to remove
> the dependency on dagbag from the webserver, which is awesome: we badly
> wanted this but never got to start work on it. However, when I looked at
> some of its subtasks, which try to remove the dagbag dependency from each
> endpoint, I found the way we remove the dependency is not ideal. For
> example this PR <https://github.com/apache/airflow/pull/4867/files> will
> require us to parse the DAG file each time we hit the endpoint.
>
> If we go down this path, we can indeed get rid of the dagbag dependency
> easily, but we will have to 1. increase the DB load (not too concerning at
> the moment), 2. wait for the DAG file to be parsed before getting the page
> back, potentially multiple times. A DAG file can sometimes take quite a
> while to parse, e.g. we have some framework DAG files generating a large
> number of DAGs from static config files or even Jupyter notebooks, and they
> can take 30+ seconds to parse. Yes, we don't like large DAG files, but
> people do see the beauty of code as config and sometimes heavily leverage
> it. Even assuming all users have the same nice small Python file that can
> be parsed fast, I'm still a bit worried about this approach. Continuing on
> this path means we've chosen DagModel to be the serialized representation
> of a DAG, with DB columns holding the different properties--it can be one
> candidate but I don't know if we should settle on that now. I would
> personally prefer a more compact (e.g. JSON5) and easy-to-scale
> representation (such that serializing new fields != DB upgrade).
>
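As a rough illustration of the "compact representation" idea in the paragraph
above (this is not code from Airflow or from the PR under discussion), a DAG's
structure could be packed into a single JSON document and stored in one text
column, so that serializing an additional field changes the payload but not
the table schema. The function names and field names below are hypothetical,
and the standard-library `json` module stands in for JSON5.

```python
import json


def serialize_dag(dag_id, tasks, edges, **extra_fields):
    """Pack a DAG's structure into one JSON string.

    Hypothetical helper: `extra_fields` stands in for any property added
    later; it lands in the payload without needing a new DB column.
    """
    payload = {
        "dag_id": dag_id,
        "tasks": sorted(tasks),
        # JSON has no tuple type, so each edge is stored as a 2-item list.
        "edges": sorted([src, dst] for src, dst in edges),
    }
    payload.update(extra_fields)
    return json.dumps(payload, sort_keys=True)


def deserialize_dag(blob):
    """Load the JSON string back into a plain dict."""
    return json.loads(blob)


blob = serialize_dag(
    "example",
    tasks={"extract", "load"},
    edges=[("extract", "load")],
    schedule_interval="@daily",  # a "new" field: no schema change required
)
restored = deserialize_dag(blob)
```

The trade-off in this sketch is that individual properties are no longer
queryable as columns; whether that matters for the webserver is part of what
the thread is debating.
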
> As I picture it, we would have to collect the list of dynamic features
> that depend on unserializable fields of a DAG and start a discussion/vote
> on dropping support for them (I'm working on this, but if anyone has
> already done so please take over), decide on the serialized representation
> of a DAG, and then replace dagbag with it in the webserver. Per previous
> discussion and some offline discussions with Dan, one future of DAG
> serialization that I like would look similar to this:
> [image: airflow_new_arch.jpg]
> We can still discuss/vote on which approach we want to take, but I don't
> want the door to the above design to be shut right now, or we will have to
> spend a lot of effort switching paths later.
>
> Bas and Peter, I'm very sorry to extend the discussion but I do think this
> is tightly related to the AIP and the PRs behind it. And my sincere
> apologies for bringing this up so late (I only pull the open PR list
> occasionally; if there's a way to subscribe to new PR events I'd love to
> know how).
>
> Cheers,
> Kevin Y
>
> On Thu, Feb 28, 2019 at 1:36 PM Peter van t Hof <[email protected]>
> wrote:
>
>> Hi all,
>>
>> Just some comments on the points Bolke gave in relation to my PR.
>>
>> First of all, the main focus is: making the webserver stateless.
>>
>> > 1) Make the webserver stateless: needs the graph of the *current* dag
>>
>> This is the main goal, but for this a lot more PRs will be coming once my
>> current one is merged. For the edges and graph view this is already
>> covered in my PR.
>>
>> > 2) Version dags: for consistency mainly and not requiring parsing of the
>> > dag on every loop
>>
>> In my PR the historical graphs will be stored for each DagRun. This means
>> that you can see whether an older DagRun had the same graph structure,
>> even if some tasks no longer exist in the current graph. Especially for
>> dynamic DAGs this is very useful.
>>
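To make the per-DagRun graph idea above concrete, here is a minimal sketch
using an in-memory SQLite database. It is an assumption about what such
storage could look like, not the schema from the PR: the `dag_edge` table,
its columns, and the `save_edges`/`load_edges` helpers are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per edge, keyed by the run it belongs to.
conn.execute(
    "CREATE TABLE dag_edge ("
    " dag_id TEXT, run_id TEXT, from_task TEXT, to_task TEXT)"
)


def save_edges(conn, dag_id, run_id, edges):
    """Snapshot the graph structure that was in effect for one DagRun."""
    conn.executemany(
        "INSERT INTO dag_edge VALUES (?, ?, ?, ?)",
        [(dag_id, run_id, src, dst) for src, dst in edges],
    )


def load_edges(conn, dag_id, run_id):
    """Read a run's graph back without parsing the DAG file."""
    cursor = conn.execute(
        "SELECT from_task, to_task FROM dag_edge"
        " WHERE dag_id = ? AND run_id = ?"
        " ORDER BY from_task, to_task",
        (dag_id, run_id),
    )
    return cursor.fetchall()


# The DAG changed between two runs: task "b" was dropped.
save_edges(conn, "example", "run_1", [("a", "b"), ("b", "c")])
save_edges(conn, "example", "run_2", [("a", "c")])

old_graph = load_edges(conn, "example", "run_1")
new_graph = load_edges(conn, "example", "run_2")
structure_changed = old_graph != new_graph
```

Because each run keeps its own edge rows, the older run can still be rendered
with task "b" even though the current DAG no longer contains it.
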
>> > 3) Make the scheduler not require DAG files. This could be done if the
>> > edges contain all information when to trigger the next task. We can then
>> > have event driven dag parsing outside of the scheduler loop, ie. by the
>> > cli. Storage can also be somewhere else (git, artifactory, filesystem,
>> > whatever).
>>
>> The scheduler is almost untouched in this PR. The only thing that is
>> added is that these edges are saved to the database, but the scheduling
>> itself didn't change. The scheduler still depends on the DAG object.
>>
>> > 4) Fully serialise the dag so it becomes transferable to workers
>>
>> It's nice to see that people have a lot of ideas about this. But as Fokko
>> already mentioned, this is out of scope for the issue we are trying to
>> solve. I also have some ideas about this, but I'd like to limit this
>> PR/AIP to the webserver.
>>
>> For now my PR does solve 1 and 2 and the rest of the behaviour (like
>> scheduling) is untouched.
>>
>> Gr,
>> Peter
>>
>>
