Re: [DISCUSS] AIP-12 Persist DAG into DB

Kevin Yang Fri, 08 Mar 2019 17:21:34 -0800

Hi Julian, I'm definitely aligned with you guys on making the webserver
independent of DAG parsing, just the end goal to me would be to build a
complete story around serializing DAG--and move with the story in mind. I
feel like you guys may already have a list of dynamic features we need to
deprecate/change, if that is the case feel free to open the discussion on
what we do to them with DAG serialization.


Julian, Ash, Dan, on 2nd thought I do agree that if we can meet the
requirements Dan mentioned, it would be nice to have them stored in the DB.
Some combined solutions like having a column of serialized graph in the
serialized dag table can potentially meet all requirements. What format we
end up using to represent DAG between components is now less important
IMO--fine to refactor those endpoints only need DagModel to use only
DagModel, easy to do a batch replacement if we decide otherwise later. More
important is to define this source of truth for serialized DAG.

Ash, ty for the email list, I'll tune my filters accordingly :D I'm leaning
towards having a separate process for the parser so we got no scheduler
dependency etc for this parser but we can discuss this in another thread.

On Fri, Mar 8, 2019 at 8:57 AM Dan Davydov <ddavy...@twitter.com.invalid>
wrote:

> >
> > Personally I don’t understand why people are pushing for a JSON-based DAG
> > representation
>
> It sounds like you agree that DAGs should be serialized (just in the DB
> instead of JSON), so will only address why JSON is better than MySQL (AKA
> serializing at the DAG level vs the task level) as far as I can see, and
> not why we need serialization. If you zoom out and look at all the use
> cases of serialized DAGs, e.g. having the scheduler use them instead of
> parsing DAGs directly, then it becomes clear that we need all appropriate
> metadata in these DAGs, (operator params, DAG properties, etc), in which
> case it's not clear how it will fit nicely into a DB table (unless you
> wanted to do something like (parent_task_id, task_id, task_params), also
> keep in mind that we will need to store different versions of each DAG in
> the future so that we can ensure consistency in a dagrun, i.e. each task in
> a dagrun uses the same version of a DAG.
>
> I think some of our requirements should be:
> 1. The data model will lead to acceptable performance in all of its
> consumers (scheduler, webserver, workers), i.e. no n+1 access patterns (my
> biggest concern about serializing at task level as you propose vs at DAG
> level)
> 2. We can have versioning of serialized DAGs
> 3. The ability to separate DAGs into their own data store (e.g. no reliance
> on joins between the new table and the old one)
> 4. One source of truth/serialized representation for DAGs (currently we
> have SimpleDAG)
>
> If we can full-fill all of these requirements and serialize at the task
> level rather than the DAG level in the DB, then I agree that probably makes
> more sense.
>
>
> > In the proposed PR’s we (Peter, Bas and me) aim to avoid re-parsing DAG
> > files by querying all the required information from the database. In one
> or
> > two cases this may however not be possible, in which case we might either
> > have to fall back on the DAG file or add the missing information into the
> > database. We can tackle these problems as we encounter them.
>
> I think you would have the support of many of committers in removing any
> use-cases that stand in the way of full serialization, that being said if
> we need to remove features we need to do this carefully and thoughtfully,
> and ideally with proposed alternatives/work-arounds to cover the removals.
>
> The counter argument: this PR removes the need for the confusing "Refresh"
> > button from the UI, and in general you only pay the cost for the
> expensive
> > DAGs when you ask about them. (I don't know what/when we call the
> > /pickle_info endpoint of the top of my head)
>
> Probably worth splitting out into a separate thread, but I'm actually not
> sure the refresh button does anything, I think we should double check... I
> think about 2 years ago there was a commit made that made gunicorn
> webservers automatically rotate underneath flask (each one would reparse
> the DAGbag). Even if it works we should probably remove it since the
> webserver refresh interval is pretty fast, and it just causes confusion to
> users and implies that the DAGs are not refreshed automatically.
>
> Do you mean https://json5.org/ or is this a typo? That might be okay for a
> > nicer user front end, but the "canonical" version stored in the DB should
> > be something "plainer" like just JSON.
>
> I think he got this from my reply, and it was just an example, but you are
> right, I agree JSON would be better than JSON5.
>
> On Fri, Mar 8, 2019 at 8:53 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> > Comments inline.
> >
> > > On 8 Mar 2019, at 11:28, Kevin Yang <yrql...@gmail.com> wrote:
> > >
> > > Hi all,
> > > When I was preparing some work related to this AIP I found something
> > very concerning. I noticed this JIRA ticket <
> > https://issues.apache.org/jira/browse/AIRFLOW-3562> is trying to remove
> > the dependency of dagbag from webserver, which is awesome--we wanted
> badly
> > but never got to start work on. However when I looked at some subtasks of
> > it, which try to remove dagbag dependency from each endpoint, I found the
> > way we remove the dependency of dagbag is not very ideal. For example
> this
> > PR <https://github.com/apache/airflow/pull/4867/files> will require us
> to
> > parse the dag file each time we hit the endpoint.
> >
> > The counter argument: this PR removes the need for the confusing
> "Refresh"
> > button from the UI, and in general you only pay the cost for the
> expensive
> > DAGs when you ask about them. (I don't know what/when we call the
> > /pickle_info endpoint of the top of my head)
> >
> > This end point may be one to hold off on (as it can ask for multiple
> dags)
> > but there are some that def don't need a full dag bag or to even parse
> the
> > dag file, the current DAG model has enough info.
> >
> > >
> > >
> > > If we go down this path, we indeed can get rid of the dagbag dependency
> > easily, but we will have to 1. increase the DB load( not too concerning
> at
> > the moment ), 2. wait the DAG file to be parsed before getting the page
> > back, potentially multiple times. DAG file can sometimes take quite a
> while
> > to parse, e.g. we have some framework DAG files generating large number
> of
> > DAGs from some static config files or even jupyter notebooks and they can
> > take 30+ seconds to parse. Yes we don't like large DAG files but people
> do
> > see the beauty of code as config and sometimes heavily abuseleverage it.
> > Assuming all users have the same nice small python file that can be
> parsed
> > fast, I'm still a bit worried about this approach. Continuing on this
> path
> > means we've chosen DagModel to be the serialized representation of DAG
> and
> > DB columns to hold different properties--it can be one candidate but I
> > don't know if we should settle on that now. I would personally prefer a
> > more compact, e.g. JSON5, and easy to scale representation( such that
> > serializing new fields != DB upgrade).
> >
> > Do you mean https://json5.org/ or is this a typo? That might be okay for
> > a nicer user front end, but the "canonical" version stored in the DB
> should
> > be something "plainer" like just JSON.
> >
> > I'm not sure that "serializing new fields != DB upgrade" is that big of a
> > concern, as we don't add fields that often. One possible way of dealing
> > with it if we do is to have a hybrid approach - a few distinct columns,
> but
> > then a JSON blob. (and if we were only to support postgres we could just
> > use JSONb. But I think our friends at Google may object ;) )
> >
> > Adding a new column in a DB migration with a default NULL shouldn't be an
> > expensive operation, or difficult to achieve.
> >
> >
> > >
> > > In my imagination we would have to collect the list of dynamic features
> > depending on unserializable fields of a DAG and start a discussion/vote
> on
> > dropping support of them( I'm working on this but if anyone has already
> > done so please take over), decide on the serialized representation of a
> DAG
> > and then replace dagbag with it in webserver. Per previous discussion and
> > some offline discussions with Dan, one future of DAG serialization that I
> > like would look similar to this:
> > >
> >
> > > https://imgur.com/ncqqQgc
> >
> > Something I've thought about before for other things was to embed an API
> > server _into_ the scheduler - this would be useful for k8s healthchecks,
> > native Prometheus metrics without needed statsd bridge, and could have
> > endpoints to get information such as this directly.
> >
> > I was thinking it would be _in_ the scheduler process using either
> threads
> > (ick. Python's still got a GIL doesn't it?) or using async/twisted etc.
> > (not a side-car process like we have with the logs webserver for `airflow
> > worker`).
> >
> > (This is possibly an unrelated discussion, but might be worth talking
> > about?)
> >
> > > We can still discuss/vote which approach we want to take but I don't
> > want the door to above design to be shut right now or we have to spend a
> > lot effort switch path later.
> > >
> > > Bas and Peter, I'm very sorry to extend the discussion but I do think
> > this is tightly related to the AIP and PRs behind it. And my sincere
> > apology for bringing this up so late( I only pull the open PR list
> > occasionally, if there's a way to subscribe to new PR event I'd love to
> > know how).
> >
> > It's noisy, but you can subscribe to comm...@airflow.apache.org (but be
> > warned, this also includes all Jira tickets, edits of every comment on
> > github etc.).
> >
> >
> > >
> > > Cheers,
> > > Kevin Y
> > >
> > > On Thu, Feb 28, 2019 at 1:36 PM Peter van t Hof <pjrvant...@gmail.com
> > <mailto:pjrvant...@gmail.com>> wrote:
> > > Hi all,
> > >
> > > Just some comments one the point Bolke dit give in relation of my PR.
> > >
> > > At first, the main focus is: making the webserver stateless.
> > >
> > > > 1) Make the webserver stateless: needs the graph of the *current* dag
> > >
> > > This is the main goal but for this a lot more PR’s will be coming once
> > my current is merged. For edges and graph view this is covered in my PR
> > already.
> > >
> > > > 2) Version dags: for consistency mainly and not requiring parsing of
> > the
> > > > dag on every loop
> > >
> > > In my PR the historical graphs will be stored for each DagRun. This
> > means that you can see if an older DagRun was the same graph structure,
> > even if some tasks does not exists anymore in the current graph.
> Especially
> > for dynamic DAG’s this is very useful.
> > >
> > > > 3) Make the scheduler not require DAG files. This could be done if
> the
> > > > edges contain all information when to trigger the next task. We can
> > then
> > > > have event driven dag parsing outside of the scheduler loop, ie. by
> the
> > > > cli. Storage can also be somewhere else (git, artifactory,
> filesystem,
> > > > whatever).
> > >
> > > The scheduler is almost untouched in this PR. The only thing that is
> > added is that this edges are saved to the database but the scheduling
> > itself din’t change. The scheduler depends now still on the DAG object.
> > >
> > > > 4) Fully serialise the dag so it becomes transferable to workers
> > >
> > > It nice to see that people has a lot of idea’s about this. But as Fokko
> > already mentioned this is out of scope for the issue what we are trying
> to
> > solve. I also have some idea’s about this but I like to limit this PR/AIP
> > to the webserver.
> > >
> > > For now my PR does solve 1 and 2 and the rest of the behaviour (like
> > scheduling) is untouched.
> > >
> > > Gr,
> > > Peter
> > >
> >
> >
>

Re: [DISCUSS] AIP-12 Persist DAG into DB

Reply via email to