Oops, I guess I was too slow reading the PRs to vote in time 😛
Congrats Zhou, Kaxil and Ash for getting it accepted 🎉

I'm pretty excited about this long-anticipated change and thank you guys
for the great work. Though I do have a few comments.

I agree with Dan's concern that maintaining two representations would
potentially create a lot of pain in the future. If the scheduler rework
mentioned in Ash's rough long-term plan can be a more immediate action
(it doesn't matter if it is in the same PR), then I won't be so worried.

With records in this new separate table, I agree the cost of
rollback/upgrade won't be too much of a headache. We have the approaches
Ash mentioned, or we can even create another table and let the two coexist
for a while if the change is dramatic.

The added DB load seems pretty negligible. But just to be extra safe, we'll
try to test it within our cluster, where DB load is already borderline, and
report back.

About the template rendering, I think we should still try to pre-render it
and persist it into the DB. Rendered is a very frequently accessed endpoint,
and for large DAG files (we have DAG files taking multiple minutes to parse)
it would take too long to load, block the webserver workers if we got too
many Rendered endpoint requests, and create quite a bit of confusion given
that we act differently on different endpoints. In AIP-17, we pre-render the
template when creating the task instances, where we have execution_date. If
XCom is the blocker, we can maybe pre-render everything else besides the
XCom and render the XCom part on the fly. Since people can access the
Rendered endpoint before upstream tasks finish, we'll probably have to
exclude XCom from pre-rendering anyway. I would be very happy if this could
be a high-priority follow-up of the AIP.
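
To make that concrete, here is a rough Python sketch of the kind of
pre-rendering I mean (the helper name and the xcom_pull check are made up
for illustration; template_fields and Jinja are the real Airflow pieces):

    from jinja2 import Template

    def prerender_fields(task, context):
        # Render every template field up front, except anything that pulls
        # from XCom -- upstream tasks may not have run yet, so those fields
        # are left as-is and rendered on request by the Rendered endpoint.
        rendered = {}
        for field in task.template_fields:
            content = getattr(task, field)
            if isinstance(content, str) and "xcom_pull" not in content:
                rendered[field] = Template(content).render(**context)
            else:
                rendered[field] = content  # defer rendering
        return rendered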


Cheers,
Kevin Y

On Wed, Oct 23, 2019 at 3:39 PM Kaxil Naik <kaxiln...@gmail.com> wrote:

> This vote passed (although not unanimously) and I'll mark this AIP as
> accepted.
>
> *Result*:
> +1 votes: 7 (6 binding and 1 non-binding vote)
> -1 votes: 2 (2 binding and 0 non-binding votes)
>
> *+1 (binding)*:
> Kaxil Naik
> Ash Berlin-Taylor
> Jarek Potiuk
> Kamil Breguła
> Fokko Driesprong
> Sumit Maheshwari
>
> *+1 (non-binding):*
> Philippe Gagnon
>
> *-1 (binding)*:
> Dan Davydov
> Alex Guziel
>
>
>
> On Thu, Oct 17, 2019 at 5:29 PM Dan Davydov <ddavy...@twitter.com> wrote:
>
> > Not sure I'm convinced, I think my core concern of maintaining two
> > representations and not having a thought-out future plan of how the
> schema
> > will evolve/how easy this migration will be still stands. It feels like
> we
> > are trying to chase some short-term wins here. But hey it's a democracy,
> > don't let my vote block you :)!
> >
> > On Thu, Oct 17, 2019 at 10:15 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
> >
> >> Hi Dan, I understand your concern. Your +1 and suggestions are very
> >> important for us so let me try to explain in more details
> >> and see if I can convince you.
> >>
> >> Please check replies in-line
> >>
> >> I think this is the kind of core change we should make from the get-go,
> my
> >>> feeling is that we will be paying for this decision otherwise, and the
> >>> cost
> >>> will surpass the immediate value. Ideally we would fully understand the
> >>> long-term plan before deciding on this serialization format, but I
> think
> >>> at
> >>> least having a single serialized representation for now would be a
> decent
> >>> compromise for me.
> >>> This is still a blocker to me agreeing.
> >>
> >>
> >> Trying to do it all (Serialization for Webserver, Scheduler, etc.) at
> >> once would be error-prone as we would need to cover all the cases, and
> >> it is very difficult to test and review such a large chunk of work even
> >> if we have multiple people reviewing it.
> >>
> >> Doing it iteratively would allow us to roll out serialisation while
> >> keeping backwards compatibility. Keeping backwards compatibility for now
> >> is very important, at least for the Webserver, as the users would
> >> otherwise have to wait for months or at worst one more year (plus the
> >> upgrade path from 1.10.* to 2.0 is already looking quite huge).
> >>
> >> Airflow 2.0 is still going to take months and is already going to have a
> >> number of breaking changes that will make updating to it a cumbersome
> >> task:
> >>
> >> - No non-RBAC UI
> >> - No Py2
> >> - Import paths have (or in the process of being) changed from all/most
> >> contrib objects
> >> - CLI reorganization
> >> - and various others:
> >>
> https://github.com/apache/airflow/blob/master/UPDATING.md#airflow-master
> >>
> >> I am listing the above for one reason: "How long would the users have to
> >> deal with the Webserver scalability issue" before they upgrade to 2.0!
> >>
> >> However, I agree with your argument that we should ideally understand
> the
> >> long-term plan.
> >>
> >> The current serialized representation of DAG looks like the following:
> >>
> >>
> https://github.com/apache/airflow/blob/a6f37cd0da0774dff6461d3fdcb94dee9d101fda/tests/serialization/test_dag_serialization.py#L40-L93
> >>
> >> As you can see, most of the information required by SimpleTask is already
> >> present. Currently, SimpleTask is used by the Scheduler. The next phase
> >> of this AIP will be to use Serialization in the Scheduler and remove
> >> SimpleTask altogether.
> >>
> >> You said that at least having a single serialized representation, for
> >> now, would be a decent compromise for you, so please suggest changes
> >> or let us know if you think we are missing anything in the serialized
> >> blob example above.
> >>
> >> ------------------------
> >>
> >> As long as all of the changes are in the camp of "easy to migrate/undo
> in
> >>> the future" then these are not blockers for me. I'm not 100% convinced
> >>> that
> >>> they are, but you have more context so I'll defer to your judgment.
> >>
> >>
> >> Currently, as the Serialization applies only to the webserver, if
> >> something goes wrong, setting `store_serialized_dags=False` would restore
> >> the no-serialization behavior.
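> >>
> >> For reference, a minimal sketch of that rollback in airflow.cfg (assuming
> >> the option lives under [core]; worth double-checking the exact section in
> >> the backport PR):
> >>
> >>     [core]
> >>     store_serialized_dags = False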
> >>
> >> We are not touching the Scheduler for 1.10.* as it is the actual heart of
> >> Airflow, and we wouldn't feel comfortable changing that core in a
> >> minor/patch release.
> >>
> >> ------------------------
> >>
> >> I'm happy with JSON, my main point about "what if we get it wrong" is
> that
> >>> I feel like more planning work could have been put into this project vs
> >>> just a couple of rough high-level target "milestones". This
> significantly
> >>> increases the risk of the project long term in my eyes.
> >>
> >>
> >> Getting all the things right all at once would be harder than taking it
> >> piece by piece.
> >> We have discovered many things in the Webserver that we initially did not
> >> take into account, like "Operator Links".
> >> Even the Scheduler will have something similar that we don't know about
> >> now but will find out later, which would then require re-working and
> >> going back-and-forth.
> >>
> >> ------------------------
> >>
> >>> I think as long as we combine the serialized DAG representations and
> have
> >>> some confidence that the data format will be easy to switch to future
> >>> potential formats we would want (kind of hard to do without a more
> >>> detailed
> >>> future plan...), then I would happily change my vote.
> >>
> >>
> >> For storing versioned dags, this is just a rough idea and it can change:
> >>
> >> We have a last_updated_time column in the Serialized Dag table. Currently
> >> we just update that dag row when we process the Dag again. Instead, we
> >> would store a new row if the DAG has changed and it already had a DagRun
> >> with the previous version, i.e. we don't need to store it every time it
> >> changes if it hasn't run in the meantime.
> >>
> >> We can assign a unique number (version) to each row for a DAG.
> >> Sorting by the `last_updated_time` column would then give us how the DAG
> >> was updated over time.
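> >>
> >> A rough sketch of what that table could look like as a SQLAlchemy model
> >> (everything except last_updated_time is illustrative, not the actual
> >> schema):
> >>
> >>     from sqlalchemy import Column, Integer, String, Text, TIMESTAMP
> >>     from sqlalchemy.ext.declarative import declarative_base
> >>
> >>     Base = declarative_base()
> >>
> >>     class SerializedDag(Base):
> >>         __tablename__ = "serialized_dag"
> >>         dag_id = Column(String(250), primary_key=True)
> >>         version = Column(Integer, primary_key=True)  # increases per dag_id
> >>         data = Column(Text)                          # the JSON blob
> >>         last_updated_time = Column(TIMESTAMP)
> >>
> >> Querying rows for a dag_id ordered by last_updated_time (or version)
> >> would then give the DAG's history.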
> >>
> >> Regards,
> >> Kaxil
> >>
> >> On Wed, Oct 16, 2019 at 4:43 PM Dan Davydov
> <ddavy...@twitter.com.invalid>
> >> wrote:
> >>
> >>> Responses inline.
> >>>
> >>> On Wed, Oct 16, 2019 at 6:07 AM Ash Berlin-Taylor <a...@apache.org>
> >>> wrote:
> >>>
> >>> > Thanks for the feedback and discussion everyone - it's nice to know
> >>> people
> >>> > feel strongly about this. Let's make sure we build this right!
> >>> >
> >>> > As a reminder: the main goals of this AIP:
> >>> >
> >>> > - To speed up bootup/worker recycle time of the webserver
> >>> >
> >>> >   If you have a large number of DAGs or highly dynamic DAGs it is
> >>> possible
> >>> > that starting your webserver needs a Gunicorn worker timeout of 5 or
> 10
> >>> > minutes. (It is worse at start up because each of the gunicorn
> workers
> >>> > parses the dag itself, in exactly the same order, at the same time as
> >>> the
> >>> > other workers, leading to disk contention, and CPU burst.)
> >>> >
> >>> >   And just yesterday I helped two people diagnose a problem that was
> >>> > caused by slow NFS: `time airflow list_dags` took 3:35 mins to go over
> >>> > 43 dags.
> >>> >
> >>> >   At Astronomer and Google (Cloud Composer) we've seen this slow
> >>> > loading many times for our customers' installs.
> >>> >
> >>> >   And not to mention this makes the webserver snappier
> >>> >
> >>> > - To make the webserver almost "stateless" (not the best term for it:
> >>> > Diskless perhaps? Dagbag-less?).
> >>> >
> >>> >   This makes actions on the webserver quicker. We had a number of PRs
> >>> > in the 1.10 series that made views and API endpoints in the webserver
> >>> > quicker by making them only load one dag from disk. This takes it
> >>> > further and means that (as of the PR) only the Rendered tab needs to go
> >>> > to the code on disk.
> >>> > As a reminder, the "trigger_dag" endpoint would previously load all
> >>> > dags. Others still do. Multiple people have complained about this on
> >>> > Slack.
> >>> >
> >>> >   (The only remaining view that has to hit the code on disk is the
> >>> > Rendered template tab, as right now a DAG can define custom macros or
> >>> > filters, and we don't want to have to deal with serializing code. The
> >>> only
> >>> > "real" way of doing it is CloudPickle. No thank you)
> >>> >
> >>> >
> >>> > To address the points:
> >>> >
> >>> >
> >>> > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl>
> >>> wrote
> >>> > > - Are we going to extend the existing data model, to allow the
> RDBMS
> >>> to
> >>> > > optimize queries on fields that we use a lot?
> >>> >
> >>> > Yes, as needed. Not really needed for this PR yet (it doesn't change
> >>> any
> >>> > query patterns in the webserver, so they are using the existing
> tables)
> >>> >
> >>> > > - How are we going to do state evolution when we extend the JSON
> >>> model
> >>> > >
> >>> >
> >>> > The top level object we're storing has a __version field (for example
> >>> > `{"__version": 1, "dag": { ... } }`) so we can detect older versions
> >>> and
> >>> > either run a db "migration" on install, or upgrade at load time.
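> >>> >
> >>> > A minimal sketch of the load-time path, with hypothetical migration
> >>> > functions (not the actual implementation):
> >>> >
> >>> >     CURRENT_VERSION = 1
> >>> >     MIGRATIONS = {}  # e.g. {1: upgrade_v1_to_v2} once a v2 exists
> >>> >
> >>> >     def deserialize(blob):
> >>> >         # Step the blob forward one format version at a time until
> >>> >         # it matches the version this code understands.
> >>> >         version = blob["__version"]
> >>> >         while version < CURRENT_VERSION:
> >>> >             blob = MIGRATIONS[version](blob)
> >>> >             version = blob["__version"]
> >>> >         return blob["dag"]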
> >>> >
> >>> >
> >>> >
> >>> > On 15 Oct 2019, at 20:04, Dan Davydov <ddavy...@twitter.com.invalid>
> >>> > wrote:
> >>> > > Having both a SimpleDagBag representation and the JSON
> representation
> >>> > > doesn't make sense to me at the moment: *"**Quoting from Airflow
> >>> code, it
> >>> > > is “a simplified representation of a DAG that contains all
> attributes
> >>> > > required for instantiating and scheduling its associated tasks.”.
> It
> >>> does
> >>> > > not contain enough information required by the webserver.". *Why
> not
> >>> > create
> >>> > > a representation that can be used by both? This is going to be a
> big
> >>> > > headache to both understand and work with in the codebase since it
> >>> will
> >>> > be
> >>> > > another definition that we need to keep in sync.
> >>> >
> >>> > Honestly: because that makes the change bigger, harder to review and
> >>> > harder to back port. If this is a serious blocker to you agreeing we
> >>> can
> >>> > update it to use the same structure as part of this PR, but I would
> >>> rather
> >>> > it was a separate change.
> >>> >
> >>> I think this is the kind of core change we should make from the get-go,
> >>> my
> >>> feeling is that we will be paying for this decision otherwise, and the
> >>> cost
> >>> will surpass the immediate value. Ideally we would fully understand the
> >>> long-term plan before deciding on this serialization format, but I
> think
> >>> at
> >>> least having a single serialized representation for now would be a
> decent
> >>> compromise for me.
> >>>
> >>> This is still a blocker to me agreeing.
> >>>
> >>>
> >>> >
> >>> > > Not sure if fileloc/fileloc_hash is the right solution, the
> longterm
> >>> > > solution I am imagining has clients responsible for uploading DAGs
> >>> rather
> >>> > > than retrieving them from the filesystem so fileloc/fileloc_hash
> >>> wouldn't
> >>> > > even exist (dag_id would be used for indexing here).
> >>> >
> >>> > Then we can remove it later, when that code exists. For now, so much
> >>> > of the rest of the code base assumes DAGs come from files on disk.
> >>> >
> >>> >
> >>> >
> >>> > > Versioning isn't really addressed either (e.g. if a DAG topology
> >>> changes
> >>> > > with some update you want to be able to show both the old and new
> >>> ones,
> >>> > or
> >>> > > at least have a way to deal with them), there is an argument that
> >>> this is
> >>> > > acceptable since it isn't currently addressed now, but I'm worried
> >>> that
> >>> > > large schema changes should think through the long term plan a bit
> >>> more.
> >>> >
> >>> > Correct, and this is by design to make the change smaller and easier
> to
> >>> > review. Adding versioning using the existing serialization format is
> >>> not
> >>> > that much work once the ground work is here and the webserver is not
> >>> > relying on DAGs from disk. There's also a UI question of how to
> handle
> >>> this
> >>> > on the Tree view that I don't have an immediate answer for. But yes,
> >>> > versioning of dag run history was in our minds when working on this
> PR.
> >>> >
> >>> > As long as all of the changes are in the camp of "easy to
> migrate/undo
> >>> in
> >>> the future" then these are not blockers for me. I'm not 100% convinced
> >>> that
> >>> they are, but you have more context so I'll defer to your judgment.
> >>>
> >>>
> >>> > > I feel like getting this wrong is going to make it very hard to
> >>> migrate
> >>> > > things in the future, and make the codebase worse (another
> >>> representation
> >>> > > of DAGs that we need to support/understand/keep parity for). If I'm
> >>> wrong
> >>> > > about this then I would be more willing to +1 this change. This doc
> >>> is a
> >>> > > 1-2 pager and I feel like it is not thorough or deep enough and
> >>> doesn't
> >>> > > give me enough confidence that the work in the PR is going to make
> it
> >>> > > easier to complete the future milestones instead of harder.
> >>> >
> >>> > "But what if we get it wrong" is not something we can act on. So long
> >>> as
> >>> > you are happy with the choice of JSON as the serialization then since
> >>> we've
> >>> > got a version in the record we can change this relatively easily in
> the
> >>> > future - we'll know when we're dealing with an older record (we won't
> >>> have
> >>> > to guess) and can write a format migrator (either to upgrade in
> >>> memory, or
> >>> > to upgrade the record in place. And as a db "migration" or as a
> upgrade
> >>> > command)
> >>> >
> >>> I'm happy with JSON, my main point about "what if we get it wrong" is
> >>> that
> >>> I feel like more planning work could have been put into this project vs
> >>> just a couple of rough high-level target "milestones". This
> significantly
> >>> increases the risk of the project long term in my eyes.
> >>>
> >>>
> >>> >
> >>> > We've based our serialization on Uber's Piper work
> >>> > <https://eng.uber.com/managing-data-workflows-at-scale/> (we had a call
> >>> > with Alex at Uber before starting any of this work), so we've
> >>> > piggybacked off their hard work and months of running something very
> >>> > close to this format in production.
> >>>
> >>>
> >>> >
> >>> This is great to hear!
> >>>
> >>>
> >>> > Also let's not forget: right now we aren't storing versions, and we
> >>> still
> >>> > have DAGs on disk. If we do get it massively wrong (which I don't
> >>> think we
> >>> > have) we can just drop this entire table and start again. This is all
> >>> an
> >>> > internal implementation detail that is not exposed via an API. What
> >>> better
> >>> > way to see if this is right than by running it at scale in production
> >>> > across 1000s of customers (Astronomer+Cloud Composer)
> >>> >
> >>> If we have thought through future milestones and made sure that we can
> >>> easily switch to the different formats that could be required for them,
> >>> then I'm not a blocker on this front.
> >>>
> >>>
> >>> > Can you be more specific about what you are worried about here? It's
> >>> > a bit hard to put general "but what if we get it wrong" fears to rest.
> >>> >
> >>> >
> >>> >
> >>> > On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel <alex.guz...@airbnb.com
> >>> > .invalid>
> >>> > > We don't need to have the future plan implemented
> >>> > > completely but it would be nice to see more detailed notes about
> how
> >>> this
> >>> > > will play out in the future.
> >>> >
> >>> > Fair point. Our rough long-term plan:
> >>> >
> >>> > - Add versioning to the UI, likely by storing a new version linked
> >>> > from the DagRun (but only when it changes, not every time, to avoid DB
> >>> > bloat - multiple dag_run rows will point at the same serialized blob).
> >>> >
> >>> > - Rework the scheduler to use the serialized version, not the Simple*
> >>> > version, assuming we can do it and keep the scheduler performing well
> >>> > (i.e. that ser/deser time isn't significant for the scheduler loop).
> >>> >
> >>> >
> >>> > - Massively rework how the scheduler creates TaskInstances (right now
> >>> > this is done in the Dag parsing process, not the main scheduler) and
> >>> > dag runs. We have to keep an eye on scheduler performance as we do
> >>> > this. If we can pull this back into the main scheduler then this leads
> >>> > us towards the next points:
> >>> >
> >>> > - Split out the updating of serialized dag/parsing of DAGs from the
> >>> > scheduler loop so it can be a separate component/subprocess (which is
> >>> on by
> >>> > default). This gets us towards being able to submit DAG definitions
> >>> via an
> >>> > API if we wanted. (There's a lot to work out to get that working,
> >>> mostly
> >>> > around running the DAG).
> >>> >
> >>> > - Once the scheduler loop/logic is not tied to parsing the DAGs then
> >>> this
> >>> > also heads us towards being able to run multiple schedulers. (This
> >>> isn't a
> >>> > requirement on HA scheduler, but it makes them easier to "scale
> out".)
> >>> >
> >>> > Hope this answers people's questions, and thanks for the feedback
> >>> >
> >>> > Alex and Dan: Does this give enough information for you to change
> your
> >>> > vote?
> >>> >
> >>> I think as long as we combine the serialized DAG representations and
> have
> >>> some confidence that the data format will be easy to switch to future
> >>> potential formats we would want (kind of hard to do without a more
> >>> detailed
> >>> future plan...), then I would happily change my vote.
> >>>
> >>>
> >>> > -ash
> >>> >
> >>> > > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl>
> >>> > wrote:
> >>> > >
> >>> > > Big +1 from my side, looking forward to making this happen.
> >>> > >
> >>> > > Two sides that aren't completely clear to me:
> >>> > >
> >>> > >   - Are we going to extend the existing data model, to allow the
> >>> RDBMS to
> >>> > >   optimize queries on fields that we use a lot?
> >>> > >   - How are we going to do state evolution when we extend the JSON
> >>> model
> >>> > >
> >>> > > I have good confidence that we'll solve this along the way.
> >>> > >
> >>> > > Cheers, Fokko
> >>> > >
> >>> > > On Tue, Oct 15, 2019 at 21:29 Dan Davydov
> >>> > > <ddavy...@twitter.com.invalid> wrote:
> >>> > >
> >>> > >> I have been following it from the beginning as well. I understand
> >>> there
> >>> > >> would be short-term wins for some users (I don't think a huge
> >>> amount of
> >>> > >> users?), but I still feel like we are being a bit short-sighted
> >>> here and
> >>> > >> that we are creating more work for ourselves and potentially our
> >>> users
> >>> > in
> >>> > >> the future. I also feel like there will be side effects to users
> as
> >>> > well,
> >>> > >> many of which don't care about the webserver scalability, such as
> >>> bugs
> >>> > >> caused by the addition of the new webserver representation. I
> think
> >>> > without
> >>> > >> a design that is much larger in scope I wouldn't feel comfortable
> >>> moving
> >>> > >> forward with this AIP.
> >>> > >>
> >>> > >> On Tue, Oct 15, 2019 at 3:21 PM Jarek Potiuk <
> >>> jarek.pot...@polidea.com>
> >>> > >> wrote:
> >>> > >>
> >>> > >>> Hello Dan, Alex,
> >>> > >>>
> >>> > >>> I believe all the points you make are super-valid ones. But maybe
> >>> you
> >>> > are
> >>> > >>> missing the full context a bit.
> >>> > >>>
> >>> > >>> I followed the original discussion
> >>> > >>> <
> >>> > >>>
> >>> > >>
> >>> >
> >>>
> https://lists.apache.org/thread.html/a2d426f93c0f4e5f0347627308638b59ca4072fd022a42af1163e34a@%3Cdev.airflow.apache.org%3E
> >>> > >>>>
> >>> > >>> from the very beginning and took part in the initial discussions
> >>> > >>> when this topic was raised. From the discussion it is quite clear
> >>> > >>> to me that this is mostly a "tactical" approach to implement
> >>> > >>> something that is backportable to 1.10 and rather quick to
> >>> > >>> implement. This is targeted at making users happier with their 1.10
> >>> > >>> version, without the timing uncertainty and effort of migration to
> >>> > >>> 2.0. It solves the major pain point of the stability of the UI in
> >>> > >>> case there are complex DAGs for which parsing crashes the
> >>> > >>> webserver. Like in "being nice to your users".
> >>> > >>>
> >>> > >>> There will be a separate effort to do pretty much all of the
> >>> > >>> things you mentioned in 2.0, in a non-backportable way, as it
> >>> > >>> requires far too many changes in the way Airflow works internally.
> >>> > >>>
> >>> > >>> Maybe it needs some more explanation plus the long-term plan that
> >>> > >>> follows in the AIP itself, to explain it to those who have not
> >>> > >>> followed the initial discussion, but I think it's a fully justified
> >>> > >>> change.
> >>> > >>>
> >>> > >>> J.
> >>> > >>>
> >>> > >>> On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel <
> >>> alex.guz...@airbnb.com
> >>> > >>> .invalid>
> >>> > >>> wrote:
> >>> > >>>
> >>> > >>>> -1 (binding)
> >>> > >>>> Good points made by Dan. We don't need to have the future plan
> >>> > >>> implemented
> >>> > >>>> completely but it would be nice to see more detailed notes about
> >>> how
> >>> > >> this
> >>> > >>>> will play out in the future. We shouldn't walk into a system
> that
> >>> > >> causes
> >>> > >>>> more pain in the future. (I can't say for sure that it does,
> but I
> >>> > >> can't
> >>> > >>>> say that it doesn't either). I don't think the proposal is
> >>> necessarily
> >>> > >>>> wrong or bad, but I think we need some more detailed planning
> >>> around
> >>> > >>> future
> >>> > >>>> milestones.
> >>> > >>>>
> >>> > >>>> On Tue, Oct 15, 2019 at 12:04 PM Dan Davydov
> >>> > >>> <ddavy...@twitter.com.invalid
> >>> > >>>>>
> >>> > >>>> wrote:
> >>> > >>>>
> >>> > >>>>> -1 (binding), this may sound a bit FUD-y but I don't feel this
> >>> has
> >>> > >> been
> >>> > >>>>> thought through enough...
> >>> > >>>>>
> >>> > >>>>> Having both a SimpleDagBag representation and the JSON
> >>> representation
> >>> > >>>>> doesn't make sense to me at the moment: *"**Quoting from
> Airflow
> >>> > >> code,
> >>> > >>> it
> >>> > >>>>> is “a simplified representation of a DAG that contains all
> >>> attributes
> >>> > >>>>> required for instantiating and scheduling its associated
> >>> tasks.”. It
> >>> > >>> does
> >>> > >>>>> not contain enough information required by the webserver.".
> *Why
> >>> not
> >>> > >>>> create
> >>> > >>>>> a representation that can be used by both? This is going to be
> a
> >>> big
> >>> > >>>>> headache to both understand and work with in the codebase since
> >>> it
> >>> > >> will
> >>> > >>>> be
> >>> > >>>>> another definition that we need to keep in sync.
> >>> > >>>>>
> >>> > >>>>> Not sure if fileloc/fileloc_hash is the right solution, the
> >>> longterm
> >>> > >>>>> solution I am imagining has clients responsible for uploading
> >>> DAGs
> >>> > >>> rather
> >>> > >>>>> than retrieving them from the filesystem so
> fileloc/fileloc_hash
> >>> > >>> wouldn't
> >>> > >>>>> even exist (dag_id would be used for indexing here).
> >>> > >>>>>
> >>> > >>>>> Versioning isn't really addressed either (e.g. if a DAG
> topology
> >>> > >>> changes
> >>> > >>>>> with some update you want to be able to show both the old and
> new
> >>> > >> ones,
> >>> > >>>> or
> >>> > >>>>> at least have a way to deal with them), there is an argument
> that
> >>> > >> this
> >>> > >>> is
> >>> > >>>>> acceptable since it isn't currently addressed now, but I'm
> >>> worried
> >>> > >> that
> >>> > >>>>> large schema changes should think through the long term plan a
> >>> bit
> >>> > >>> more.
> >>> > >>>>>
> >>> > >>>>> I feel like getting this wrong is going to make it very hard to
> >>> > >> migrate
> >>> > >>>>> things in the future, and make the codebase worse (another
> >>> > >>> representation
> >>> > >>>>> of DAGs that we need to support/understand/keep parity for). If
> >>> I'm
> >>> > >>> wrong
> >>> > >>>>> about this then I would be more willing to +1 this change. This
> >>> doc
> >>> > >> is
> >>> > >>> a
> >>> > >>>>> 1-2 pager and I feel like it is not thorough or deep enough and
> >>> > >> doesn't
> >>> > >>>>> give me enough confidence that the work in the PR is going to
> >>> make it
> >>> > >>>>> easier to complete the future milestones instead of harder.
> >>> > >>>>>
> >>> > >>>>> On Tue, Oct 15, 2019 at 11:26 AM Kamil Breguła <
> >>> > >>>> kamil.breg...@polidea.com>
> >>> > >>>>> wrote:
> >>> > >>>>>
> >>> > >>>>>> +1 (binding)
> >>> > >>>>>>
> >>> > >>>>>> On Tue, Oct 15, 2019 at 2:57 AM Kaxil Naik <
> kaxiln...@gmail.com
> >>> >
> >>> > >>>> wrote:
> >>> > >>>>>>
> >>> > >>>>>>> Hello, Airflow community,
> >>> > >>>>>>>
> >>> > >>>>>>> This email calls for a vote to add the DAG Serialization
> >>> feature
> >>> > >> at
> >>> > >>>>>>> https://github.com/apache/airflow/pull/5743.
> >>> > >>>>>>>
> >>> > >>>>>>> *AIP*:
> >>> > >>>>>>>
> >>> > >>>>>>
> >>> > >>>>>
> >>> > >>>>
> >>> > >>>
> >>> > >>
> >>> >
> >>>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> >>> > >>>>>>>
> >>> > >>>>>>> *Previous Mailing List discussion*:
> >>> > >>>>>>>
> >>> > >>>>>>
> >>> > >>>>>
> >>> > >>>>
> >>> > >>>
> >>> > >>
> >>> >
> >>>
> https://lists.apache.org/thread.html/65d282368e0a7c19815badb8b1c6c8d72b0975ce94f601e13af44f74@%3Cdev.airflow.apache.org%3E
> >>> > >>>>>>> .
> >>> > >>>>>>>
> >>> > >>>>>>> *Authors*: Kaxil Naik, Zhou Fang, Ash Berlin-Taylor
> >>> > >>>>>>>
> >>> > >>>>>>> *Summary*:
> >>> > >>>>>>>
> >>> > >>>>>>>   - DAGs are serialized using JSON format and stored in a
> >>> > >>>>>>>   SerializedDag table
> >>> > >>>>>>>   - The Webserver now, instead of having to parse the DAG file
> >>> > >>>>>>>   again, reads the serialized DAGs in JSON, de-serializes them,
> >>> > >>>>>>>   creates the DagBag and uses it to show in the UI.
> >>> > >>>>>>>   - Instead of loading an entire DagBag when the WebServer
> >>> > >>>>>>>   starts, we only load each DAG on demand from the Serialized
> >>> > >>>>>>>   Dag table. This helps reduce Webserver startup time and
> >>> > >>>>>>>   memory. The reduction is notable when you have a large number
> >>> > >>>>>>>   of DAGs.
> >>> > >>>>>>>   - A JSON Schema has been defined and we validate the
> >>> > >>>>>>>   serialized dag before writing it to the database (see the
> >>> > >>>>>>>   sketch after this list)
> >>> > >>>>>>>
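> >>> > >>>>>>> A minimal sketch of that validation step using the jsonschema
> >>> > >>>>>>> package (DAG_SCHEMA here is illustrative; the real schema lives
> >>> > >>>>>>> in the PR):
> >>> > >>>>>>>
> >>> > >>>>>>>     import jsonschema
> >>> > >>>>>>>
> >>> > >>>>>>>     # Raises jsonschema.exceptions.ValidationError on a bad
> >>> > >>>>>>>     # blob, so an invalid serialized DAG is never written to
> >>> > >>>>>>>     # the database.
> >>> > >>>>>>>     jsonschema.validate(instance=serialized_dag,
> >>> > >>>>>>>                         schema=DAG_SCHEMA)
> >>> > >>>>>>>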
> >>> > >>>>>>> [image: image.png]
> >>> > >>>>>>>
> >>> > >>>>>>> A PR (https://github.com/apache/airflow/pull/5743) is ready
> >>> for
> >>> > >>>> review
> >>> > >>>>>>> from the committers and community.
> >>> > >>>>>>>
> >>> > >>>>>>> We also have a WIP PR (
> >>> > >> https://github.com/apache/airflow/pull/5992
> >>> > >>> )
> >>> > >>>> to
> >>> > >>>>>>> backport this feature to 1.10.* branch.
> >>> > >>>>>>>
> >>> > >>>>>>> A big thank you to Zhou and Ash for their continuous help in
> >>> > >>>> improving
> >>> > >>>>>>> this feature/PR.
> >>> > >>>>>>>
> >>> > >>>>>>> This email is formally calling for a vote to accept the AIP
> and
> >>> > >> PR.
> >>> > >>>>>> Please
> >>> > >>>>>>> note that we will update the PR / feature to fix bugs if we
> >>> find
> >>> > >>> any.
> >>> > >>>>>>>
> >>> > >>>>>>> Cheers,
> >>> > >>>>>>> Kaxil
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>>
> >>> > >>>>>>
> >>> > >>>>>
> >>> > >>>>
> >>> > >>>
> >>> > >>>
> >>> > >>> --
> >>> > >>>
> >>> > >>> Jarek Potiuk
> >>> > >>> Polidea <https://www.polidea.com/> | Principal Software Engineer
> >>> > >>>
> >>> > >>> M: +48 660 796 129 <+48660796129>
> >>> > >>> [image: Polidea] <https://www.polidea.com/>
> >>> > >>>
> >>> > >>
> >>> >
> >>> >
> >>>
> >>
>
