This vote passed (although not unanimously), so I'll mark this AIP as
accepted.

*Result*:
+1 votes: 7 (6 binding and 1 non-binding vote)
-1 votes: 2 (2 binding and 0 non-binding votes)

*+1 (binding)*:
Kaxil Naik
Ash Berlin-Taylor
Jarek Potiuk
Kamil Breguła
Fokko Driesprong
Sumit Maheshwari

*+1 (non-binding)*:
Philippe Gagnon

*-1 (binding)*:
Dan Davydov
Alex Guziel



On Thu, Oct 17, 2019 at 5:29 PM Dan Davydov <ddavy...@twitter.com> wrote:

> Not sure I'm convinced, I think my core concern of maintaining two
> representations and not having a thought-out future plan of how the schema
> will evolve/how easy this migration will be still stands. It feels like we
> are trying to chase some short-term wins here. But hey it's a democracy,
> don't let my vote block you :)!
>
> On Thu, Oct 17, 2019 at 10:15 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
>
>> Hi Dan, I understand your concern. Your +1 and suggestions are very
>> important to us, so let me try to explain in more detail
>> and see if I can convince you.
>>
>> Please see my replies inline.
>>
>>> I think this is the kind of core change we should make from the get-go,
>>> my feeling is that we will be paying for this decision otherwise, and the
>>> cost will surpass the immediate value. Ideally we would fully understand
>>> the long-term plan before deciding on this serialization format, but I
>>> think at least having a single serialized representation for now would be
>>> a decent compromise for me.
>>> This is still a blocker to me agreeing.
>>
>>
>> Trying to do it all (serialization for the webserver, scheduler, etc.) at
>> once would be error-prone, as we would need to cover all cases, and such
>> a large chunk of work is very difficult to test and review even with
>> multiple people reviewing it.
>>
>> Doing it iteratively allows us to roll out serialization while
>> keeping backwards compatibility. Keeping backwards compatibility for now
>> is very important, at least for the webserver, as users would otherwise
>> have to wait for months or, at worst, one more year (plus the upgrade
>> path from 1.10.* to 2.0 is already looking quite huge).
>>
>> Airflow 2.0 is still going to take months and is already going to have a
>> number of breaking changes that will make
>> updating to it a cumbersome task:
>>
>> - No non-RBAC UI
>> - No Py2
>> - Import paths have been (or are in the process of being) changed for
>> all/most contrib objects
>> - CLI reorganization
>> - and various others:
>> https://github.com/apache/airflow/blob/master/UPDATING.md#airflow-master
>>
>> I am listing the above for one reason: how long would users have to deal
>> with the webserver scalability issue before they upgrade to 2.0?
>>
>> However, I agree with your argument that we should ideally understand the
>> long-term plan.
>>
>> The current serialized representation of a DAG looks like the following:
>>
>> https://github.com/apache/airflow/blob/a6f37cd0da0774dff6461d3fdcb94dee9d101fda/tests/serialization/test_dag_serialization.py#L40-L93
>>
>> As you can see, most of the information required by SimpleTask is already
>> present. Currently, SimpleTask is used by the scheduler. The next phase
>> of this AIP will be to use serialization in the scheduler and remove
>> SimpleTask altogether.
>>
>> Since you said that at least having a single serialized representation
>> for now would be a decent compromise for you, how about you suggest
>> changes, or let us know if you think we are missing anything, in the
>> serialized blob example above.
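>>
>> For reference, here is a minimal sketch of the shape of that blob (the
>> field names here are illustrative assumptions; the linked
>> test_dag_serialization.py file is the actual ground truth):
>>
>>     # Illustrative sketch only -- see test_dag_serialization.py for the
>>     # real serialized form; the field names below are assumptions.
>>     serialized_dag_example = {
>>         "__version": 1,
>>         "dag": {
>>             "_dag_id": "simple_dag",
>>             "fileloc": "/path/to/simple_dag.py",
>>             "tasks": [
>>                 {
>>                     "task_id": "simple_task",
>>                     "_task_type": "BashOperator",
>>                     "retries": 1,
>>                 },
>>             ],
>>         },
>>     }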
>>
>> ------------------------
>>
>>> As long as all of the changes are in the camp of "easy to migrate/undo in
>>> the future" then these are not blockers for me. I'm not 100% convinced
>>> that they are, but you have more context so I'll defer to your judgment.
>>
>>
>> Currently, as serialization applies only to the webserver, if something
>> goes wrong, setting `store_serialized_dags = False` restores the
>> no-serialization behavior.
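>>
>> As a sketch of what that switch looks like in code (assuming the option
>> lives under [core] in airflow.cfg, as in the 1.10 backport; treat the
>> helper below as hypothetical, not the actual implementation):
>>
>>     # Hypothetical helper -- a sketch, not the actual implementation.
>>     from airflow.configuration import conf
>>
>>     def use_serialized_dags():
>>         # When False, the webserver parses DAG files from disk as before.
>>         return conf.getboolean("core", "store_serialized_dags")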
>>
>> We are not touching the scheduler for 1.10.* as it is the actual heart of
>> Airflow, and we wouldn't feel comfortable changing that core in a
>> minor/patch release.
>>
>> ------------------------
>>
>>> I'm happy with JSON, my main point about "what if we get it wrong" is that
>>> I feel like more planning work could have been put into this project vs
>>> just a couple of rough high-level target "milestones". This significantly
>>> increases the risk of the project long term in my eyes.
>>
>>
>> Getting everything right all at once would be harder than taking it
>> piece by piece.
>> We have already discovered many things in the webserver that we initially
>> did not take into account, like "Operator Links".
>> The scheduler will likely have something similar that we can't know now
>> and will only find out later, which would then require re-working and
>> going back and forth.
>>
>> ------------------------
>>
>>> I think as long as we combine the serialized DAG representations and have
>>> some confidence that the data format will be easy to switch to future
>>> potential formats we would want (kind of hard to do without a more
>>> detailed future plan...), then I would happily change my vote.
>>
>>
>> For storing versioned DAGs, this is just a rough idea and it can change:
>>
>> We have a last_updated_time column in the SerializedDag table. Currently
>> we just update that DAG's row when we process the DAG again.
>> Instead, we would store a new row if the DAG has changed
>> and it already had a DagRun with the previous version, i.e.
>> we don't need to store a new version every time it changes if it hasn't
>> run in the meantime.
>>
>> We can assign a unique number (version) to each row for a DAG.
>> Sorting by the `last_updated_time` column then shows us how the DAG
>> evolved over time.
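>>
>> A rough sketch of what such a versioned table could look like (column
>> names and types are illustrative assumptions, not the final schema from
>> the PR):
>>
>>     # Rough SQLAlchemy sketch -- columns are assumptions for illustration.
>>     from sqlalchemy import Column, DateTime, Integer, String, Text
>>     from sqlalchemy.ext.declarative import declarative_base
>>
>>     Base = declarative_base()
>>
>>     class SerializedDagSketch(Base):
>>         __tablename__ = "serialized_dag"
>>         dag_id = Column(String(250), primary_key=True)
>>         # Bumped only when the DAG changed *and* the previous version
>>         # already has a DagRun.
>>         version = Column(Integer, primary_key=True)
>>         data = Column(Text, nullable=False)  # the JSON blob
>>         last_updated_time = Column(DateTime, nullable=False)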
>>
>> Regards,
>> Kaxil
>>
>> On Wed, Oct 16, 2019 at 4:43 PM Dan Davydov <ddavy...@twitter.com.invalid>
>> wrote:
>>
>>> Responses inline.
>>>
>>> On Wed, Oct 16, 2019 at 6:07 AM Ash Berlin-Taylor <a...@apache.org>
>>> wrote:
>>>
>>> > Thanks for the feedback and discussion, everyone - it's nice to know
>>> > people feel strongly about this. Let's make sure we build this right!
>>> >
>>> > As a reminder: the main goals of this AIP:
>>> >
>>> > - To speed up bootup/worker recycle time of the webserver
>>> >
>>> >   If you have a large number of DAGs or highly dynamic DAGs, it is
>>> > possible that starting your webserver needs a Gunicorn worker timeout of
>>> > 5 or 10 minutes. (It is worse at startup because each of the gunicorn
>>> > workers parses the DAGs itself, in exactly the same order, at the same
>>> > time as the other workers, leading to disk contention and CPU burst.)
>>> >
>>> >   And just yesterday I helped two people diagnose a problem that was
>>> > caused by slow NFS: `time airflow list_dags` took 3:35 mins to go over
>>> > 43 dags.
>>> >
>>> >   At Astronomer and Google (Cloud Composer) we've seen this slow
>>> > loading many times in our customers' installs.
>>> >
>>> >   And not to mention this makes the webserver snappier
>>> >
>>> > - To make the webserver almost "stateless" (not the best term for it:
>>> > Diskless perhaps? Dagbag-less?).
>>> >
>>> >   This makes actions on the webserver quicker. We had a number of PRs
>>> > in the 1.10 series that made views and API endpoints in the webserver
>>> > quicker by making them load only one DAG from disk. This takes it
>>> > further and means that (as of the PR) only the Rendered tab needs to go
>>> > to the code on disk. As a reminder, the "trigger_dag" endpoint would
>>> > previously load all dags. Others still do. Multiple people have
>>> > complained about this on Slack.
>>> >
>>> >   (The only remaining view that has to hit the code on disk is the
>>> > Rendered template tab, as right now a DAG can define custom macros or
>>> > filters, and we don't want to have to deal with serializing code. The
>>> > only "real" way of doing it is CloudPickle. No thank you.)
>>> >
>>> >
>>> > To address the points:
>>> >
>>> >
>>> > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl>
>>> > wrote:
>>> > > - Are we going to extend the existing data model, to allow the RDBMS
>>> > > to optimize queries on fields that we use a lot?
>>> >
>>> > Yes, as needed. It's not really needed for this PR yet (it doesn't
>>> > change any query patterns in the webserver, so they use the existing
>>> > tables).
>>> >
>>> > > - How are we going to do state evolution when we extend the JSON
>>> > > model?
>>> > >
>>> >
>>> > The top-level object we're storing has a __version field (for example
>>> > `{"__version": 1, "dag": { ... } }`) so we can detect older versions
>>> > and either run a db "migration" on install, or upgrade at load time.
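>>> >
>>> > A sketch of that load-time upgrade path (the function and MIGRATIONS
>>> > table below are hypothetical, just to show the idea):
>>> >
>>> >     # Hypothetical sketch -- not the actual implementation.
>>> >     import json
>>> >
>>> >     CURRENT_VERSION = 1
>>> >     # Map each old version to a function upgrading it one step.
>>> >     MIGRATIONS = {
>>> >         0: lambda record: {"__version": 1, "dag": record},
>>> >     }
>>> >
>>> >     def deserialize_dag(blob):
>>> >         record = json.loads(blob)
>>> >         version = record.get("__version", 0)
>>> >         while version < CURRENT_VERSION:
>>> >             record = MIGRATIONS[version](record)
>>> >             version = record["__version"]
>>> >         return record["dag"]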
>>> >
>>> >
>>> >
>>> > On 15 Oct 2019, at 20:04, Dan Davydov <ddavy...@twitter.com.invalid>
>>> > wrote:
>>> > > Having both a SimpleDagBag representation and the JSON representation
>>> > > doesn't make sense to me at the moment: *"Quoting from Airflow code,
>>> > > it is “a simplified representation of a DAG that contains all
>>> > > attributes required for instantiating and scheduling its associated
>>> > > tasks.”. It does not contain enough information required by the
>>> > > webserver."* Why not create a representation that can be used by both?
>>> > > This is going to be a big headache to both understand and work with in
>>> > > the codebase since it will be another definition that we need to keep
>>> > > in sync.
>>> >
>>> > Honestly: because that makes the change bigger, harder to review, and
>>> > harder to backport. If this is a serious blocker to you agreeing, we can
>>> > update it to use the same structure as part of this PR, but I would
>>> > rather it was a separate change.
>>> >
>>> I think this is the kind of core change we should make from the get-go,
>>> my
>>> feeling is that we will be paying for this decision otherwise, and the
>>> cost
>>> will surpass the immediate value. Ideally we would fully understand the
>>> long-term plan before deciding on this serialization format, but I think
>>> at
>>> least having a single serialized representation for now would be a decent
>>> compromise for me.
>>>
>>> This is still a blocker to me agreeing.
>>>
>>>
>>> >
>>> > > Not sure if fileloc/fileloc_hash is the right solution; the long-term
>>> > > solution I am imagining has clients responsible for uploading DAGs
>>> > > rather than retrieving them from the filesystem, so
>>> > > fileloc/fileloc_hash wouldn't even exist (dag_id would be used for
>>> > > indexing here).
>>> >
>>> > Then we can remove it later, when that code exists. For now, so much of
>>> > the rest of the code base assumes DAGs come from files on disk.
>>> >
>>> >
>>> >
>>> > > Versioning isn't really addressed either (e.g. if a DAG topology
>>> > > changes with some update you want to be able to show both the old and
>>> > > new ones, or at least have a way to deal with them). There is an
>>> > > argument that this is acceptable since it isn't addressed now, but I'm
>>> > > worried that large schema changes should think through the long-term
>>> > > plan a bit more.
>>> >
>>> > Correct, and this is by design, to make the change smaller and easier
>>> > to review. Adding versioning using the existing serialization format is
>>> > not that much work once the groundwork is here and the webserver is not
>>> > relying on DAGs from disk. There's also a UI question of how to handle
>>> > this on the Tree view that I don't have an immediate answer for. But
>>> > yes, versioning of DAG run history was on our minds when working on this
>>> > PR.
>>> >
>>> As long as all of the changes are in the camp of "easy to migrate/undo in
>>> the future" then these are not blockers for me. I'm not 100% convinced
>>> that they are, but you have more context so I'll defer to your judgment.
>>>
>>>
>>> > > I feel like getting this wrong is going to make it very hard to
>>> > > migrate things in the future, and make the codebase worse (another
>>> > > representation of DAGs that we need to support/understand/keep parity
>>> > > for). If I'm wrong about this then I would be more willing to +1 this
>>> > > change. This doc is a 1-2 pager and I feel like it is not thorough or
>>> > > deep enough and doesn't give me enough confidence that the work in the
>>> > > PR is going to make it easier to complete the future milestones
>>> > > instead of harder.
>>> >
>>> > "But what if we get it wrong" is not something we can act on. So long
>>> as
>>> > you are happy with the choice of JSON as the serialization then since
>>> we've
>>> > got a version in the record we can change this relatively easily in the
>>> > future - we'll know when we're dealing with an older record (we won't
>>> have
>>> > to guess) and can write a format migrator (either to upgrade in
>>> memory, or
>>> > to upgrade the record in place. And as a db "migration" or as a upgrade
>>> > command)
>>> >
>>> I'm happy with JSON; my main point about "what if we get it wrong" is
>>> that I feel like more planning work could have been put into this project
>>> vs just a couple of rough high-level target "milestones". This
>>> significantly increases the risk of the project long term in my eyes.
>>>
>>>
>>> >
>>> > We've based our serialization on Uber's Piper work
>>> > <https://eng.uber.com/managing-data-workflows-at-scale/> (we had a call
>>> > with Alex at Uber before starting any of this work), so we've
>>> > piggybacked off their hard work and months of running something very
>>> > close to this format in production.
>>>
>>>
>>> >
>>> This is great to hear!
>>>
>>>
>>> > Also let's not forget: right now we aren't storing versions, and we
>>> > still have DAGs on disk. If we do get it massively wrong (which I don't
>>> > think we have) we can just drop this entire table and start again. This
>>> > is all an internal implementation detail that is not exposed via an API.
>>> > What better way to see if this is right than by running it at scale in
>>> > production across 1000s of customers (Astronomer + Cloud Composer)?
>>> >
>>> If we have thought through future milestones and made sure that we can
>>> easily switch to the different formats that could be required for them,
>>> then I'm not a blocker on this front.
>>>
>>>
>>> > Can you be more specific about what you are worried about here? It's a
>>> > bit hard to put general "but what if we get it wrong" fears to rest.
>>> >
>>> >
>>> >
>>> > On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel
>>> > <alex.guz...@airbnb.com.invalid> wrote:
>>> > > We don't need to have the future plan implemented completely, but it
>>> > > would be nice to see more detailed notes about how this will play out
>>> > > in the future.
>>> >
>>> > Fair point. Our rough long-term plan:
>>> >
>>> > - Add versioning to the UI, likely by storing a new version linked from
>>> > the DagRun (but only when it changes, not every time, to avoid DB bloat
>>> > - multiple dag_run rows will point at the same serialized blob).
>>> >
>>> > - Rework the scheduler to use the serialized version, not the Simple*
>>> > version, assuming we can do it and keep the scheduler performing well
>>> > (i.e. that ser/deser time isn't significant for the scheduler loop).
>>> >
>>> >
>>> > - Massively rework how the scheduler creates TaskInstances (right now
>>> > this is done in the DAG parsing process, not the main scheduler) and DAG
>>> > runs. We have to keep an eye on scheduler performance as we do this. If
>>> > we can pull this back into the main scheduler then this leads us towards
>>> > the next points:
>>> >
>>> > - Split out the updating of serialized dags/parsing of DAGs from the
>>> > scheduler loop so it can be a separate component/subprocess (which is on
>>> > by default). This gets us towards being able to submit DAG definitions
>>> > via an API if we wanted. (There's a lot to work out to get that working,
>>> > mostly around running the DAG).
>>> >
>>> > - Once the scheduler loop/logic is not tied to parsing the DAGs, this
>>> > also heads us towards being able to run multiple schedulers. (This isn't
>>> > a requirement for an HA scheduler, but it makes them easier to "scale
>>> > out".)
>>> >
>>> > Hope this answers people's questions, and thanks for the feedback
>>> >
>>> > Alex and Dan: Does this give enough information for you to change your
>>> > vote?
>>> >
>>> I think as long as we combine the serialized DAG representations and have
>>> some confidence that the data format will be easy to switch to future
>>> potential formats we would want (kind of hard to do without a more
>>> detailed future plan...), then I would happily change my vote.
>>>
>>>
>>> > -ash
>>> >
>>> > > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl>
>>> > > wrote:
>>> > >
>>> > > Big +1 from my side, looking forward to making this happen.
>>> > >
>>> > > Two sides that aren't completely clear to me:
>>> > >
>>> > >   - Are we going to extend the existing data model, to allow the
>>> > >   RDBMS to optimize queries on fields that we use a lot?
>>> > >   - How are we going to do state evolution when we extend the JSON
>>> > >   model?
>>> > >
>>> > > I have good confidence that we'll solve this along the way.
>>> > >
>>> > > Cheers, Fokko
>>> > >
>>> > > On Tue, 15 Oct 2019 at 21:29, Dan Davydov
>>> > > <ddavy...@twitter.com.invalid> wrote:
>>> > >
>>> > >> I have been following it from the beginning as well. I understand
>>> > >> there would be short-term wins for some users (I don't think a huge
>>> > >> number of users?), but I still feel like we are being a bit
>>> > >> short-sighted here and that we are creating more work for ourselves
>>> > >> and potentially our users in the future. I also feel like there will
>>> > >> be side effects for users as well, many of whom don't care about
>>> > >> webserver scalability, such as bugs caused by the addition of the new
>>> > >> webserver representation. I think without a design that is much
>>> > >> larger in scope I wouldn't feel comfortable moving forward with this
>>> > >> AIP.
>>> > >>
>>> > >> On Tue, Oct 15, 2019 at 3:21 PM Jarek Potiuk
>>> > >> <jarek.pot...@polidea.com> wrote:
>>> > >>
>>> > >>> Hello Dan, Alex,
>>> > >>>
>>> > >>> I believe all the points you make are super-valid ones. But maybe
>>> > >>> you are missing the full context a bit.
>>> > >>>
>>> > >>> I followed the original discussion
>>> > >>> <https://lists.apache.org/thread.html/a2d426f93c0f4e5f0347627308638b59ca4072fd022a42af1163e34a@%3Cdev.airflow.apache.org%3E>
>>> > >>> from the very beginning and took part in the initial discussions
>>> > >>> when this topic was raised. From the discussion it is quite clear to
>>> > >>> me that this is mostly a "tactical" approach to implement something
>>> > >>> that is backportable to 1.10 and rather quick to implement. This is
>>> > >>> targeted at making users happier with their 1.10 version, without
>>> > >>> the timing uncertainty and effort of migrating to 2.0. It solves the
>>> > >>> major pain point of stability of the UI in cases where there are
>>> > >>> complex DAGs whose parsing crashes the webserver.
>>> > >>> Like in "being nice to your users".
>>> > >>>
>>> > >>> There will be a separate effort to make pretty much all of the
>>> > >>> things you mentioned happen in 2.0, in a non-backportable way, as it
>>> > >>> requires far too many changes in how Airflow works internally.
>>> > >>>
>>> > >>> Maybe it needs some more explanation + a long-term plan in the AIP
>>> > >>> itself, to explain it to those who have not followed the initial
>>> > >>> discussion, but I think it's a fully justified change.
>>> > >>>
>>> > >>> J.
>>> > >>>
>>> > >>> On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel
>>> > >>> <alex.guz...@airbnb.com.invalid> wrote:
>>> > >>>
>>> > >>>> -1 (binding)
>>> > >>>> Good points made by Dan. We don't need to have the future plan
>>> > >>>> implemented completely, but it would be nice to see more detailed
>>> > >>>> notes about how this will play out in the future. We shouldn't walk
>>> > >>>> into a system that causes more pain in the future. (I can't say for
>>> > >>>> sure that it does, but I can't say that it doesn't either.) I don't
>>> > >>>> think the proposal is necessarily wrong or bad, but I think we need
>>> > >>>> some more detailed planning around future milestones.
>>> > >>>>
>>> > >>>> On Tue, Oct 15, 2019 at 12:04 PM Dan Davydov
>>> > >>>> <ddavy...@twitter.com.invalid> wrote:
>>> > >>>>
>>> > >>>>> -1 (binding), this may sound a bit FUD-y but I don't feel this
>>> > >>>>> has been thought through enough...
>>> > >>>>>
>>> > >>>>> Having both a SimpleDagBag representation and the JSON
>>> > >>>>> representation doesn't make sense to me at the moment: *"Quoting
>>> > >>>>> from Airflow code, it is “a simplified representation of a DAG
>>> > >>>>> that contains all attributes required for instantiating and
>>> > >>>>> scheduling its associated tasks.”. It does not contain enough
>>> > >>>>> information required by the webserver."* Why not create a
>>> > >>>>> representation that can be used by both? This is going to be a big
>>> > >>>>> headache to both understand and work with in the codebase since it
>>> > >>>>> will be another definition that we need to keep in sync.
>>> > >>>>>
>>> > >>>>> Not sure if fileloc/fileloc_hash is the right solution; the
>>> > >>>>> long-term solution I am imagining has clients responsible for
>>> > >>>>> uploading DAGs rather than retrieving them from the filesystem, so
>>> > >>>>> fileloc/fileloc_hash wouldn't even exist (dag_id would be used for
>>> > >>>>> indexing here).
>>> > >>>>>
>>> > >>>>> Versioning isn't really addressed either (e.g. if a DAG topology
>>> > >>>>> changes with some update you want to be able to show both the old
>>> > >>>>> and new ones, or at least have a way to deal with them). There is
>>> > >>>>> an argument that this is acceptable since it isn't addressed now,
>>> > >>>>> but I'm worried that large schema changes should think through the
>>> > >>>>> long-term plan a bit more.
>>> > >>>>>
>>> > >>>>> I feel like getting this wrong is going to make it very hard to
>>> > >>>>> migrate things in the future, and make the codebase worse (another
>>> > >>>>> representation of DAGs that we need to support/understand/keep
>>> > >>>>> parity for). If I'm wrong about this then I would be more willing
>>> > >>>>> to +1 this change. This doc is a 1-2 pager and I feel like it is
>>> > >>>>> not thorough or deep enough and doesn't give me enough confidence
>>> > >>>>> that the work in the PR is going to make it easier to complete the
>>> > >>>>> future milestones instead of harder.
>>> > >>>>>
>>> > >>>>> On Tue, Oct 15, 2019 at 11:26 AM Kamil Breguła
>>> > >>>>> <kamil.breg...@polidea.com> wrote:
>>> > >>>>>
>>> > >>>>>> +1 (binding)
>>> > >>>>>>
>>> > >>>>>> On Tue, Oct 15, 2019 at 2:57 AM Kaxil Naik <kaxiln...@gmail.com>
>>> > >>>>>> wrote:
>>> > >>>>>>
>>> > >>>>>>> Hello, Airflow community,
>>> > >>>>>>>
>>> > >>>>>>> This email calls for a vote to add the DAG Serialization
>>> > >>>>>>> feature at https://github.com/apache/airflow/pull/5743.
>>> > >>>>>>>
>>> > >>>>>>> *AIP*:
>>> > >>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
>>> > >>>>>>>
>>> > >>>>>>> *Previous Mailing List discussion*:
>>> > >>>>>>> https://lists.apache.org/thread.html/65d282368e0a7c19815badb8b1c6c8d72b0975ce94f601e13af44f74@%3Cdev.airflow.apache.org%3E
>>> > >>>>>>>
>>> > >>>>>>> *Authors*: Kaxil Naik, Zhou Fang, Ash Berlin-Taylor
>>> > >>>>>>>
>>> > >>>>>>> *Summary*:
>>> > >>>>>>>
>>> > >>>>>>>   - DAGs are serialized using JSON format and stored in a
>>> > >>>>>>>   SerializedDag table
>>> > >>>>>>>   - The Webserver, instead of having to parse the DAG file
>>> > >>>>>>>   again, reads the serialized DAGs in JSON, de-serializes them,
>>> > >>>>>>>   creates the DagBag, and uses it to show in the UI.
>>> > >>>>>>>   - Instead of loading an entire DagBag when the WebServer
>>> > >>>>>>>   starts, we only load each DAG on demand from the SerializedDag
>>> > >>>>>>>   table. This helps reduce Webserver startup time and memory.
>>> > >>>>>>>   The reduction is notable when you have a large number of DAGs.
>>> > >>>>>>>   - A JSON Schema has been defined, and we validate the
>>> > >>>>>>>   serialized dag against it before writing it to the database
>>> > >>>>>>>   (see the sketch below)
>>> > >>>>>>>
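>>> > >>>>>>> As a rough illustration of that validation step (the schema
>>> > >>>>>>> below is a toy; the real JSON Schema in the PR is far more
>>> > >>>>>>> detailed):
>>> > >>>>>>>
>>> > >>>>>>>     # Toy sketch -- the actual schema in the PR is much richer.
>>> > >>>>>>>     from jsonschema import validate
>>> > >>>>>>>
>>> > >>>>>>>     SERIALIZED_DAG_SCHEMA = {
>>> > >>>>>>>         "type": "object",
>>> > >>>>>>>         "required": ["__version", "dag"],
>>> > >>>>>>>         "properties": {
>>> > >>>>>>>             "__version": {"type": "integer"},
>>> > >>>>>>>             "dag": {"type": "object"},
>>> > >>>>>>>         },
>>> > >>>>>>>     }
>>> > >>>>>>>
>>> > >>>>>>>     # Raises jsonschema.ValidationError on a non-conforming blob.
>>> > >>>>>>>     validate(instance={"__version": 1, "dag": {}},
>>> > >>>>>>>              schema=SERIALIZED_DAG_SCHEMA)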
>>> > >>>>>>>
>>> > >>>>>>> A PR (https://github.com/apache/airflow/pull/5743) is ready for
>>> > >>>>>>> review from the committers and community.
>>> > >>>>>>>
>>> > >>>>>>> We also have a WIP PR
>>> > >>>>>>> (https://github.com/apache/airflow/pull/5992) to backport this
>>> > >>>>>>> feature to the 1.10.* branch.
>>> > >>>>>>>
>>> > >>>>>>> A big thank you to Zhou and Ash for their continuous help in
>>> > >>>>>>> improving this feature/PR.
>>> > >>>>>>>
>>> > >>>>>>> This email is formally calling for a vote to accept the AIP and
>>> > >>>>>>> PR. Please note that we will update the PR / feature to fix bugs
>>> > >>>>>>> if we find any.
>>> > >>>>>>>
>>> > >>>>>>> Cheers,
>>> > >>>>>>> Kaxil
>>> > >>>>>>>
>>> > >>>>>>
>>> > >>>>>
>>> > >>>>
>>> > >>>
>>> > >>>
>>> > >>> --
>>> > >>>
>>> > >>> Jarek Potiuk
>>> > >>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>> > >>>
>>> > >>> M: +48 660 796 129 <+48660796129>
>>> > >>>
>>> > >>
>>> >
>>> >
>>>
>>
