Hi guys,
I have addressed all comments on the PR: https://github.com/apache/airflow/pull/5743

Can we merge this PR please, if everything looks good and is approved by the
committers? It is becoming increasingly difficult to rebase on master and
resolve conflicts. I also have the backport PR
(https://github.com/apache/airflow/pull/5992) ready, and that too is becoming
a nightmare to maintain.

Regards,
Kaxil

On Wed, Oct 23, 2019, 23:39 Kaxil Naik <kaxiln...@gmail.com> wrote:

> This vote passed (although not unanimously) and I'll mark this AIP as
> accepted.
>
> *Result*:
> +1 votes: 7 (6 binding and 1 non-binding vote)
> -1 votes: 2 (2 binding and 0 non-binding votes)
>
> *+1 (binding)*:
> Kaxil Naik
> Ash Berlin-Taylor
> Jarek Potiuk
> Kamil Breguła
> Fokko Driesprong
> Sumit Maheshwari
>
> *+1 (non-binding)*:
> Philippe Gagnon
>
> *-1 (binding)*:
> Dan Davydov
> Alex Guziel
>
> On Thu, Oct 17, 2019 at 5:29 PM Dan Davydov <ddavy...@twitter.com> wrote:
>
>> Not sure I'm convinced; I think my core concern of maintaining two
>> representations, and not having a thought-out future plan of how the
>> schema will evolve and how easy this migration will be, still stands.
>> It feels like we are trying to chase some short-term wins here. But
>> hey, it's a democracy, don't let my vote block you :)!
>>
>> On Thu, Oct 17, 2019 at 10:15 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
>>
>>> Hi Dan, I understand your concern. Your +1 and suggestions are very
>>> important for us, so let me try to explain in more detail and see if
>>> I can convince you.
>>>
>>> Please check replies in-line.
>>>
>>>> I think this is the kind of core change we should make from the
>>>> get-go; my feeling is that we will be paying for this decision
>>>> otherwise, and the cost will surpass the immediate value. Ideally we
>>>> would fully understand the long-term plan before deciding on this
>>>> serialization format, but I think at least having a single
>>>> serialized representation for now would be a decent compromise for
>>>> me. This is still a blocker to me agreeing.
>>>
>>> Trying to do it all (serialization for the Webserver, Scheduler, etc.)
>>> at once would be error-prone, as we would need to cover all cases, and
>>> it is very difficult to test and review such a large chunk of work
>>> even if we have multiple people reviewing it.
>>>
>>> Doing it iteratively would allow us to roll out serialization while
>>> keeping backwards compatibility. Keeping backwards compatibility for
>>> now is very important, at least for the Webserver, as users would
>>> otherwise have to wait for months or, at worst, one more year (plus
>>> the upgrade path from 1.10.* to 2.0 is already looking quite huge).
>>>
>>> Airflow 2.0 is still going to take months and is already going to have
>>> a number of breaking changes that are going to make updating to it a
>>> cumbersome task:
>>>
>>> - No non-RBAC UI
>>> - No Py2
>>> - Import paths have been (or are in the process of being) changed for
>>>   all/most contrib objects
>>> - CLI reorganization
>>> - and various others:
>>>   https://github.com/apache/airflow/blob/master/UPDATING.md#airflow-master
>>>
>>> I am listing the above for one reason: "how long would users have to
>>> deal with the Webserver scalability issue" before they upgrade to 2.0!
>>>
>>> However, I agree with your argument that we should ideally understand
>>> the long-term plan.
>>>
>>> The current serialized representation of a DAG looks like the
>>> following:
>>> https://github.com/apache/airflow/blob/a6f37cd0da0774dff6461d3fdcb94dee9d101fda/tests/serialization/test_dag_serialization.py#L40-L93
>>>
>>> As you can see, most of the information required by SimpleTask is
>>> already present. Currently, SimpleTask is used by the Scheduler. The
>>> next phase of this AIP will be to use serialization in the Scheduler
>>> and remove SimpleTask altogether.
>>>
>>> Since you said that at least having a single serialized representation
>>> for now would be a decent compromise, please suggest or let us know if
>>> you think we are missing anything in the serialized blob example above.
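To give a feel for the format without chasing the link, here is a heavily
trimmed sketch of such a blob. Field names and values are illustrative only;
the linked test file is authoritative for the exact schema:

    {
        "__version": 1,
        "dag": {
            "_dag_id": "simple_dag",
            "fileloc": "/path/to/simple_dag.py",
            "timezone": "UTC",
            "tasks": [
                {
                    "_task_type": "BashOperator",
                    "task_id": "bash_task",
                    "owner": "airflow",
                    "retries": 1,
                    "template_fields": ["bash_command", "env"],
                    "_downstream_task_ids": []
                }
            ]
        }
    }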
>>>
>>> ------------------------
>>>
>>>> As long as all of the changes are in the camp of "easy to
>>>> migrate/undo in the future" then these are not blockers for me. I'm
>>>> not 100% convinced that they are, but you have more context so I'll
>>>> defer to your judgment.
>>>
>>> Currently, as the serialization applies only to the webserver, if
>>> something goes wrong, changing `store_serialized_dags=False` would
>>> restore the no-serialization behavior.
>>>
>>> We are not touching the Scheduler for 1.10.*, as it is the actual
>>> heart of Airflow, and we wouldn't feel comfortable changing that core
>>> in a minor/patch release.
>>>
>>> ------------------------
>>>
>>>> I'm happy with JSON; my main point about "what if we get it wrong"
>>>> is that I feel like more planning work could have been put into this
>>>> project vs just a couple of rough high-level target "milestones".
>>>> This significantly increases the risk of the project long term in my
>>>> eyes.
>>>
>>> Getting all the things right all at once would be harder than taking
>>> it piece by piece. We have discovered many things in the Webserver
>>> that we initially did not take into account, like "Operator Links".
>>> Even the Scheduler will have something similar that we won't know now
>>> but find out later, which would then require re-working and going
>>> back and forth.
>>>
>>> ------------------------
>>>
>>>> I think as long as we combine the serialized DAG representations and
>>>> have some confidence that the data format will be easy to switch to
>>>> future potential formats we would want (kind of hard to do without a
>>>> more detailed future plan...), then I would happily change my vote.
>>>
>>> For storing versioned DAGs, this is just a rough idea and it can
>>> change:
>>>
>>> We have a last_updated_time column in the Serialized Dag table.
>>> Currently we just update that DAG's row when we process the DAG
>>> again; instead, we would store it as a different row if the DAG has
>>> changed and if it already had a DagRun with the previous version,
>>> i.e. we don't need to store a new version every time the DAG changes
>>> if it hasn't run in the meantime.
>>>
>>> We can assign a unique number (version) to each row for a DAG.
>>> Sorting by the `last_updated_time` column then gives us how the DAG
>>> was updated over time.
>>>
>>> Regards,
>>> Kaxil
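To make the versioning idea above concrete, a rough self-contained sketch of
that write path. The in-memory TABLE, the HAS_RUN set, and the function name
are hypothetical stand-ins, not the actual implementation:

    import hashlib
    import json

    # Hypothetical stand-in for the SerializedDag table:
    # dag_id -> list of {"version", "dag_hash", "data"} rows.
    TABLE = {}
    HAS_RUN = set()  # (dag_id, version) pairs that have a DagRun

    def write_serialized_dag(dag_id, blob):
        """Store a new version row only when the DAG changed and the
        previous version has actually been used by a DagRun."""
        dag_hash = hashlib.sha1(
            json.dumps(blob, sort_keys=True).encode()
        ).hexdigest()
        rows = TABLE.setdefault(dag_id, [])
        if not rows:
            rows.append({"version": 1, "dag_hash": dag_hash, "data": blob})
        elif rows[-1]["dag_hash"] != dag_hash:
            if (dag_id, rows[-1]["version"]) in HAS_RUN:
                # The old version was run: keep it for history.
                rows.append({"version": rows[-1]["version"] + 1,
                             "dag_hash": dag_hash, "data": blob})
            else:
                # The latest version never ran: overwrite it in place.
                rows[-1].update(dag_hash=dag_hash, data=blob)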
>>>
>>> On Wed, Oct 16, 2019 at 4:43 PM Dan Davydov
>>> <ddavy...@twitter.com.invalid> wrote:
>>>
>>>> Responses inline.
>>>>
>>>> On Wed, Oct 16, 2019 at 6:07 AM Ash Berlin-Taylor <a...@apache.org>
>>>> wrote:
>>>>
>>>> > Thanks for the feedback and discussion everyone - it's nice to know
>>>> > people feel strongly about this. Let's make sure we build this
>>>> > right!
>>>> >
>>>> > As a reminder, the main goals of this AIP:
>>>> >
>>>> > - To speed up boot-up/worker recycle time of the webserver.
>>>> >
>>>> > If you have a large number of DAGs or highly dynamic DAGs, it is
>>>> > possible that starting your webserver needs a Gunicorn worker
>>>> > timeout of 5 or 10 minutes. (It is worse at start-up because each
>>>> > of the gunicorn workers parses the dags itself, in exactly the same
>>>> > order, at the same time as the other workers, leading to disk
>>>> > contention and CPU burst.)
>>>> >
>>>> > And just yesterday I helped two people diagnose a problem that was
>>>> > caused by slow NFS: `time airflow list_dags` took 3m35s to go over
>>>> > 43 dags.
>>>> >
>>>> > At Astronomer and Google (Cloud Composer) we've seen this slow
>>>> > loading many times on our customers' installs.
>>>> >
>>>> > And not to mention this makes the webserver snappier.
>>>> >
>>>> > - To make the webserver almost "stateless" (not the best term for
>>>> > it: diskless perhaps? Dagbag-less?).
>>>> >
>>>> > This makes actions on the webserver quicker. We had a number of PRs
>>>> > in the 1.10 series that made views and API endpoints in the
>>>> > webserver quicker by making them only load one dag from disk. This
>>>> > takes it further and means that (as of the PR) only the Rendered
>>>> > tab needs to go to the code on disk. As a reminder, the
>>>> > "trigger_dag" endpoint would previously load all dags. Others still
>>>> > do. Multiple people have complained about this on Slack.
>>>> >
>>>> > (The only remaining view that has to hit the code on disk is the
>>>> > Rendered template tab, as right now a DAG can define custom macros
>>>> > or filters, and we don't want to have to deal with serializing
>>>> > code. The only "real" way of doing that is CloudPickle. No thank
>>>> > you.)
>>>> >
>>>> > To address the points:
>>>> >
>>>> > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl>
>>>> > wrote:
>>>> > > - Are we going to extend the existing data model, to allow the
>>>> > > RDBMS to optimize queries on fields that we use a lot?
>>>> >
>>>> > Yes, as needed. Not really needed for this PR yet (it doesn't
>>>> > change any query patterns in the webserver, so they are using the
>>>> > existing tables).
>>>> >
>>>> > > - How are we going to do state evolution when we extend the JSON
>>>> > > model?
>>>> >
>>>> > The top-level object we're storing has a __version field (for
>>>> > example `{"__version": 1, "dag": { ... } }`) so we can detect older
>>>> > versions and either run a db "migration" on install, or upgrade at
>>>> > load time.
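A minimal sketch of what the "upgrade at load time" path could look like,
keyed off that __version envelope. The v1-to-v2 migration step here is a
made-up example, not a real schema change:

    import json

    CURRENT_VERSION = 2

    def _upgrade_v1_to_v2(blob):
        # Hypothetical migration: v2 might add a field that v1 lacked.
        blob.setdefault("dag", {}).setdefault("tags", [])
        blob["__version"] = 2
        return blob

    UPGRADES = {1: _upgrade_v1_to_v2}

    def deserialize(json_str):
        """Parse a serialized blob, upgrading older versions in memory."""
        blob = json.loads(json_str)
        while blob.get("__version", 1) < CURRENT_VERSION:
            blob = UPGRADES[blob.get("__version", 1)](blob)
        return blob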
>>>> >
>>>> > On 15 Oct 2019, at 20:04, Dan Davydov <ddavy...@twitter.com.invalid>
>>>> > wrote:
>>>> > > Having both a SimpleDagBag representation and the JSON
>>>> > > representation doesn't make sense to me at the moment: *"Quoting
>>>> > > from Airflow code, it is “a simplified representation of a DAG
>>>> > > that contains all attributes required for instantiating and
>>>> > > scheduling its associated tasks.” It does not contain enough
>>>> > > information required by the webserver."* Why not create a
>>>> > > representation that can be used by both? This is going to be a
>>>> > > big headache to both understand and work with in the codebase
>>>> > > since it will be another definition that we need to keep in sync.
>>>> >
>>>> > Honestly: because that makes the change bigger, harder to review
>>>> > and harder to backport. If this is a serious blocker to you
>>>> > agreeing, we can update it to use the same structure as part of
>>>> > this PR, but I would rather it was a separate change.
>>>> >
>>>> I think this is the kind of core change we should make from the
>>>> get-go; my feeling is that we will be paying for this decision
>>>> otherwise, and the cost will surpass the immediate value. Ideally we
>>>> would fully understand the long-term plan before deciding on this
>>>> serialization format, but I think at least having a single serialized
>>>> representation for now would be a decent compromise for me.
>>>>
>>>> This is still a blocker to me agreeing.
>>>>
>>>> > > Not sure if fileloc/fileloc_hash is the right solution; the
>>>> > > long-term solution I am imagining has clients responsible for
>>>> > > uploading DAGs rather than retrieving them from the filesystem,
>>>> > > so fileloc/fileloc_hash wouldn't even exist (dag_id would be used
>>>> > > for indexing here).
>>>> >
>>>> > Then we can remove it later, when that code exists. For now, so
>>>> > much of the rest of the code base assumes DAGs come from files on
>>>> > disk.
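For context, a fileloc_hash is just an indexable fingerprint of the DAG file
path. One way it could be computed (illustrative only, not necessarily how
Airflow does it):

    import hashlib

    def fileloc_hash(fileloc: str) -> int:
        """Stable hash of a DAG file path, truncated so it fits in a
        signed 64-bit integer column and can be indexed cheaply."""
        digest = hashlib.md5(fileloc.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") >> 1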
>>>> >
>>>> > > Versioning isn't really addressed either (e.g. if a DAG topology
>>>> > > changes with some update, you want to be able to show both the
>>>> > > old and new ones, or at least have a way to deal with them).
>>>> > > There is an argument that this is acceptable since it isn't
>>>> > > addressed now, but I'm worried that large schema changes should
>>>> > > think through the long-term plan a bit more.
>>>> >
>>>> > Correct, and this is by design, to make the change smaller and
>>>> > easier to review. Adding versioning using the existing
>>>> > serialization format is not that much work once the groundwork is
>>>> > here and the webserver is not relying on DAGs from disk. There's
>>>> > also a UI question of how to handle this on the Tree view that I
>>>> > don't have an immediate answer for. But yes, versioning of dag run
>>>> > history was on our minds when working on this PR.
>>>> >
>>>> As long as all of the changes are in the camp of "easy to
>>>> migrate/undo in the future" then these are not blockers for me. I'm
>>>> not 100% convinced that they are, but you have more context so I'll
>>>> defer to your judgment.
>>>>
>>>> > > I feel like getting this wrong is going to make it very hard to
>>>> > > migrate things in the future, and make the codebase worse
>>>> > > (another representation of DAGs that we need to
>>>> > > support/understand/keep parity for). If I'm wrong about this then
>>>> > > I would be more willing to +1 this change. This doc is a 1-2
>>>> > > pager and I feel like it is not thorough or deep enough and
>>>> > > doesn't give me enough confidence that the work in the PR is
>>>> > > going to make it easier to complete the future milestones instead
>>>> > > of harder.
>>>> >
>>>> > "But what if we get it wrong" is not something we can act on. So
>>>> > long as you are happy with the choice of JSON as the serialization,
>>>> > then since we've got a version in the record we can change this
>>>> > relatively easily in the future - we'll know when we're dealing
>>>> > with an older record (we won't have to guess) and can write a
>>>> > format migrator (either to upgrade in memory, or to upgrade the
>>>> > record in place, as a db "migration" or an upgrade command).
>>>> >
>>>> I'm happy with JSON; my main point about "what if we get it wrong" is
>>>> that I feel like more planning work could have been put into this
>>>> project vs just a couple of rough high-level target "milestones".
>>>> This significantly increases the risk of the project long term in my
>>>> eyes.
>>>>
>>>> > We've based our serialization on Uber's Piper work
>>>> > <https://eng.uber.com/managing-data-workflows-at-scale/> (we had a
>>>> > call with Alex at Uber before starting any of this work), so we've
>>>> > piggy-backed off their hard work and months of running something
>>>> > very close to this format in production.
>>>>
>>>> This is great to hear!
>>>>
>>>> > Also, let's not forget: right now we aren't storing versions, and
>>>> > we still have DAGs on disk. If we do get it massively wrong (which
>>>> > I don't think we have) we can just drop this entire table and start
>>>> > again. This is all an internal implementation detail that is not
>>>> > exposed via an API. What better way to see if this is right than by
>>>> > running it at scale in production across 1000s of customers
>>>> > (Astronomer + Cloud Composer)?
>>>> >
>>>> If we have thought through future milestones and made sure that we
>>>> can easily switch to the different formats that could be required for
>>>> them, then I'm not a blocker on this front.
>>>>
>>>> > Can you be more specific about what you are worried about here?
>>>> > It's a bit hard to put general "but what if we get it wrong" fears
>>>> > to rest.
>>>> >
>>>> > On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel
>>>> > <alex.guz...@airbnb.com.invalid> wrote:
>>>> > > We don't need to have the future plan implemented completely but
>>>> > > it would be nice to see more detailed notes about how this will
>>>> > > play out in the future.
>>>> >
>>>> > Fair point. Our rough long-term plan:
>>>> >
>>>> > - Add versioning to the UI, likely by storing a new version linked
>>>> > from the DagRun (but only when it changes, not every time, to avoid
>>>> > DB bloat - multiple dag_run rows will point at the same serialized
>>>> > blob).
>>>> >
>>>> > - Rework the scheduler to use the serialized version, not the
>>>> > Simple* version, assuming we can do it and keep the scheduler
>>>> > performing well (i.e. that ser/deser time isn't significant for the
>>>> > scheduler loop).
>>>> >
>>>> > - Massively rework how the scheduler creates TaskInstances (right
>>>> > now this is done in the Dag parsing process, not the main
>>>> > scheduler) and dag runs. We have to keep an eye on scheduler
>>>> > performance as we do this. If we can pull this back into the main
>>>> > scheduler then this leads us towards the next points:
>>>> >
>>>> > - Split out the updating of serialized dags/parsing of DAGs from
>>>> > the scheduler loop so it can be a separate component/subprocess
>>>> > (which is on by default). This gets us towards being able to submit
>>>> > DAG definitions via an API if we wanted. (There's a lot to work out
>>>> > to get that working, mostly around running the DAG.)
>>>> >
>>>> > - Once the scheduler loop/logic is not tied to parsing the DAGs,
>>>> > this also heads us towards being able to run multiple schedulers.
>>>> > (This isn't a requirement for an HA scheduler, but it makes them
>>>> > easier to "scale out".)
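The first bullet's "multiple dag_run rows point at the same serialized blob"
could be modeled roughly as follows. This is an illustrative SQLAlchemy
sketch with hypothetical table and column names, not the actual Airflow
schema:

    from sqlalchemy import Column, ForeignKey, Integer, String, Text
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class SerializedDagVersion(Base):
        __tablename__ = "serialized_dag_version"
        id = Column(Integer, primary_key=True)
        dag_id = Column(String(250), nullable=False)
        version = Column(Integer, nullable=False)
        data = Column(Text, nullable=False)  # the JSON blob

    class DagRun(Base):
        __tablename__ = "dag_run"
        id = Column(Integer, primary_key=True)
        dag_id = Column(String(250), nullable=False)
        # Many runs can reference one serialized version; a new version
        # row is only written when the DAG actually changed.
        serialized_dag_version_id = Column(
            Integer, ForeignKey("serialized_dag_version.id"))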
>>>> >
>>>> > Hope this answers people's questions, and thanks for the feedback.
>>>> >
>>>> > Alex and Dan: Does this give enough information for you to change
>>>> > your vote?
>>>> >
>>>> I think as long as we combine the serialized DAG representations and
>>>> have some confidence that the data format will be easy to switch to
>>>> future potential formats we would want (kind of hard to do without a
>>>> more detailed future plan...), then I would happily change my vote.
>>>>
>>>> > -ash
>>>> >
>>>> > > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl>
>>>> > > wrote:
>>>> > >
>>>> > > Big +1 from my side, looking forward to making this happen.
>>>> > >
>>>> > > Two aspects that aren't completely clear to me:
>>>> > >
>>>> > > - Are we going to extend the existing data model, to allow the
>>>> > > RDBMS to optimize queries on fields that we use a lot?
>>>> > > - How are we going to do state evolution when we extend the JSON
>>>> > > model?
>>>> > >
>>>> > > I have good confidence that we'll solve this along the way.
>>>> > >
>>>> > > Cheers, Fokko
>>>> > >
>>>> > > On Tue, 15 Oct 2019 at 21:29, Dan Davydov
>>>> > > <ddavy...@twitter.com.invalid> wrote:
>>>> > >
>>>> > >> I have been following it from the beginning as well. I
>>>> > >> understand there would be short-term wins for some users (I
>>>> > >> don't think a huge number of users?), but I still feel like we
>>>> > >> are being a bit short-sighted here and that we are creating more
>>>> > >> work for ourselves and potentially our users in the future. I
>>>> > >> also feel like there will be side effects for users as well,
>>>> > >> many of whom don't care about the webserver scalability, such as
>>>> > >> bugs caused by the addition of the new webserver representation.
>>>> > >> I think without a design that is much larger in scope I wouldn't
>>>> > >> feel comfortable moving forward with this AIP.
>>>> > >>
>>>> > >> On Tue, Oct 15, 2019 at 3:21 PM Jarek Potiuk
>>>> > >> <jarek.pot...@polidea.com> wrote:
>>>> > >>
>>>> > >>> Hello Dan, Alex,
>>>> > >>>
>>>> > >>> I believe all the points you make are super-valid ones. But
>>>> > >>> maybe you are missing the full context a bit.
>>>> > >>>
>>>> > >>> I followed the original discussion
>>>> > >>> <https://lists.apache.org/thread.html/a2d426f93c0f4e5f0347627308638b59ca4072fd022a42af1163e34a@%3Cdev.airflow.apache.org%3E>
>>>> > >>> from the very beginning and took part in the initial
>>>> > >>> discussions when this topic was raised. From the discussion it
>>>> > >>> is quite clear to me that this is mostly a "tactical" approach
>>>> > >>> to implement something that is backportable to 1.10 and rather
>>>> > >>> quick to implement. This is targeted at making users happier
>>>> > >>> with their 1.10 version without the timing uncertainty and
>>>> > >>> effort of migration to 2.0. It solves the major pain point of
>>>> > >>> stability of the UI in case there are complex DAGs for which
>>>> > >>> parsing crashes the webserver. Like in "being nice to your
>>>> > >>> users".
>>>> > >>>
>>>> > >>> There will be a separate effort to make pretty much all of the
>>>> > >>> things you mentioned in 2.0, in a non-backportable way, as it
>>>> > >>> requires far too many changes in the way Airflow works
>>>> > >>> internally.
>>>> > >>>
>>>> > >>> Maybe it needs some more explanation + a long-term plan that
>>>> > >>> follows in the AIP itself, to explain it to those who have not
>>>> > >>> followed the initial discussion, but I think it's a fully
>>>> > >>> justified change.
>>>> > >>>
>>>> > >>> J.
>>>> > >>>
>>>> > >>> On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel
>>>> > >>> <alex.guz...@airbnb.com.invalid> wrote:
>>>> > >>>
>>>> > >>>> -1 (binding)
>>>> > >>>> Good points made by Dan. We don't need to have the future plan
>>>> > >>>> implemented completely, but it would be nice to see more
>>>> > >>>> detailed notes about how this will play out in the future. We
>>>> > >>>> shouldn't walk into a system that causes more pain in the
>>>> > >>>> future. (I can't say for sure that it does, but I can't say
>>>> > >>>> that it doesn't either.) I don't think the proposal is
>>>> > >>>> necessarily wrong or bad, but I think we need some more
>>>> > >>>> detailed planning around future milestones.
>>>> > >>>>
>>>> > >>>> On Tue, Oct 15, 2019 at 12:04 PM Dan Davydov
>>>> > >>>> <ddavy...@twitter.com.invalid> wrote:
>>>> > >>>>
>>>> > >>>>> -1 (binding), this may sound a bit FUD-y but I don't feel
>>>> > >>>>> this has been thought through enough...
>>>> > >>>>>
>>>> > >>>>> Having both a SimpleDagBag representation and the JSON
>>>> > >>>>> representation doesn't make sense to me at the moment:
>>>> > >>>>> *"Quoting from Airflow code, it is “a simplified
>>>> > >>>>> representation of a DAG that contains all attributes required
>>>> > >>>>> for instantiating and scheduling its associated tasks.” It
>>>> > >>>>> does not contain enough information required by the
>>>> > >>>>> webserver."* Why not create a representation that can be used
>>>> > >>>>> by both? This is going to be a big headache to both
>>>> > >>>>> understand and work with in the codebase since it will be
>>>> > >>>>> another definition that we need to keep in sync.
>>>> > >>>>>
>>>> > >>>>> Not sure if fileloc/fileloc_hash is the right solution; the
>>>> > >>>>> long-term solution I am imagining has clients responsible for
>>>> > >>>>> uploading DAGs rather than retrieving them from the
>>>> > >>>>> filesystem, so fileloc/fileloc_hash wouldn't even exist
>>>> > >>>>> (dag_id would be used for indexing here).
>>>> > >>>>>
>>>> > >>>>> Versioning isn't really addressed either (e.g. if a DAG
>>>> > >>>>> topology changes with some update, you want to be able to
>>>> > >>>>> show both the old and new ones, or at least have a way to
>>>> > >>>>> deal with them). There is an argument that this is acceptable
>>>> > >>>>> since it isn't addressed now, but I'm worried that large
>>>> > >>>>> schema changes should think through the long-term plan a bit
>>>> > >>>>> more.
>>>> > >>>>>
>>>> > >>>>> I feel like getting this wrong is going to make it very hard
>>>> > >>>>> to migrate things in the future, and make the codebase worse
>>>> > >>>>> (another representation of DAGs that we need to
>>>> > >>>>> support/understand/keep parity for). If I'm wrong about this
>>>> > >>>>> then I would be more willing to +1 this change. This doc is a
>>>> > >>>>> 1-2 pager and I feel like it is not thorough or deep enough
>>>> > >>>>> and doesn't give me enough confidence that the work in the PR
>>>> > >>>>> is going to make it easier to complete the future milestones
>>>> > >>>>> instead of harder.
>>>> > >>>>>
>>>> > >>>>> On Tue, Oct 15, 2019 at 11:26 AM Kamil Breguła
>>>> > >>>>> <kamil.breg...@polidea.com> wrote:
>>>> > >>>>>
>>>> > >>>>>> +1 (binding)
>>>> > >>>>>>
>>>> > >>>>>> On Tue, Oct 15, 2019 at 2:57 AM Kaxil Naik
>>>> > >>>>>> <kaxiln...@gmail.com> wrote:
>>>> > >>>>>>
>>>> > >>>>>>> Hello, Airflow community,
>>>> > >>>>>>>
>>>> > >>>>>>> This email calls for a vote to add the DAG Serialization
>>>> > >>>>>>> feature at https://github.com/apache/airflow/pull/5743.
>>>> > >>>>>>>
>>>> > >>>>>>> *AIP*:
>>>> > >>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
>>>> > >>>>>>>
>>>> > >>>>>>> *Previous Mailing List discussion*:
>>>> > >>>>>>> https://lists.apache.org/thread.html/65d282368e0a7c19815badb8b1c6c8d72b0975ce94f601e13af44f74@%3Cdev.airflow.apache.org%3E
>>>> > >>>>>>>
>>>> > >>>>>>> *Authors*: Kaxil Naik, Zhou Fang, Ash Berlin-Taylor
>>>> > >>>>>>>
>>>> > >>>>>>> *Summary*:
>>>> > >>>>>>>
>>>> > >>>>>>> - DAGs are serialized using the JSON format and stored in a
>>>> > >>>>>>> SerializedDag table.
>>>> > >>>>>>> - The Webserver, instead of having to parse the DAG file
>>>> > >>>>>>> again, now reads the serialized DAGs in JSON, de-serializes
>>>> > >>>>>>> them, creates the DagBag, and uses it to show the DAGs in
>>>> > >>>>>>> the UI.
>>>> > >>>>>>> - Instead of loading an entire DagBag when the Webserver
>>>> > >>>>>>> starts, we only load each DAG on demand from the Serialized
>>>> > >>>>>>> Dag table. This helps reduce Webserver startup time and
>>>> > >>>>>>> memory. The reduction is notable when you have a large
>>>> > >>>>>>> number of DAGs.
>>>> > >>>>>>> - A JSON Schema has been defined, and we validate the
>>>> > >>>>>>> serialized DAG before writing it to the database.
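The JSON Schema validation mentioned in the last bullet could look roughly
like this, using the jsonschema library. The schema shown is a toy
stand-in; the real schema in the PR is far more complete:

    import json
    import jsonschema

    # Toy stand-in for the schema shipped with the PR.
    SCHEMA = {
        "type": "object",
        "required": ["__version", "dag"],
        "properties": {
            "__version": {"type": "integer"},
            "dag": {"type": "object"},
        },
    }

    def validate_serialized_dag(json_str):
        """Raise jsonschema.ValidationError if the blob is malformed."""
        jsonschema.validate(instance=json.loads(json_str), schema=SCHEMA)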
>>>> > >>>>>>>
>>>> > >>>>>>> [image: image.png]
>>>> > >>>>>>>
>>>> > >>>>>>> A PR (https://github.com/apache/airflow/pull/5743) is ready
>>>> > >>>>>>> for review from the committers and community.
>>>> > >>>>>>>
>>>> > >>>>>>> We also have a WIP PR
>>>> > >>>>>>> (https://github.com/apache/airflow/pull/5992) to backport
>>>> > >>>>>>> this feature to the 1.10.* branch.
>>>> > >>>>>>>
>>>> > >>>>>>> A big thank you to Zhou and Ash for their continuous help
>>>> > >>>>>>> in improving this feature/PR.
>>>> > >>>>>>>
>>>> > >>>>>>> This email is formally calling for a vote to accept the AIP
>>>> > >>>>>>> and PR. Please note that we will update the PR / feature to
>>>> > >>>>>>> fix bugs if we find any.
>>>> > >>>>>>>
>>>> > >>>>>>> Cheers,
>>>> > >>>>>>> Kaxil
>>>> > >>>
>>>> > >>> --
>>>> > >>> Jarek Potiuk
>>>> > >>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>> > >>> M: +48 660 796 129 <+48660796129>