Hi Dan, I understand your concern. Your +1 and suggestions are very
important to us, so let me try to explain in more detail
and see if I can convince you.

Please see my replies inline.

> I think this is the kind of core change we should make from the get-go, my
> feeling is that we will be paying for this decision otherwise, and the cost
> will surpass the immediate value. Ideally we would fully understand the
> long-term plan before deciding on this serialization format, but I think at
> least having a single serialized representation for now would be a decent
> compromise for me.
> This is still a blocker to me agreeing.


Trying to do it all (serialization for the Webserver, Scheduler, etc.) at
once would be error-prone, as we would need to cover every case in one go,
and such a large chunk of work is very difficult to test and review even
with multiple people reviewing it.

Doing it iteratively lets us roll out serialization while keeping
backwards compatibility. Keeping backwards compatibility for now
is very important, at least for the Webserver, as users would otherwise have
to wait months, or at worst another year (and the upgrade path from 1.10.*
to 2.0 is already looking quite big).

Airflow 2.0 is still months away and is already going to have a
number of breaking changes that will make
upgrading to it a cumbersome task:

- No non-RBAC UI
- No Py2
- Import paths have been (or are in the process of being) changed for
all/most contrib objects
- CLI reorganization
- and various others:
https://github.com/apache/airflow/blob/master/UPDATING.md#airflow-master

I am listing the above for one reason: how long would users have to
deal with the Webserver scalability issue before they can upgrade to 2.0?

However, I agree with your argument that we should ideally understand the
long-term plan.

The current serialized representation of a DAG looks like the following:
https://github.com/apache/airflow/blob/a6f37cd0da0774dff6461d3fdcb94dee9d101fda/tests/serialization/test_dag_serialization.py#L40-L93
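
For quick reference, an abridged sketch of that blob is below. Only the
`__version` wrapper is confirmed further down this thread; the inner field
names here are illustrative, so check the linked test for the real ground
truth:

    # Illustrative sketch only: field names inside "dag" are assumptions.
    serialized_dag = {
        "__version": 1,
        "dag": {
            "_dag_id": "simple_dag",
            "fileloc": "/home/airflow/dags/simple_dag.py",
            "timezone": "UTC",
            "start_date": 1564617600.0,  # epoch seconds (illustrative)
            "tasks": [
                {
                    "task_id": "simple_task",
                    "_task_type": "BaseOperator",
                    "owner": "airflow",
                    "retries": 1,
                },
            ],
        },
    }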

As you can see, most of the information required by SimpleTask is already
present. Currently, SimpleTask is used by the Scheduler. The next phase
of this AIP will be to use serialization in the Scheduler and remove
SimpleTask altogether.

You said that at least having a single serialized representation, for
now, would be a decent compromise. So how about you take a look
and let us know if you think we are missing anything in the serialized blob
example above?

------------------------

> As long as all of the changes are in the camp of "easy to migrate/undo in
> the future" then these are not blockers for me. I'm not 100% convinced that
> they are, but you have more context so I'll defer to your judgment.


Currently, as serialization applies only to the Webserver, if something
goes wrong, setting `store_serialized_dags=False` restores the old
no-serialization behavior.
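
Concretely, the rollback is a one-line config change. A sketch, assuming
the flag lives in the [core] section of airflow.cfg (the section name is my
assumption):

    [core]
    # Setting this back to False makes the webserver parse DAG files from
    # disk again instead of reading serialized DAGs from the DB.
    store_serialized_dags = False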

We are not touching the Scheduler in 1.10.* as it is the heart of Airflow,
and we wouldn't feel comfortable changing that core in a minor/patch release.

------------------------

> I'm happy with JSON, my main point about "what if we get it wrong" is that
> I feel like more planning work could have been put into this project vs
> just a couple of rough high-level target "milestones". This significantly
> increases the risk of the project long term in my eyes.


Getting everything right all at once would be harder than taking it
piece by piece.
We have already discovered many things in the Webserver that we initially
did not take into account, like "Operator Links".
The Scheduler will surely have similar surprises that we can't know about
now and will only find out later, which would otherwise mean reworking and
going back and forth.

------------------------

> I think as long as we combine the serialized DAG representations and have
> some confidence that the data format will be easy to switch to future
> potential formats we would want (kind of hard to do without a more detailed
> future plan...), then I would happily change my vote.


For storing versioned DAGs, this is just a rough idea and it can change:

We have a `last_updated_time` column in the Serialized DAG table. Currently
we just update that DAG's row when we process the DAG again.
Instead, we would store a new row if the DAG has changed
and the previous version already had a DagRun, i.e.
we don't need to store a new version every time the DAG changes if it
hasn't run in the meantime.

We can assign a unique number (version) to each row for a DAG.
Sorting by the `last_updated_time` column then shows how the DAG changed
over time.
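
To make that rule concrete, here is a minimal Python sketch (all names are
hypothetical, not the actual Airflow API):

    # Hypothetical sketch: store a new row only if the DAG changed AND the
    # previous version already has a DagRun; otherwise update in place.
    rows = {}  # dag_id -> list of version dicts, newest last

    def store_serialized_dag(dag_id, blob_hash, now):
        versions = rows.setdefault(dag_id, [])
        if versions and versions[-1]["blob_hash"] == blob_hash:
            return  # DAG unchanged: nothing to write
        if versions and not versions[-1]["has_dag_run"]:
            # Previous version never ran: overwrite it in place instead of
            # bloating the table with rows no DagRun points at.
            versions[-1].update(blob_hash=blob_hash, last_updated_time=now)
            return
        versions.append({
            "version": len(versions) + 1,  # unique per DAG
            "blob_hash": blob_hash,
            "last_updated_time": now,
            "has_dag_run": False,  # flipped when a DagRun uses this version
        })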

Regards,
Kaxil

On Wed, Oct 16, 2019 at 4:43 PM Dan Davydov <ddavy...@twitter.com.invalid>
wrote:

> Responses inline.
>
> On Wed, Oct 16, 2019 at 6:07 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> > Thanks for the feedback and discussion everyone - it's nice to know
> people
> > feel strongly about this. Let's make sure we build this right!
> >
> > As a reminder: the main goals of this AIP:
> >
> > - To speed up bootup/worker recycle time of the webserver
> >
> >   If you have a large number of DAGs or highly dynamic DAGs it is
> possible
> > that starting your webserver needs a Gunicorn worker timeout of 5 or 10
> > minutes. (It is worse at start up because each of the gunicorn workers
> > parses the dag itself, in exactly the same order, at the same time as the
> > other workers, leading to disk contention, and CPU burst.)
> >
> >   And just yesterday I helped two people diagnose a problem that was
> > caused by slow NFS: `time airflow list_dags` took 3:35mins to go over 43
> > dags.
> >
> >   At Astronomer, and Google (Cloud Composer) we've seen this slow loading
> > many times for our customer's installs.
> >
> >   And not to mention this makes the webserver snappier
> >
> > - To make the webserver almost "stateless" (not the best term for it:
> > Diskless perhaps? Dagbag-less?).
> >
> >   This makes actions on the webserver quicker. We had a number of PRs in
> > the 1.10 series that made views and API endpoints in the webserver
> > quicker by making them only load one dag from disk. This takes it
> > further and means that (as of the PR) only the Rendered tab needs to go
> > to the code on disk.
> > As a reminder the "trigger_dag" endpoint would previously load all dags.
> > Others still do. Multiple people have complained about this on Slack.
> >
> >   (The only remaining view that has to hit the code on disk is the
> > Rendered template tab, as right now a DAG can define custom macros or
> > filters, and we don't want to have to deal with serializing code. The
> only
> > "real" way of doing it is CloudPickle. No thank you)
> >
> >
> > To address the points:
> >
> >
> > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl> wrote
> > > - Are we going to extend the existing data model, to allow the RDBMS to
> > > optimize queries on fields that we use a lot?
> >
> > Yes, as needed. Not really needed for this PR yet (it doesn't change any
> > query patterns in the webserver, so they are using the existing tables)
> >
> > > - How are we going to do state evolution when we extend the JSON model
> > >
> >
> > The top level object we're storing has a __version field (for example
> > `{"__version": 1, "dag": { ... } }`) so we can detect older versions and
> > either run a db "migration" on install, or upgrade at load time.
> >
> >
> >
> > On 15 Oct 2019, at 20:04, Dan Davydov <ddavy...@twitter.com.invalid>
> > wrote:
> > > Having both a SimpleDagBag representation and the JSON representation
> > > doesn't make sense to me at the moment: *"**Quoting from Airflow code,
> it
> > > is “a simplified representation of a DAG that contains all attributes
> > > required for instantiating and scheduling its associated tasks.”. It
> does
> > > not contain enough information required by the webserver.". *Why not
> > create
> > > a representation that can be used by both? This is going to be a big
> > > headache to both understand and work with in the codebase since it will
> > be
> > > another definition that we need to keep in sync.
> >
> > Honestly: because that makes the change bigger, harder to review and
> > harder to back port. If this is a serious blocker to you agreeing we can
> > update it to use the same structure as part of this PR, but I would
> rather
> > it was a separate change.
> >
> I think this is the kind of core change we should make from the get-go, my
> feeling is that we will be paying for this decision otherwise, and the cost
> will surpass the immediate value. Ideally we would fully understand the
> long-term plan before deciding on this serialization format, but I think at
> least having a single serialized representation for now would be a decent
> compromise for me.
>
> This is still a blocker to me agreeing.
>
>
> >
> > > Not sure if fileloc/fileloc_hash is the right solution, the longterm
> > > solution I am imagining has clients responsible for uploading DAGs
> rather
> > > than retrieving them from the filesystem so fileloc/fileloc_hash
> wouldn't
> > > even exist (dag_id would be used for indexing here).
> >
> > Then we can remove it later, when that code exists. For now so much of
> the
> > rest of the code base assumes DAGs from files on disk so.
> >
> >
> >
> > > Versioning isn't really addressed either (e.g. if a DAG topology
> changes
> > > with some update you want to be able to show both the old and new ones,
> > or
> > > at least have a way to deal with them), there is an argument that this
> is
> > > acceptable since it isn't currently addressed now, but I'm worried that
> > > large schema changes should think through the long term plan a bit
> more.
> >
> > Correct, and this is by design to make the change smaller and easier to
> > review. Adding versioning using the existing serialization format is not
> > that much work once the ground work is here and the webserver is not
> > relying on DAGs from disk. There's also a UI question of how to handle
> this
> > on the Tree view that I don't have an immediate answer for. But yes,
> > versioning of dag run history was in our minds when working on this PR.
> >
> > As long as all of the changes are in the camp of "easy to migrate/undo in
> the future" then these are not blockers for me. I'm not 100% convinced that
> they are, but you have more context so I'll defer to your judgment.
>
>
> > > I feel like getting this wrong is going to make it very hard to migrate
> > > things in the future, and make the codebase worse (another
> representation
> > > of DAGs that we need to support/understand/keep parity for). If I'm
> wrong
> > > about this then I would be more willing to +1 this change. This doc is
> a
> > > 1-2 pager and I feel like it is not thorough or deep enough and doesn't
> > > give me enough confidence that the work in the PR is going to make it
> > > easier to complete the future milestones instead of harder.
> >
> > "But what if we get it wrong" is not something we can act on. So long as
> > you are happy with the choice of JSON as the serialization then since
> we've
> > got a version in the record we can change this relatively easily in the
> > future - we'll know when we're dealing with an older record (we won't
> have
> > to guess) and can write a format migrator (either to upgrade in memory,
> or
> > to upgrade the record in place. And as a db "migration" or as a upgrade
> > command)
> >
> I'm happy with JSON, my main point about "what if we get it wrong" is that
> I feel like more planning work could have been put into this project vs
> just a couple of rough high-level target "milestones". This significantly
> increases the risk of the project long term in my eyes.
>
>
> >
> > We've based our serialization on Uber's Piper work
> > <https://eng.uber.com/managing-data-workflows-at-scale/> (we had a call
> > with Alex at Uber before starting any of this work), so we've piggy-backed
> > off their hard work and months of running something very close to this
> > format in production.
>
>
> >
> This is great to hear!
>
>
> > Also let's not forget: right now we aren't storing versions, and we still
> > have DAGs on disk. If we do get it massively wrong (which I don't think
> we
> > have) we can just drop this entire table and start again. This is all an
> > internal implementation detail that is not exposed via an API. What
> better
> > way to see if this is right than by running it at scale in production
> > across 1000s of customers (Astronomer+Cloud Composer)
> >
> If we have thought through future milestones and made sure that we can
> switch easily to the different formats that could be required for them,
> then I'm not a blocker on this front.
>
>
> > Can you be more specific about what you are worried about here? It's a
> > bit hard to put general "but what if we get it wrong" fears to rest.
> >
> >
> >
> > On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel <alex.guz...@airbnb.com
> > .invalid>
> > > We don't need to have the future plan implemented
> > > completely but it would be nice to see more detailed notes about how
> this
> > > will play out in the future.
> >
> > Fair point. Our rough long-term plan:
> >
> > - Add versioning to the UI, likely by storing a new version linked from
> > the DagRun, (but only when it changes, not every time to avoid DB bloat -
> > multiple dag_run rows will point at the same serialized blob).
> >
> > - Rework the scheduler to use the serialized version, not the Simple*
> > version, assuming we can do it and keep the scheduler performing well
> > (i.e. that ser/deser time isn't significant for the scheduler loop).
> >
> >
> > - Massively rework how scheduler creates TaskInstances (right now this is
> > done in the Dag parsing process, not the main scheduler) and dag runs. We
> > have to keep an eye on scheduler performance as we do this. If we can
> pull
> > this back in to the main scheduler then this leads us towards the next
> > points:
> >
> > - Split out the updating of serialized dag/parsing of DAGs from the
> > scheduler loop so it can be a separate component/subprocess (which is on
> by
> > default). This gets us towards being able to submit DAG definitions via
> an
> > API if we wanted. (There's a lot to work out to get that working, mostly
> > around running the DAG).
> >
> > - Once the scheduler loop/logic is not tied to parsing the DAGs then this
> > also heads us towards being able to run multiple schedulers. (This isn't
> a
> > requirement on HA scheduler, but it makes them easier to "scale out".)
> >
> > Hope this answers people's questions, and thanks for the feedback
> >
> > Alex and Dan: Does this give enough information for you to change your
> > vote?
> >
> I think as long as we combine the serialized DAG representations and have
> some confidence that the data format will be easy to switch to future
> potential formats we would want (kind of hard to do without a more detailed
> future plan...), then I would happily change my vote.
>
>
> > -ash
> >
> > > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl>
> > wrote:
> > >
> > > Big +1 from my side, looking forward to make this happen.
> > >
> > > Two sides that aren't completely clear to me:
> > >
> > >   - Are we going to extend the existing data model, to allow the RDBMS
> to
> > >   optimize queries on fields that we use a lot?
> > >   - How are we going to do state evolution when we extend the JSON
> model
> > >
> > > I have good confidence that we'll solve this along the way.
> > >
> > > Cheers, Fokko
> > >
> > > Op di 15 okt. 2019 om 21:29 schreef Dan Davydov
> > > <ddavy...@twitter.com.invalid>:
> > >
> > >> I have been following it from the beginning as well. I understand
> there
> > >> would be short-term wins for some users (I don't think a huge amount
> of
> > >> users?), but I still feel like we are being a bit short-sighted here
> and
> > >> that we are creating more work for ourselves and potentially our users
> > in
> > >> the future. I also feel like there will be side effects to users as
> > well,
> > >> many of which don't care about the webserver scalability, such as bugs
> > >> caused by the addition of the new webserver representation. I think
> > without
> > >> a design that is much larger in scope I wouldn't feel comfortable
> moving
> > >> forward with this AIP.
> > >>
> > >> On Tue, Oct 15, 2019 at 3:21 PM Jarek Potiuk <
> jarek.pot...@polidea.com>
> > >> wrote:
> > >>
> > >>> Hello Dan, Alex,
> > >>>
> > >>> I believe all the points you make are super-valid ones. But maybe you
> > are
> > >>> missing the full context a bit.
> > >>>
> > >>> I followed the original discussion
> > >>> <https://lists.apache.org/thread.html/a2d426f93c0f4e5f0347627308638b59ca4072fd022a42af1163e34a@%3Cdev.airflow.apache.org%3E>
> > >>> from the very beginning and took part in the initial discussions when
> > >>> this topic was raised. From the discussion it is quite clear to me
> > >>> that this is mostly a "tactical" approach to implement something that
> > >>> is backportable to 1.10 and rather quick to implement. This is
> > >>> targeted to make users more happy with their 1.10 version without the
> > >>> timing uncertainty and effort of migration to 2.0. It solves the major
> > >>> pain point of stability of the UI in case there are complex DAGs for
> > >>> which parsing crashes the webserver.
> > >>> Like in "being nice to your users".
> > >>>
> > >>> There will be a separate effort to make pretty much all of the things
> > you
> > >>> mentioned in 2.0 in a non-backportable way as it requires far too
> many
> > >>> changes in the way how Airflow works internally.
> > >>>
> > >>> Maybe it needs some more explanation + long term plan that follows in
> > the
> > >>> AIP itself to explain it to those who have not followed the initial
> > >>> discussion, but I think it's fully justified change.
> > >>>
> > >>> J.
> > >>>
> > >>> On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel <alex.guz...@airbnb.com
> > >>> .invalid>
> > >>> wrote:
> > >>>
> > >>>> -1 (binding)
> > >>>> Good points made by Dan. We don't need to have the future plan
> > >>> implemented
> > >>>> completely but it would be nice to see more detailed notes about how
> > >> this
> > >>>> will play out in the future. We shouldn't walk into a system that
> > >> causes
> > >>>> more pain in the future. (I can't say for sure that it does, but I
> > >> can't
> > >>>> say that it doesn't either). I don't think the proposal is
> necessarily
> > >>>> wrong or bad, but I think we need some more detailed planning around
> > >>> future
> > >>>> milestones.
> > >>>>
> > >>>> On Tue, Oct 15, 2019 at 12:04 PM Dan Davydov
> > >>> <ddavy...@twitter.com.invalid
> > >>>>>
> > >>>> wrote:
> > >>>>
> > >>>>> -1 (binding), this may sound a bit FUD-y but I don't feel this has
> > >> been
> > >>>>> thought through enough...
> > >>>>>
> > >>>>> Having both a SimpleDagBag representation and the JSON
> representation
> > >>>>> doesn't make sense to me at the moment: *"**Quoting from Airflow
> > >> code,
> > >>> it
> > >>>>> is “a simplified representation of a DAG that contains all
> attributes
> > >>>>> required for instantiating and scheduling its associated tasks.”.
> It
> > >>> does
> > >>>>> not contain enough information required by the webserver.". *Why
> not
> > >>>> create
> > >>>>> a representation that can be used by both? This is going to be a
> big
> > >>>>> headache to both understand and work with in the codebase since it
> > >> will
> > >>>> be
> > >>>>> another definition that we need to keep in sync.
> > >>>>>
> > >>>>> Not sure if fileloc/fileloc_hash is the right solution, the
> longterm
> > >>>>> solution I am imagining has clients responsible for uploading DAGs
> > >>> rather
> > >>>>> than retrieving them from the filesystem so fileloc/fileloc_hash
> > >>> wouldn't
> > >>>>> even exist (dag_id would be used for indexing here).
> > >>>>>
> > >>>>> Versioning isn't really addressed either (e.g. if a DAG topology
> > >>> changes
> > >>>>> with some update you want to be able to show both the old and new
> > >> ones,
> > >>>> or
> > >>>>> at least have a way to deal with them), there is an argument that
> > >> this
> > >>> is
> > >>>>> acceptable since it isn't currently addressed now, but I'm worried
> > >> that
> > >>>>> large schema changes should think through the long term plan a bit
> > >>> more.
> > >>>>>
> > >>>>> I feel like getting this wrong is going to make it very hard to
> > >> migrate
> > >>>>> things in the future, and make the codebase worse (another
> > >>> representation
> > >>>>> of DAGs that we need to support/understand/keep parity for). If I'm
> > >>> wrong
> > >>>>> about this then I would be more willing to +1 this change. This doc
> > >> is
> > >>> a
> > >>>>> 1-2 pager and I feel like it is not thorough or deep enough and
> > >> doesn't
> > >>>>> give me enough confidence that the work in the PR is going to make
> it
> > >>>>> easier to complete the future milestones instead of harder.
> > >>>>>
> > >>>>> On Tue, Oct 15, 2019 at 11:26 AM Kamil Breguła <
> > >>>> kamil.breg...@polidea.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> +1 (binding)
> > >>>>>>
> > >>>>>> On Tue, Oct 15, 2019 at 2:57 AM Kaxil Naik <kaxiln...@gmail.com>
> > >>>> wrote:
> > >>>>>>
> > >>>>>>> Hello, Airflow community,
> > >>>>>>>
> > >>>>>>> This email calls for a vote to add the DAG Serialization feature
> > >> at
> > >>>>>>> https://github.com/apache/airflow/pull/5743.
> > >>>>>>>
> > >>>>>>> *AIP*:
> > >>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> > >>>>>>>
> > >>>>>>> *Previous Mailing List discussion*:
> > >>>>>>> https://lists.apache.org/thread.html/65d282368e0a7c19815badb8b1c6c8d72b0975ce94f601e13af44f74@%3Cdev.airflow.apache.org%3E
> > >>>>>>>
> > >>>>>>> *Authors*: Kaxil Naik, Zhou Fang, Ash Berlin-Taylor
> > >>>>>>>
> > >>>>>>> *Summary*:
> > >>>>>>>
> > >>>>>>>   - DAGs are serialized using JSON format and stored in a
> > >>>>> SerializedDag
> > >>>>>>>   table
> > >>>>>>>   - The Webserver now instead of having to parse the DAG file
> > >>> again,
> > >>>>>>>   reads the serialized DAGs in JSON, de-serializes them and
> > >>> creates
> > >>>>> the
> > >>>>>>>   DagBag and uses it to show in the UI.
> > >>>>>>>   - Instead of loading an entire DagBag when the WebServer
> > >> starts
> > >>> we
> > >>>>>>>   only load each DAG on demand from the Serialized Dag table.
> > >> This
> > >>>>> helps
> > >>>>>>>   reduce Webserver startup time and memory. The reduction is
> > >>> notable
> > >>>>>> when you
> > >>>>>>>   have a large number of DAGs.
> > >>>>>>>   - A JSON Schema has been defined and we validate the
> > >> serialized
> > >>>> dag
> > >>>>>>>   before writing it to the database
> > >>>>>>>
> > >>>>>>> [image: image.png]
> > >>>>>>>
> > >>>>>>> A PR (https://github.com/apache/airflow/pull/5743) is ready for
> > >>>> review
> > >>>>>>> from the committers and community.
> > >>>>>>>
> > >>>>>>> We also have a WIP PR (
> > >> https://github.com/apache/airflow/pull/5992
> > >>> )
> > >>>> to
> > >>>>>>> backport this feature to 1.10.* branch.
> > >>>>>>>
> > >>>>>>> A big thank you to Zhou and Ash for their continuous help in
> > >>>> improving
> > >>>>>>> this feature/PR.
> > >>>>>>>
> > >>>>>>> This email is formally calling for a vote to accept the AIP and
> > >> PR.
> > >>>>>> Please
> > >>>>>>> note that we will update the PR / feature to fix bugs if we find
> > >>> any.
> > >>>>>>>
> > >>>>>>> Cheers,
> > >>>>>>> Kaxil
> > >>>>>>>
> > >>>
> > >>> --
> > >>>
> > >>> Jarek Potiuk
> > >>> Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >>>
> > >>> M: +48 660 796 129 <+48660796129>
> > >>> [image: Polidea] <https://www.polidea.com/>
> > >>>
> > >>
> >
> >
>
