Thanks for the feedback and discussion everyone - it's nice to know people feel 
strongly about this. Let's make sure we build this right!

As a reminder, these are the main goals of this AIP:

- To speed up bootup/worker recycle time of the webserver

  If you have a large number of DAGs, or highly dynamic DAGs, starting your 
webserver can need a Gunicorn worker timeout of 5 or 10 minutes. (It is worse 
at startup because each Gunicorn worker parses the DAGs itself, in exactly the 
same order and at the same time as the other workers, leading to disk 
contention and CPU bursts.)

  And just yesterday I helped two people diagnose a problem that was caused by 
slow NFS: `time airflow list_dags` took 3:35 mins to go over 43 DAGs.

  At Astronomer and at Google (Cloud Composer) we've seen this slow loading 
many times across our customers' installs.

  Not to mention this makes the webserver snappier overall.

- To make the webserver almost "stateless" (not the best term for it: 
"diskless" perhaps? "DagBag-less"?).

  This makes actions on the webserver quicker. We had a number of PRs in the 
1.10 series that made views and API endpoints in the webserver quicker by 
making them load only one DAG from disk. This takes it further and means that 
(as of the PR) only the Rendered tab needs to go to the code on disk; a sketch 
of the on-demand load follows this list. As a reminder, the "trigger_dag" 
endpoint would previously load all DAGs. Others still do, and multiple people 
have complained about this on Slack.

  (The only remaining view that has to hit the code on disk is the Rendered 
template tab: right now a DAG can define custom macros or filters, and we 
don't want to have to deal with serializing code. The only "real" way of doing 
that is CloudPickle. No thank you.)
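
To illustrate that second goal, here is a minimal sketch of what the on-demand 
load looks like, written directly against SQLAlchemy. The table and names are 
illustrative, not the exact schema in the PR:

```python
# Sketch only: one indexed lookup replaces re-parsing the whole DAG folder.
import json

from sqlalchemy import Column, String, Text
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class SerializedDag(Base):
    """Illustrative stand-in for the AIP's SerializedDag table."""
    __tablename__ = "serialized_dag"
    dag_id = Column(String(250), primary_key=True)
    fileloc = Column(String(2000), nullable=False)
    data = Column(Text, nullable=False)  # the JSON blob

def get_dag(session: Session, dag_id: str) -> dict:
    """Fetch and deserialize one DAG on demand, instead of a full DagBag."""
    row = session.query(SerializedDag).filter_by(dag_id=dag_id).one()
    return json.loads(row.data)
```

The point is that each webserver view issues a single primary-key query 
rather than loading every DAG file from disk at startup.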


To address the points raised:


On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> - Are we going to extend the existing data model, to allow the RDBMS to
> optimize queries on fields that we use a lot?

Yes, as needed. It's not really needed for this PR yet (it doesn't change any 
query patterns in the webserver, so the views keep using the existing tables).

> - How are we going to do state evolution when we extend the JSON model
> 

The top-level object we're storing has a `__version` field (for example 
`{"__version": 1, "dag": { ... } }`), so we can detect older versions and 
either run a db "migration" on install or upgrade at load time.
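
To make that concrete, a minimal sketch of the load-time upgrade path (the 
per-version upgrade step is hypothetical, just to show the shape):

```python
import json

CURRENT_VERSION = 1

def _upgrade_v0_to_v1(dag_dict: dict) -> dict:
    """Hypothetical per-version migration: rewrite renamed/moved fields here."""
    return dag_dict

# One upgrade function per historical version, applied in sequence.
UPGRADES = {0: _upgrade_v0_to_v1}

def load_serialized_dag(blob: str) -> dict:
    record = json.loads(blob)
    version = record.get("__version", 0)
    # Walk the record forward one version at a time until it is current.
    while version < CURRENT_VERSION:
        record["dag"] = UPGRADES[version](record["dag"])
        version += 1
    return record["dag"]
```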



On 15 Oct 2019, at 20:04, Dan Davydov <ddavy...@twitter.com.invalid> wrote:
> Having both a SimpleDagBag representation and the JSON representation
> doesn't make sense to me at the moment: "Quoting from Airflow code, it
> is “a simplified representation of a DAG that contains all attributes
> required for instantiating and scheduling its associated tasks.”. It does
> not contain enough information required by the webserver." Why not create
> a representation that can be used by both? This is going to be a big
> headache to both understand and work with in the codebase since it will be
> another definition that we need to keep in sync.

Honestly: because that makes the change bigger, harder to review, and harder 
to backport. If this is a serious blocker to your agreement, we can update it 
to use the same structure as part of this PR, but I would rather that were a 
separate change.

> Not sure if fileloc/fileloc_hash is the right solution, the longterm
> solution I am imagining has clients responsible for uploading DAGs rather
> than retrieving them from the filesystem so fileloc/fileloc_hash wouldn't
> even exist (dag_id would be used for indexing here).

Then we can remove it later, when that code exists. For now, so much of the 
rest of the codebase assumes DAGs come from files on disk that fileloc has to 
stay.



> Versioning isn't really addressed either (e.g. if a DAG topology changes
> with some update you want to be able to show both the old and new ones, or
> at least have a way to deal with them), there is an argument that this is
> acceptable since it isn't currently addressed now, but I'm worried that
> large schema changes should think through the long term plan a bit more.

Correct, and this is by design, to keep the change smaller and easier to 
review. Adding versioning using the existing serialization format is not that 
much work once the groundwork is here and the webserver no longer relies on 
DAGs from disk. There's also a UI question of how to handle this on the Tree 
view that I don't have an immediate answer for. But yes, versioning of dag run 
history was on our minds when working on this PR.

> I feel like getting this wrong is going to make it very hard to migrate
> things in the future, and make the codebase worse (another representation
> of DAGs that we need to support/understand/keep parity for). If I'm wrong
> about this then I would be more willing to +1 this change. This doc is a
> 1-2 pager and I feel like it is not thorough or deep enough and doesn't
> give me enough confidence that the work in the PR is going to make it
> easier to complete the future milestones instead of harder.

"But what if we get it wrong" is not something we can act on. So long as you 
are happy with the choice of JSON as the serialization then since we've got a 
version in the record we can change this relatively easily in the future - 
we'll know when we're dealing with an older record (we won't have to guess) and 
can write a format migrator (either to upgrade in memory, or to upgrade the 
record in place. And as a db "migration" or as a upgrade command)
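
As a sketch of the "upgrade the record in place" option, something like this 
could run as a one-off pass from a db migration or an upgrade command (the 
table layout and helper names are illustrative):

```python
import json

from sqlalchemy import text

def upgrade_serialized_dags(engine, upgrade_fn, current_version: int) -> None:
    """Rewrite stale serialized records in place, in one table scan."""
    with engine.begin() as conn:
        rows = conn.execute(
            text("SELECT dag_id, data FROM serialized_dag")
        ).fetchall()
        for dag_id, blob in rows:
            record = json.loads(blob)
            if record.get("__version", 0) >= current_version:
                continue  # record is already current
            record = upgrade_fn(record)  # same per-version idea as above
            conn.execute(
                text("UPDATE serialized_dag SET data = :data WHERE dag_id = :id"),
                {"data": json.dumps(record), "id": dag_id},
            )
```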

We've based our serialization on Uber's Piper work 
<https://eng.uber.com/managing-data-workflows-at-scale/> (we had a call with 
Alex at Uber before starting any of this work), so we've piggybacked off their 
hard work and months of running something very close to this format in 
production.

Also, let's not forget: right now we aren't storing versions, and we still 
have the DAGs on disk. If we do get it massively wrong (which I don't think we 
have) we can just drop this entire table and start again. This is all an 
internal implementation detail that is not exposed via an API. What better way 
to see if this is right than by running it at scale in production across 
thousands of customers (Astronomer + Cloud Composer)?

Can you be more specific about what you are worried about here? It's a bit 
hard to put general "but what if we get it wrong" fears to rest.



On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel <alex.guz...@airbnb.com.invalid> wrote:
> We don't need to have the future plan implemented
> completely but it would be nice to see more detailed notes about how this
> will play out in the future.

Fair point. Our rough long-term plan:

- Add versioning to the UI, likely by storing a new version linked from the 
DagRun, but only when the DAG changes, not on every run, to avoid DB bloat: 
multiple dag_run rows will point at the same serialized blob. (A rough sketch 
of this follows the list.)

- Rework the scheduler to use the serialized version, not the Simple* version, 
assuming we can do it and keep the scheduler performing well (i.e. that 
ser/deser time isn't significant for the scheduler loop).


- Massively rework how the scheduler creates TaskInstances (right now this is 
done in the DAG parsing process, not the main scheduler) and DagRuns. We have 
to keep an eye on scheduler performance as we do this. If we can pull this 
back into the main scheduler then it leads us towards the next points:

- Split out the updating of serialized DAGs/parsing of DAGs from the scheduler 
loop so it can be a separate component/subprocess (on by default). This gets 
us towards being able to submit DAG definitions via an API if we wanted. 
(There's a lot to work out to get that working, mostly around running the 
DAG.)

- Once the scheduler loop/logic is not tied to parsing the DAGs, this also 
heads us towards being able to run multiple schedulers. (This isn't a 
requirement for an HA scheduler, but it makes them easier to "scale out".)
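
On the first point above, the usual trick for "only when it changes" is to key 
versions by a hash of the serialized blob, so unchanged DAGs cost nothing and 
many dag_run rows can reference the same blob. A rough sketch (the model here 
is hypothetical, not the PR's schema):

```python
import hashlib
import json

from sqlalchemy import Column, String, Text
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class SerializedDagVersion(Base):
    """Hypothetical versioned table: one row per distinct DAG structure."""
    __tablename__ = "serialized_dag_version"
    dag_id = Column(String(250), primary_key=True)
    dag_hash = Column(String(40), primary_key=True)  # sha1 of the JSON blob
    data = Column(Text, nullable=False)

def maybe_store_new_version(session: Session, dag_id: str, dag_dict: dict) -> str:
    """Insert a row only if the DAG changed; return the hash a DagRun would store."""
    blob = json.dumps(dag_dict, sort_keys=True)
    dag_hash = hashlib.sha1(blob.encode("utf-8")).hexdigest()
    exists = (
        session.query(SerializedDagVersion)
        .filter_by(dag_id=dag_id, dag_hash=dag_hash)
        .first()
    )
    if exists is None:
        session.add(SerializedDagVersion(dag_id=dag_id, dag_hash=dag_hash, data=blob))
    return dag_hash
```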

Hope this answers people's questions, and thanks for the feedback!

Alex and Dan: Does this give enough information for you to change your vote?

-ash

> On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> 
> Big +1 from my side, looking forward to make this happen.
> 
> Two sides that aren't completely clear to me:
> 
>   - Are we going to extend the existing data model, to allow the RDBMS to
>   optimize queries on fields that we use a lot?
>   - How are we going to do state evolution when we extend the JSON model
> 
> I have good confidence that we'll solve this along the way.
> 
> Cheers, Fokko
> 
> On Tue, 15 Oct 2019 at 21:29, Dan Davydov
> <ddavy...@twitter.com.invalid> wrote:
> 
>> I have been following it from the beginning as well. I understand there
>> would be short-term wins for some users (I don't think a huge amount of
>> users?), but I still feel like we are being a bit short-sighted here and
>> that we are creating more work for ourselves and potentially our users in
>> the future. I also feel like there will be side effects to users as well,
>> many of which don't care about the webserver scalability, such as bugs
>> caused by the addition of the new webserver representation. I think without
>> a design that is much larger in scope I wouldn't feel comfortable moving
>> forward with this AIP.
>> 
>> On Tue, Oct 15, 2019 at 3:21 PM Jarek Potiuk <jarek.pot...@polidea.com>
>> wrote:
>> 
>>> Hello Dan, Alex,
>>> 
>>> I believe all the points you make are super-valid ones. But maybe you are
>>> missing the full context a bit.
>>> 
>>> I followed the original discussion
>>> <https://lists.apache.org/thread.html/a2d426f93c0f4e5f0347627308638b59ca4072fd022a42af1163e34a@%3Cdev.airflow.apache.org%3E>
>>> from the very beginning and took part in the initial discussions when this
>>> topic was raised. From the discussion it is quite clear to me that this is
>>> mostly a "tactical" approach to implement something that is backportable to
>>> 1.10 and rather quick to implement. This is targeted to make users more
>>> happy with their 1.10 version without the timing uncertainty and effort of
>>> migration to 2.0. It solves the major pain point of stability of the UI in
>>> case there are complex DAGs for which parsing crashes the webserver.
>>> Like in "being nice to your users".
>>> 
>>> There will be a separate effort to make pretty much all of the things you
>>> mentioned in 2.0 in a non-backportable way as it requires far too many
>>> changes in the way how Airflow works internally.
>>> 
>>> Maybe it needs some more explanation + long term plan that follows in the
>>> AIP itself to explain it to those who have not followed the initial
>>> discussion, but I think it's fully justified change.
>>> 
>>> J.
>>> 
>>> On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel <alex.guz...@airbnb.com.invalid>
>>> wrote:
>>> 
>>>> -1 (binding)
>>>> Good points made by Dan. We don't need to have the future plan implemented
>>>> completely but it would be nice to see more detailed notes about how this
>>>> will play out in the future. We shouldn't walk into a system that causes
>>>> more pain in the future. (I can't say for sure that it does, but I can't
>>>> say that it doesn't either). I don't think the proposal is necessarily
>>>> wrong or bad, but I think we need some more detailed planning around future
>>>> milestones.
>>>> 
>>>> On Tue, Oct 15, 2019 at 12:04 PM Dan Davydov <ddavy...@twitter.com.invalid>
>>>> wrote:
>>>> 
>>>>> -1 (binding), this may sound a bit FUD-y but I don't feel this has been
>>>>> thought through enough...
>>>>> 
>>>>> Having both a SimpleDagBag representation and the JSON representation
>>>>> doesn't make sense to me at the moment: "Quoting from Airflow code, it
>>>>> is “a simplified representation of a DAG that contains all attributes
>>>>> required for instantiating and scheduling its associated tasks.”. It does
>>>>> not contain enough information required by the webserver." Why not create
>>>>> a representation that can be used by both? This is going to be a big
>>>>> headache to both understand and work with in the codebase since it will be
>>>>> another definition that we need to keep in sync.
>>>>> 
>>>>> Not sure if fileloc/fileloc_hash is the right solution, the longterm
>>>>> solution I am imagining has clients responsible for uploading DAGs rather
>>>>> than retrieving them from the filesystem so fileloc/fileloc_hash wouldn't
>>>>> even exist (dag_id would be used for indexing here).
>>>>> 
>>>>> Versioning isn't really addressed either (e.g. if a DAG topology changes
>>>>> with some update you want to be able to show both the old and new ones, or
>>>>> at least have a way to deal with them), there is an argument that this is
>>>>> acceptable since it isn't currently addressed now, but I'm worried that
>>>>> large schema changes should think through the long term plan a bit more.
>>>>> 
>>>>> I feel like getting this wrong is going to make it very hard to migrate
>>>>> things in the future, and make the codebase worse (another representation
>>>>> of DAGs that we need to support/understand/keep parity for). If I'm wrong
>>>>> about this then I would be more willing to +1 this change. This doc is a
>>>>> 1-2 pager and I feel like it is not thorough or deep enough and doesn't
>>>>> give me enough confidence that the work in the PR is going to make it
>>>>> easier to complete the future milestones instead of harder.
>>>>> 
>>>>> On Tue, Oct 15, 2019 at 11:26 AM Kamil Breguła <kamil.breg...@polidea.com>
>>>>> wrote:
>>>>> 
>>>>>> +1 (binding)
>>>>>> 
>>>>>> On Tue, Oct 15, 2019 at 2:57 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
>>>>>> 
>>>>>>> Hello, Airflow community,
>>>>>>> 
>>>>>>> This email calls for a vote to add the DAG Serialization feature at
>>>>>>> https://github.com/apache/airflow/pull/5743.
>>>>>>> 
>>>>>>> *AIP*:
>>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
>>>>>>> 
>>>>>>> *Previous Mailing List discussion*:
>>>>>>> https://lists.apache.org/thread.html/65d282368e0a7c19815badb8b1c6c8d72b0975ce94f601e13af44f74@%3Cdev.airflow.apache.org%3E
>>>>>>> 
>>>>>>> *Authors*: Kaxil Naik, Zhou Fang, Ash Berlin-Taylor
>>>>>>> 
>>>>>>> *Summary*:
>>>>>>> 
>>>>>>>   - DAGs are serialized using JSON format and stored in a SerializedDag
>>>>>>>   table
>>>>>>>   - The Webserver now instead of having to parse the DAG file again,
>>>>>>>   reads the serialized DAGs in JSON, de-serializes them and creates the
>>>>>>>   DagBag and uses it to show in the UI.
>>>>>>>   - Instead of loading an entire DagBag when the WebServer starts we
>>>>>>>   only load each DAG on demand from the Serialized Dag table. This helps
>>>>>>>   reduce Webserver startup time and memory. The reduction is notable
>>>>>>>   when you have a large number of DAGs.
>>>>>>>   - A JSON Schema has been defined and we validate the serialized dag
>>>>>>>   before writing it to the database
>>>>>>>
>>>>>>> A PR (https://github.com/apache/airflow/pull/5743) is ready for review
>>>>>>> from the committers and community.
>>>>>>> 
>>>>>>> We also have a WIP PR (https://github.com/apache/airflow/pull/5992) to
>>>>>>> backport this feature to 1.10.* branch.
>>>>>>> 
>>>>>>> A big thank you to Zhou and Ash for their continuous help in improving
>>>>>>> this feature/PR.
>>>>>>> 
>>>>>>> This email is formally calling for a vote to accept the AIP and PR.
>>>>>>> Please note that we will update the PR / feature to fix bugs if we find
>>>>>>> any.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Kaxil
>>> 
>>> 
>>> --
>>> 
>>> Jarek Potiuk
>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>> 
>>> M: +48 660 796 129
>>> 
>> 
