Hi guys,
I have addressed all comments on the PR: https://github.com/apache/airflow/pull/5743

Can we merge this PR please, if everything looks good and is approved by the
committers? It is becoming increasingly difficult to rebase on master and
resolve conflicts. I also have the backport PR
(https://github.com/apache/airflow/pull/5992) ready, and that too is becoming
a nightmare to maintain.

Regards,
Kaxil

On Wed, Oct 23, 2019, 23:39 Kaxil Naik <kaxiln...@gmail.com> wrote:

> This vote passed (although not unanimously) and I'll mark this AIP as
> accepted.
>
> *Result*:
> +1 votes: 7 (6 binding and 1 non-binding vote)
> -1 votes: 2 (2 binding and 0 non-binding votes)
>
> *+1 (binding)*:
> Kaxil Naik
> Ash Berlin-Taylor
> Jarek Potiuk
> Kamil Breguła
> Fokko Driesprong
> Sumit Maheshwari
>
> *+1 (non-binding)*:
> Philippe Gagnon
>
> *-1 (binding)*:
> Dan Davydov
> Alex Guziel
>
> On Thu, Oct 17, 2019 at 5:29 PM Dan Davydov <ddavy...@twitter.com> wrote:
>
>> Not sure I'm convinced; I think my core concern of maintaining two
>> representations, and not having a thought-out future plan of how the
>> schema will evolve and how easy this migration will be, still stands.
>> It feels like we are trying to chase some short-term wins here. But
>> hey, it's a democracy, don't let my vote block you :)!
>>
>> On Thu, Oct 17, 2019 at 10:15 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
>>
>>> Hi Dan, I understand your concern. Your +1 and suggestions are very
>>> important for us, so let me try to explain in more detail and see if
>>> I can convince you.
>>>
>>> Please check replies in-line.
>>>
>>>> I think this is the kind of core change we should make from the
>>>> get-go; my feeling is that we will be paying for this decision
>>>> otherwise, and the cost will surpass the immediate value. Ideally we
>>>> would fully understand the long-term plan before deciding on this
>>>> serialization format, but I think at least having a single
>>>> serialized representation for now would be a decent compromise for
>>>> me. This is still a blocker to me agreeing.
>>>
>>> Trying to do it all (serialization for the Webserver, Scheduler, etc.)
>>> at once would be error-prone, as we would need to cover all cases, and
>>> it is very difficult to test and review such a large chunk of work
>>> even if we have multiple people reviewing it.
>>>
>>> Doing it iteratively would allow us to roll out serialization while
>>> keeping backwards compatibility. Keeping backwards compatibility for
>>> now is very important, at least for the Webserver, as users would
>>> otherwise have to wait for months or, at worst, one more year (plus
>>> the upgrade path from 1.10.* to 2.0 is already looking quite huge).
>>>
>>> Airflow 2.0 is still going to take months and is already going to have
>>> a number of breaking changes that are going to make updating to it a
>>> cumbersome task:
>>>
>>> - No non-RBAC UI
>>> - No Py2
>>> - Import paths have been (or are in the process of being) changed for
>>>   all/most contrib objects
>>> - CLI reorganization
>>> - and various others:
>>>   https://github.com/apache/airflow/blob/master/UPDATING.md#airflow-master
>>>
>>> I am listing the above for one reason: "how long would users have to
>>> deal with the Webserver scalability issue" before they upgrade to 2.0!
>>>
>>> However, I agree with your argument that we should ideally understand
>>> the long-term plan.
>>>
>>> The current serialized representation of a DAG looks like the
>>> following:
>>> https://github.com/apache/airflow/blob/a6f37cd0da0774dff6461d3fdcb94dee9d101fda/tests/serialization/test_dag_serialization.py#L40-L93
>>>
>>> As you can see, most of the information required by SimpleTask is
>>> already present. Currently, SimpleTask is used by the Scheduler. The
>>> next phase of this AIP will be to use serialization in the Scheduler
>>> and remove SimpleTask altogether.
>>>
>>> Since you said that at least having a single serialized representation
>>> for now would be a decent compromise, please suggest or let us know if
>>> you think we are missing anything in the serialized blob example above.
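To give a feel for the format without chasing the link, here is a heavily
trimmed sketch of such a blob. Field names and values are illustrative only;
the linked test file is authoritative for the exact schema:

    {
        "__version": 1,
        "dag": {
            "_dag_id": "simple_dag",
            "fileloc": "/path/to/simple_dag.py",
            "timezone": "UTC",
            "tasks": [
                {
                    "_task_type": "BashOperator",
                    "task_id": "bash_task",
                    "owner": "airflow",
                    "retries": 1,
                    "template_fields": ["bash_command", "env"],
                    "_downstream_task_ids": []
                }
            ]
        }
    }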
>>>
>>> ------------------------
>>>
>>>> As long as all of the changes are in the camp of "easy to
>>>> migrate/undo in the future" then these are not blockers for me. I'm
>>>> not 100% convinced that they are, but you have more context so I'll
>>>> defer to your judgment.
>>>
>>> Currently, as the serialization applies only to the webserver, if
>>> something goes wrong, changing `store_serialized_dags=False` would
>>> restore the no-serialization behavior.
>>>
>>> We are not touching the Scheduler for 1.10.*, as it is the actual
>>> heart of Airflow, and we wouldn't feel comfortable changing that core
>>> in a minor/patch release.
>>>
>>> ------------------------
>>>
>>>> I'm happy with JSON; my main point about "what if we get it wrong"
>>>> is that I feel like more planning work could have been put into this
>>>> project vs just a couple of rough high-level target "milestones".
>>>> This significantly increases the risk of the project long term in my
>>>> eyes.
>>>
>>> Getting all the things right all at once would be harder than taking
>>> it piece by piece. We have discovered many things in the Webserver
>>> that we initially did not take into account, like "Operator Links".
>>> Even the Scheduler will have something similar that we won't know now
>>> but find out later, which would then require re-working and going
>>> back and forth.
>>>
>>> ------------------------
>>>
>>>> I think as long as we combine the serialized DAG representations and
>>>> have some confidence that the data format will be easy to switch to
>>>> future potential formats we would want (kind of hard to do without a
>>>> more detailed future plan...), then I would happily change my vote.
>>>
>>> For storing versioned DAGs, this is just a rough idea and it can
>>> change:
>>>
>>> We have a last_updated_time column in the Serialized Dag table.
>>> Currently we just update that DAG's row when we process the DAG
>>> again; instead, we would store it as a different row if the DAG has
>>> changed and if it already had a DagRun with the previous version,
>>> i.e. we don't need to store a new version every time the DAG changes
>>> if it hasn't run in the meantime.
>>>
>>> We can assign a unique number (version) to each row for a DAG.
>>> Sorting by the `last_updated_time` column then gives us how the DAG
>>> was updated over time.
>>>
>>> Regards,
>>> Kaxil
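To make the versioning idea above concrete, a rough self-contained sketch of
that write path. The in-memory TABLE, the HAS_RUN set, and the function name
are hypothetical stand-ins, not the actual implementation:

    import hashlib
    import json

    # Hypothetical stand-in for the SerializedDag table:
    # dag_id -> list of {"version", "dag_hash", "data"} rows.
    TABLE = {}
    HAS_RUN = set()  # (dag_id, version) pairs that have a DagRun

    def write_serialized_dag(dag_id, blob):
        """Store a new version row only when the DAG changed and the
        previous version has actually been used by a DagRun."""
        dag_hash = hashlib.sha1(
            json.dumps(blob, sort_keys=True).encode()
        ).hexdigest()
        rows = TABLE.setdefault(dag_id, [])
        if not rows:
            rows.append({"version": 1, "dag_hash": dag_hash, "data": blob})
        elif rows[-1]["dag_hash"] != dag_hash:
            if (dag_id, rows[-1]["version"]) in HAS_RUN:
                # The old version was run: keep it for history.
                rows.append({"version": rows[-1]["version"] + 1,
                             "dag_hash": dag_hash, "data": blob})
            else:
                # The latest version never ran: overwrite it in place.
                rows[-1].update(dag_hash=dag_hash, data=blob)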
>>>
>>> On Wed, Oct 16, 2019 at 4:43 PM Dan Davydov
>>> <ddavy...@twitter.com.invalid> wrote:
>>>
>>>> Responses inline.
>>>>
>>>> On Wed, Oct 16, 2019 at 6:07 AM Ash Berlin-Taylor <a...@apache.org>
>>>> wrote:
>>>>
>>>> > Thanks for the feedback and discussion everyone - it's nice to know
>>>> > people feel strongly about this. Let's make sure we build this
>>>> > right!
>>>> >
>>>> > As a reminder, the main goals of this AIP:
>>>> >
>>>> > - To speed up boot-up/worker recycle time of the webserver.
>>>> >
>>>> > If you have a large number of DAGs or highly dynamic DAGs, it is
>>>> > possible that starting your webserver needs a Gunicorn worker
>>>> > timeout of 5 or 10 minutes. (It is worse at start-up because each
>>>> > of the gunicorn workers parses the dags itself, in exactly the same
>>>> > order, at the same time as the other workers, leading to disk
>>>> > contention and CPU burst.)
>>>> >
>>>> > And just yesterday I helped two people diagnose a problem that was
>>>> > caused by slow NFS: `time airflow list_dags` took 3m35s to go over
>>>> > 43 dags.
>>>> >
>>>> > At Astronomer and Google (Cloud Composer) we've seen this slow
>>>> > loading many times on our customers' installs.
>>>> >
>>>> > And not to mention this makes the webserver snappier.
>>>> >
>>>> > - To make the webserver almost "stateless" (not the best term for
>>>> > it: diskless perhaps? Dagbag-less?).
>>>> >
>>>> > This makes actions on the webserver quicker. We had a number of PRs
>>>> > in the 1.10 series that made views and API endpoints in the
>>>> > webserver quicker by making them only load one dag from disk. This
>>>> > takes it further and means that (as of the PR) only the Rendered
>>>> > tab needs to go to the code on disk. As a reminder, the
>>>> > "trigger_dag" endpoint would previously load all dags. Others still
>>>> > do. Multiple people have complained about this on Slack.
>>>> >
>>>> > (The only remaining view that has to hit the code on disk is the
>>>> > Rendered template tab, as right now a DAG can define custom macros
>>>> > or filters, and we don't want to have to deal with serializing
>>>> > code. The only "real" way of doing that is CloudPickle. No thank
>>>> > you.)
>>>> >
>>>> > To address the points:
>>>> >
>>>> > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl>
>>>> > wrote:
>>>> > > - Are we going to extend the existing data model, to allow the
>>>> > > RDBMS to optimize queries on fields that we use a lot?
>>>> >
>>>> > Yes, as needed. Not really needed for this PR yet (it doesn't
>>>> > change any query patterns in the webserver, so they are using the
>>>> > existing tables).
>>>> >
>>>> > > - How are we going to do state evolution when we extend the JSON
>>>> > > model?
>>>> >
>>>> > The top-level object we're storing has a __version field (for
>>>> > example `{"__version": 1, "dag": { ... } }`) so we can detect older
>>>> > versions and either run a db "migration" on install, or upgrade at
>>>> > load time.
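A minimal sketch of what the "upgrade at load time" path could look like,
keyed off that __version envelope. The v1-to-v2 migration step here is a
made-up example, not a real schema change:

    import json

    CURRENT_VERSION = 2

    def _upgrade_v1_to_v2(blob):
        # Hypothetical migration: v2 might add a field that v1 lacked.
        blob.setdefault("dag", {}).setdefault("tags", [])
        blob["__version"] = 2
        return blob

    UPGRADES = {1: _upgrade_v1_to_v2}

    def deserialize(json_str):
        """Parse a serialized blob, upgrading older versions in memory."""
        blob = json.loads(json_str)
        while blob.get("__version", 1) < CURRENT_VERSION:
            blob = UPGRADES[blob.get("__version", 1)](blob)
        return blob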
>>>> >
>>>> > On 15 Oct 2019, at 20:04, Dan Davydov <ddavy...@twitter.com.invalid>
>>>> > wrote:
>>>> > > Having both a SimpleDagBag representation and the JSON
>>>> > > representation doesn't make sense to me at the moment: *"Quoting
>>>> > > from Airflow code, it is “a simplified representation of a DAG
>>>> > > that contains all attributes required for instantiating and
>>>> > > scheduling its associated tasks.” It does not contain enough
>>>> > > information required by the webserver."* Why not create a
>>>> > > representation that can be used by both? This is going to be a
>>>> > > big headache to both understand and work with in the codebase
>>>> > > since it will be another definition that we need to keep in sync.
>>>> >
>>>> > Honestly: because that makes the change bigger, harder to review
>>>> > and harder to backport. If this is a serious blocker to you
>>>> > agreeing, we can update it to use the same structure as part of
>>>> > this PR, but I would rather it was a separate change.
>>>> >
>>>> I think this is the kind of core change we should make from the
>>>> get-go; my feeling is that we will be paying for this decision
>>>> otherwise, and the cost will surpass the immediate value. Ideally we
>>>> would fully understand the long-term plan before deciding on this
>>>> serialization format, but I think at least having a single serialized
>>>> representation for now would be a decent compromise for me.
>>>>
>>>> This is still a blocker to me agreeing.
>>>>
>>>> > > Not sure if fileloc/fileloc_hash is the right solution; the
>>>> > > long-term solution I am imagining has clients responsible for
>>>> > > uploading DAGs rather than retrieving them from the filesystem,
>>>> > > so fileloc/fileloc_hash wouldn't even exist (dag_id would be used
>>>> > > for indexing here).
>>>> >
>>>> > Then we can remove it later, when that code exists. For now, so
>>>> > much of the rest of the code base assumes DAGs come from files on
>>>> > disk.
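For context, a fileloc_hash is just an indexable fingerprint of the DAG file
path. One way it could be computed (illustrative only, not necessarily how
Airflow does it):

    import hashlib

    def fileloc_hash(fileloc: str) -> int:
        """Stable hash of a DAG file path, truncated so it fits in a
        signed 64-bit integer column and can be indexed cheaply."""
        digest = hashlib.md5(fileloc.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") >> 1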
>>>> >
>>>> > > Versioning isn't really addressed either (e.g. if a DAG topology
>>>> > > changes with some update, you want to be able to show both the
>>>> > > old and new ones, or at least have a way to deal with them).
>>>> > > There is an argument that this is acceptable since it isn't
>>>> > > addressed now, but I'm worried that large schema changes should
>>>> > > think through the long-term plan a bit more.
>>>> >
>>>> > Correct, and this is by design, to make the change smaller and
>>>> > easier to review. Adding versioning using the existing
>>>> > serialization format is not that much work once the groundwork is
>>>> > here and the webserver is not relying on DAGs from disk. There's
>>>> > also a UI question of how to handle this on the Tree view that I
>>>> > don't have an immediate answer for. But yes, versioning of dag run
>>>> > history was on our minds when working on this PR.
>>>> >
>>>> As long as all of the changes are in the camp of "easy to
>>>> migrate/undo in the future" then these are not blockers for me. I'm
>>>> not 100% convinced that they are, but you have more context so I'll
>>>> defer to your judgment.
>>>>
>>>> > > I feel like getting this wrong is going to make it very hard to
>>>> > > migrate things in the future, and make the codebase worse
>>>> > > (another representation of DAGs that we need to
>>>> > > support/understand/keep parity for). If I'm wrong about this then
>>>> > > I would be more willing to +1 this change. This doc is a 1-2
>>>> > > pager and I feel like it is not thorough or deep enough and
>>>> > > doesn't give me enough confidence that the work in the PR is
>>>> > > going to make it easier to complete the future milestones instead
>>>> > > of harder.
>>>> >
>>>> > "But what if we get it wrong" is not something we can act on. So
>>>> > long as you are happy with the choice of JSON as the serialization,
>>>> > then since we've got a version in the record we can change this
>>>> > relatively easily in the future - we'll know when we're dealing
>>>> > with an older record (we won't have to guess) and can write a
>>>> > format migrator (either to upgrade in memory, or to upgrade the
>>>> > record in place, as a db "migration" or an upgrade command).
>>>> >
>>>> I'm happy with JSON; my main point about "what if we get it wrong" is
>>>> that I feel like more planning work could have been put into this
>>>> project vs just a couple of rough high-level target "milestones".
>>>> This significantly increases the risk of the project long term in my
>>>> eyes.
>>>>
>>>> > We've based our serialization on Uber's Piper work
>>>> > <https://eng.uber.com/managing-data-workflows-at-scale/> (we had a
>>>> > call with Alex at Uber before starting any of this work), so we've
>>>> > piggy-backed off their hard work and months of running something
>>>> > very close to this format in production.
>>>>
>>>> This is great to hear!
>>>>
>>>> > Also, let's not forget: right now we aren't storing versions, and
>>>> > we still have DAGs on disk. If we do get it massively wrong (which
>>>> > I don't think we have) we can just drop this entire table and start
>>>> > again. This is all an internal implementation detail that is not
>>>> > exposed via an API. What better way to see if this is right than by
>>>> > running it at scale in production across 1000s of customers
>>>> > (Astronomer + Cloud Composer)?
>>>> >
>>>> If we have thought through future milestones and made sure that we
>>>> can easily switch to the different formats that could be required for
>>>> them, then I'm not a blocker on this front.
>>>>
>>>> > Can you be more specific about what you are worried about here?
>>>> > It's a bit hard to put general "but what if we get it wrong" fears
>>>> > to rest.
>>>> >
>>>> > On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel
>>>> > <alex.guz...@airbnb.com.invalid> wrote:
>>>> > > We don't need to have the future plan implemented completely but
>>>> > > it would be nice to see more detailed notes about how this will
>>>> > > play out in the future.
>>>> >
>>>> > Fair point. Our rough long-term plan:
>>>> >
>>>> > - Add versioning to the UI, likely by storing a new version linked
>>>> > from the DagRun (but only when it changes, not every time, to avoid
>>>> > DB bloat - multiple dag_run rows will point at the same serialized
>>>> > blob).
>>>> >
>>>> > - Rework the scheduler to use the serialized version, not the
>>>> > Simple* version, assuming we can do it and keep the scheduler
>>>> > performing well (i.e. that ser/deser time isn't significant for the
>>>> > scheduler loop).
>>>> >
>>>> > - Massively rework how the scheduler creates TaskInstances (right
>>>> > now this is done in the Dag parsing process, not the main
>>>> > scheduler) and dag runs. We have to keep an eye on scheduler
>>>> > performance as we do this. If we can pull this back into the main
>>>> > scheduler then this leads us towards the next points:
>>>> >
>>>> > - Split out the updating of serialized dags/parsing of DAGs from
>>>> > the scheduler loop so it can be a separate component/subprocess
>>>> > (which is on by default). This gets us towards being able to submit
>>>> > DAG definitions via an API if we wanted. (There's a lot to work out
>>>> > to get that working, mostly around running the DAG.)
>>>> >
>>>> > - Once the scheduler loop/logic is not tied to parsing the DAGs,
>>>> > this also heads us towards being able to run multiple schedulers.
>>>> > (This isn't a requirement for an HA scheduler, but it makes them
>>>> > easier to "scale out".)
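The first bullet's "multiple dag_run rows point at the same serialized blob"
could be modeled roughly as follows. This is an illustrative SQLAlchemy
sketch with hypothetical table and column names, not the actual Airflow
schema:

    from sqlalchemy import Column, ForeignKey, Integer, String, Text
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class SerializedDagVersion(Base):
        __tablename__ = "serialized_dag_version"
        id = Column(Integer, primary_key=True)
        dag_id = Column(String(250), nullable=False)
        version = Column(Integer, nullable=False)
        data = Column(Text, nullable=False)  # the JSON blob

    class DagRun(Base):
        __tablename__ = "dag_run"
        id = Column(Integer, primary_key=True)
        dag_id = Column(String(250), nullable=False)
        # Many runs can reference one serialized version; a new version
        # row is only written when the DAG actually changed.
        serialized_dag_version_id = Column(
            Integer, ForeignKey("serialized_dag_version.id"))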
>>>> >
>>>> > Hope this answers people's questions, and thanks for the feedback.
>>>> >
>>>> > Alex and Dan: Does this give enough information for you to change
>>>> > your vote?
>>>> >
>>>> I think as long as we combine the serialized DAG representations and
>>>> have some confidence that the data format will be easy to switch to
>>>> future potential formats we would want (kind of hard to do without a
>>>> more detailed future plan...), then I would happily change my vote.
>>>>
>>>> > -ash
>>>> >
>>>> > > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl>
>>>> > > wrote:
>>>> > >
>>>> > > Big +1 from my side, looking forward to making this happen.
>>>> > >
>>>> > > Two aspects that aren't completely clear to me:
>>>> > >
>>>> > > - Are we going to extend the existing data model, to allow the
>>>> > > RDBMS to optimize queries on fields that we use a lot?
>>>> > > - How are we going to do state evolution when we extend the JSON
>>>> > > model?
>>>> > >
>>>> > > I have good confidence that we'll solve this along the way.
>>>> > >
>>>> > > Cheers, Fokko
>>>> > >
>>>> > > On Tue, 15 Oct 2019 at 21:29, Dan Davydov
>>>> > > <ddavy...@twitter.com.invalid> wrote:
>>>> > >
>>>> > >> I have been following it from the beginning as well. I
>>>> > >> understand there would be short-term wins for some users (I
>>>> > >> don't think a huge number of users?), but I still feel like we
>>>> > >> are being a bit short-sighted here and that we are creating more
>>>> > >> work for ourselves and potentially our users in the future. I
>>>> > >> also feel like there will be side effects for users as well,
>>>> > >> many of whom don't care about the webserver scalability, such as
>>>> > >> bugs caused by the addition of the new webserver representation.
>>>> > >> I think without a design that is much larger in scope I wouldn't
>>>> > >> feel comfortable moving forward with this AIP.
>>>> > >>
>>>> > >> On Tue, Oct 15, 2019 at 3:21 PM Jarek Potiuk
>>>> > >> <jarek.pot...@polidea.com> wrote:
>>>> > >>
>>>> > >>> Hello Dan, Alex,
>>>> > >>>
>>>> > >>> I believe all the points you make are super-valid ones. But
>>>> > >>> maybe you are missing the full context a bit.
>>>> > >>>
>>>> > >>> I followed the original discussion
>>>> > >>> <https://lists.apache.org/thread.html/a2d426f93c0f4e5f0347627308638b59ca4072fd022a42af1163e34a@%3Cdev.airflow.apache.org%3E>
>>>> > >>> from the very beginning and took part in the initial
>>>> > >>> discussions when this topic was raised. From the discussion it
>>>> > >>> is quite clear to me that this is mostly a "tactical" approach
>>>> > >>> to implement something that is backportable to 1.10 and rather
>>>> > >>> quick to implement. This is targeted at making users happier
>>>> > >>> with their 1.10 version without the timing uncertainty and
>>>> > >>> effort of migration to 2.0. It solves the major pain point of
>>>> > >>> stability of the UI in case there are complex DAGs for which
>>>> > >>> parsing crashes the webserver. Like in "being nice to your
>>>> > >>> users".
>>>> > >>>
>>>> > >>> There will be a separate effort to make pretty much all of the
>>>> > >>> things you mentioned in 2.0, in a non-backportable way, as it
>>>> > >>> requires far too many changes in the way Airflow works
>>>> > >>> internally.
>>>> > >>>
>>>> > >>> Maybe it needs some more explanation + a long-term plan that
>>>> > >>> follows in the AIP itself, to explain it to those who have not
>>>> > >>> followed the initial discussion, but I think it's a fully
>>>> > >>> justified change.
>>>> > >>>
>>>> > >>> J.
>>>> > >>>
>>>> > >>> On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel
>>>> > >>> <alex.guz...@airbnb.com.invalid> wrote:
>>>> > >>>
>>>> > >>>> -1 (binding)
>>>> > >>>> Good points made by Dan. We don't need to have the future plan
>>>> > >>>> implemented completely, but it would be nice to see more
>>>> > >>>> detailed notes about how this will play out in the future. We
>>>> > >>>> shouldn't walk into a system that causes more pain in the
>>>> > >>>> future. (I can't say for sure that it does, but I can't say
>>>> > >>>> that it doesn't either.) I don't think the proposal is
>>>> > >>>> necessarily wrong or bad, but I think we need some more
>>>> > >>>> detailed planning around future milestones.
>>>> > >>>>
>>>> > >>>> On Tue, Oct 15, 2019 at 12:04 PM Dan Davydov
>>>> > >>>> <ddavy...@twitter.com.invalid> wrote:
>>>> > >>>>
>>>> > >>>>> -1 (binding), this may sound a bit FUD-y but I don't feel
>>>> > >>>>> this has been thought through enough...
>>>> > >>>>>
>>>> > >>>>> Having both a SimpleDagBag representation and the JSON
>>>> > >>>>> representation doesn't make sense to me at the moment:
>>>> > >>>>> *"Quoting from Airflow code, it is “a simplified
>>>> > >>>>> representation of a DAG that contains all attributes required
>>>> > >>>>> for instantiating and scheduling its associated tasks.” It
>>>> > >>>>> does not contain enough information required by the
>>>> > >>>>> webserver."* Why not create a representation that can be used
>>>> > >>>>> by both? This is going to be a big headache to both
>>>> > >>>>> understand and work with in the codebase since it will be
>>>> > >>>>> another definition that we need to keep in sync.
>>>> > >>>>>
>>>> > >>>>> Not sure if fileloc/fileloc_hash is the right solution; the
>>>> > >>>>> long-term solution I am imagining has clients responsible for
>>>> > >>>>> uploading DAGs rather than retrieving them from the
>>>> > >>>>> filesystem, so fileloc/fileloc_hash wouldn't even exist
>>>> > >>>>> (dag_id would be used for indexing here).
>>>> > >>>>>
>>>> > >>>>> Versioning isn't really addressed either (e.g. if a DAG
>>>> > >>>>> topology changes with some update, you want to be able to
>>>> > >>>>> show both the old and new ones, or at least have a way to
>>>> > >>>>> deal with them). There is an argument that this is acceptable
>>>> > >>>>> since it isn't addressed now, but I'm worried that large
>>>> > >>>>> schema changes should think through the long-term plan a bit
>>>> > >>>>> more.
>>>> > >>>>>
>>>> > >>>>> I feel like getting this wrong is going to make it very hard
>>>> > >>>>> to migrate things in the future, and make the codebase worse
>>>> > >>>>> (another representation of DAGs that we need to
>>>> > >>>>> support/understand/keep parity for). If I'm wrong about this
>>>> > >>>>> then I would be more willing to +1 this change. This doc is a
>>>> > >>>>> 1-2 pager and I feel like it is not thorough or deep enough
>>>> > >>>>> and doesn't give me enough confidence that the work in the PR
>>>> > >>>>> is going to make it easier to complete the future milestones
>>>> > >>>>> instead of harder.
>>>> > >>>>>
>>>> > >>>>> On Tue, Oct 15, 2019 at 11:26 AM Kamil Breguła
>>>> > >>>>> <kamil.breg...@polidea.com> wrote:
>>>> > >>>>>
>>>> > >>>>>> +1 (binding)
>>>> > >>>>>>
>>>> > >>>>>> On Tue, Oct 15, 2019 at 2:57 AM Kaxil Naik
>>>> > >>>>>> <kaxiln...@gmail.com> wrote:
>>>> > >>>>>>
>>>> > >>>>>>> Hello, Airflow community,
>>>> > >>>>>>>
>>>> > >>>>>>> This email calls for a vote to add the DAG Serialization
>>>> > >>>>>>> feature at https://github.com/apache/airflow/pull/5743.
>>>> > >>>>>>>
>>>> > >>>>>>> *AIP*:
>>>> > >>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
>>>> > >>>>>>>
>>>> > >>>>>>> *Previous Mailing List discussion*:
>>>> > >>>>>>> https://lists.apache.org/thread.html/65d282368e0a7c19815badb8b1c6c8d72b0975ce94f601e13af44f74@%3Cdev.airflow.apache.org%3E
>>>> > >>>>>>>
>>>> > >>>>>>> *Authors*: Kaxil Naik, Zhou Fang, Ash Berlin-Taylor
>>>> > >>>>>>>
>>>> > >>>>>>> *Summary*:
>>>> > >>>>>>>
>>>> > >>>>>>> - DAGs are serialized using the JSON format and stored in a
>>>> > >>>>>>> SerializedDag table.
>>>> > >>>>>>> - The Webserver, instead of having to parse the DAG file
>>>> > >>>>>>> again, now reads the serialized DAGs in JSON, de-serializes
>>>> > >>>>>>> them, creates the DagBag, and uses it to show the DAGs in
>>>> > >>>>>>> the UI.
>>>> > >>>>>>> - Instead of loading an entire DagBag when the Webserver
>>>> > >>>>>>> starts, we only load each DAG on demand from the Serialized
>>>> > >>>>>>> Dag table. This helps reduce Webserver startup time and
>>>> > >>>>>>> memory. The reduction is notable when you have a large
>>>> > >>>>>>> number of DAGs.
>>>> > >>>>>>> - A JSON Schema has been defined, and we validate the
>>>> > >>>>>>> serialized DAG before writing it to the database.
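The JSON Schema validation mentioned in the last bullet could look roughly
like this, using the jsonschema library. The schema shown is a toy
stand-in; the real schema in the PR is far more complete:

    import json
    import jsonschema

    # Toy stand-in for the schema shipped with the PR.
    SCHEMA = {
        "type": "object",
        "required": ["__version", "dag"],
        "properties": {
            "__version": {"type": "integer"},
            "dag": {"type": "object"},
        },
    }

    def validate_serialized_dag(json_str):
        """Raise jsonschema.ValidationError if the blob is malformed."""
        jsonschema.validate(instance=json.loads(json_str), schema=SCHEMA)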
>>>> > >>>>>>>
>>>> > >>>>>>> [image: image.png]
>>>> > >>>>>>>
>>>> > >>>>>>> A PR (https://github.com/apache/airflow/pull/5743) is ready
>>>> > >>>>>>> for review from the committers and community.
>>>> > >>>>>>>
>>>> > >>>>>>> We also have a WIP PR
>>>> > >>>>>>> (https://github.com/apache/airflow/pull/5992) to backport
>>>> > >>>>>>> this feature to the 1.10.* branch.
>>>> > >>>>>>>
>>>> > >>>>>>> A big thank you to Zhou and Ash for their continuous help
>>>> > >>>>>>> in improving this feature/PR.
>>>> > >>>>>>>
>>>> > >>>>>>> This email is formally calling for a vote to accept the AIP
>>>> > >>>>>>> and PR. Please note that we will update the PR / feature to
>>>> > >>>>>>> fix bugs if we find any.
>>>> > >>>>>>>
>>>> > >>>>>>> Cheers,
>>>> > >>>>>>> Kaxil
>>>> > >>>
>>>> > >>> --
>>>> > >>> Jarek Potiuk
>>>> > >>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>> > >>> M: +48 660 796 129 <+48660796129>