I'm not sure I'm convinced. I think my core concern still stands: we would be maintaining two representations without a thought-out plan for how the schema will evolve or how easy this migration will be. It feels like we are chasing some short-term wins here. But hey, it's a democracy, don't let my vote block you :)!
On Thu, Oct 17, 2019 at 10:15 AM Kaxil Naik <kaxiln...@gmail.com> wrote: > Hi Dan, I understand your concern. Your +1 and suggestions are very > important for us, so let me try to explain in more detail > and see if I can convince you. > > Please check the replies in-line. > > I think this is the kind of core change we should make from the get-go, my >> feeling is that we will be paying for this decision otherwise, and the >> cost >> will surpass the immediate value. Ideally we would fully understand the >> long-term plan before deciding on this serialization format, but I think >> at >> least having a single serialized representation for now would be a decent >> compromise for me. >> This is still a blocker to me agreeing. > > > Trying to do it all (Serialization for the Webserver, Scheduler, etc.) at once > would be error-prone, as we would need to cover all cases, and > it is very difficult to test and review such a large chunk of work even if > we have multiple people reviewing it. > > Doing it iteratively would allow us to roll out serialization while > keeping backwards compatibility. Keeping backwards compatibility for now > is very important, at least for the Webserver, as the users would otherwise have to wait for > months or at worst one more year (plus the upgrade path from 1.10.* to 2.0 is > already looking quite huge). > > Airflow 2.0 is still going to take months and is already going to have a > number of breaking changes that will make > updating to it a cumbersome task: > > - No non-RBAC UI > - No Py2 > - Import paths have been (or are in the process of being) changed for all/most > contrib objects > - CLI reorganization > - and various others: > https://github.com/apache/airflow/blob/master/UPDATING.md#airflow-master > > I am listing the above for one reason: "How long would the users have to > deal with the Webserver scalability issue" before they upgrade to 2.0! > > However, I agree with your argument that we should ideally understand the > long-term plan. 
> > The current serialized representation of a DAG looks like the following: > > https://github.com/apache/airflow/blob/a6f37cd0da0774dff6461d3fdcb94dee9d101fda/tests/serialization/test_dag_serialization.py#L40-L93 > > As you can see, most of the information required by SimpleTask is already > present. Currently, SimpleTask is used by the Scheduler. The next phase > of this AIP will be to use Serialization in the Scheduler and remove > SimpleTask altogether. > > Since you said that at least having a single serialized representation for > now would be a decent compromise, how about you suggest > or let us know if you think we are missing anything in the serialized > blob example above. > > ------------------------ > > As long as all of the changes are in the camp of "easy to migrate/undo in >> the future" then these are not blockers for me. I'm not 100% convinced >> that >> they are, but you have more context so I'll defer to your judgment. > > > Currently, as the Serialization applies only to the webserver, if > something goes wrong, changing `store_serialized_dags=False` would restore > the no-serialization behavior. > > We are not touching the Scheduler for 1.10.* as it is the actual heart of > Airflow and we wouldn't feel comfortable changing that core in a > minor/patch fix. > > ------------------------ > > I'm happy with JSON, my main point about "what if we get it wrong" is that >> I feel like more planning work could have been put into this project vs >> just a couple of rough high-level target "milestones". This significantly >> increases the risk of the project long term in my eyes. > > > Getting all the things right all at once would be harder than taking it > piece by piece. > We have discovered many things in the Webserver that we initially did not take > into account, like "Operator Links". > Even the Scheduler would have something similar which we won't know now but > find out later, which would then require re-working and going back and forth. 
> > ------------------------ > >> I think as long as we combine the serialized DAG representations and have >> some confidence that the data format will be easy to switch to future >> potential formats we would want (kind of hard to do without a more >> detailed >> future plan...), then I would happily change my vote. > > > For storing versioned DAGs, this is just a rough idea and it can change: > > We have a last_updated_time column in the Serialized Dag table. Currently > we just update that DAG's row when we process the DAG again; > instead, we would store it as a different row if the DAG has changed > and if it already had a DagRun with the previous version, i.e. > we don't need to store it every time it changes if it hasn't run in the > meantime. > > We can assign a unique number (version) to each row for a DAG. > Sorting by the `last_updated_time` column then shows us how the DAG changed over > time. > > Regards, > Kaxil > > On Wed, Oct 16, 2019 at 4:43 PM Dan Davydov <ddavy...@twitter.com.invalid> > wrote: > >> Responses inline. >> >> On Wed, Oct 16, 2019 at 6:07 AM Ash Berlin-Taylor <a...@apache.org> wrote: >> >> > Thanks for the feedback and discussion everyone - it's nice to know >> people >> > feel strongly about this. Let's make sure we build this right! >> > >> > As a reminder, the main goals of this AIP: >> > >> > - To speed up boot-up/worker recycle time of the webserver >> > >> > If you have a large number of DAGs or highly dynamic DAGs it is >> possible >> > that starting your webserver needs a Gunicorn worker timeout of 5 or 10 >> > minutes. (It is worse at start-up because each of the gunicorn workers >> > parses the DAGs itself, in exactly the same order, at the same time as >> the >> > other workers, leading to disk contention and CPU burst.) >> > >> > And just yesterday I helped two people diagnose a problem that was >> > caused by slow NFS: `time airflow list_dags` took 3:35 mins to go over >> 43 >> > dags. 
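The row-versioning idea Kaxil sketches above (overwrite the row in place unless the stored version has already been used by a DagRun, in which case insert a new row) could look roughly like the following. This is an illustrative Python sketch only; the in-memory store and all helper names here are assumptions, not actual Airflow code.

```python
# Hedged sketch of the versioned-rows idea: one row per DAG version,
# a new row added only when the DAG changed AND the current version is
# already referenced by a DagRun. Not real Airflow code.
class SerializedDagStore:
    def __init__(self):
        self.rows = {}        # dag_id -> list of (version, dag_hash)
        self.runs = set()     # (dag_id, version) pairs that have a DagRun

    def write(self, dag_id, dag_hash):
        versions = self.rows.setdefault(dag_id, [])
        if not versions:
            versions.append((1, dag_hash))
            return
        version, current_hash = versions[-1]
        if current_hash == dag_hash:
            return  # DAG unchanged: nothing to store
        if (dag_id, version) in self.runs:
            # Old version is referenced by a run: keep it for history.
            versions.append((version + 1, dag_hash))
        else:
            # No run used this version yet: safe to overwrite in place.
            versions[-1] = (version, dag_hash)

store = SerializedDagStore()
store.write("my_dag", "hash-a")
store.runs.add(("my_dag", 1))      # a DagRun used version 1
store.write("my_dag", "hash-b")    # DAG changed -> new row, version 2
print(store.rows["my_dag"])        # [(1, 'hash-a'), (2, 'hash-b')]
```

Sorting the rows by version (or by `last_updated_time` in the real table) then gives the DAG's history over time, as described above.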
>> > >> > At Astronomer, and at Google (Cloud Composer), we've seen this slow >> loading >> > many times for our customers' installs. >> > >> > And not to mention this makes the webserver snappier. >> > >> > - To make the webserver almost "stateless" (not the best term for it: >> > Diskless perhaps? DagBag-less?). >> > >> > This makes actions on the webserver quicker. We had a number of PRs in >> > the 1.10 series that made views and API endpoints in the webserver >> quicker >> > by making them only load one DAG from disk. This takes it further and >> means >> > that (as of the PR) only the Rendered tab needs to go to the code on >> disk. >> > As a reminder, the "trigger_dag" endpoint would previously load all DAGs. >> > Others still do. Multiple people have complained about this on Slack. >> > >> > (The only remaining view that has to hit the code on disk is the >> > Rendered template tab, as right now a DAG can define custom macros or >> > filters, and we don't want to have to deal with serializing code. The >> only >> > "real" way of doing it is CloudPickle. No thank you.) >> > >> > >> > To address the points: >> > >> > >> > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl> >> wrote: > > - Are we going to extend the existing data model, to allow the RDBMS >> to >> > > optimize queries on fields that we use a lot? >> > >> > Yes, as needed. Not really needed for this PR yet (it doesn't change any >> > query patterns in the webserver, so they are using the existing tables). >> > >> > > - How are we going to do state evolution when we extend the JSON model? >> > > >> > >> > The top-level object we're storing has a __version field (for example >> > `{"__version": 1, "dag": { ... } }`) so we can detect older versions and >> > either run a db "migration" on install, or upgrade at load time. 
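The `__version` scheme Ash describes can be sketched roughly like this. A minimal illustration only, assuming a chain of per-version upgrade functions; names like `MIGRATIONS` and `migrate_v1_to_v2` (and the `timezone` field it adds) are hypothetical, not Airflow identifiers.

```python
import json

# Hypothetical sketch of version-aware deserialization with in-memory
# upgrades. The migration and field names are illustrative only.
CURRENT_VERSION = 2

def migrate_v1_to_v2(dag_data):
    # Example migration: fill in a field assumed to be added in v2.
    dag_data.setdefault("timezone", "UTC")
    return dag_data

MIGRATIONS = {1: migrate_v1_to_v2}

def deserialize(blob):
    doc = json.loads(blob)
    version = doc.get("__version", 1)
    dag_data = doc["dag"]
    # Upgrade older records in memory, one version step at a time.
    while version < CURRENT_VERSION:
        dag_data = MIGRATIONS[version](dag_data)
        version += 1
    return dag_data

blob = '{"__version": 1, "dag": {"dag_id": "example"}}'
print(deserialize(blob))  # → {'dag_id': 'example', 'timezone': 'UTC'}
```

The same per-version functions could equally be run as a one-off db "migration" on install rather than at load time, as the email notes.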
>> > >> > >> > >> > On 15 Oct 2019, at 20:04, Dan Davydov <ddavy...@twitter.com.invalid> >> > wrote: >> > > Having both a SimpleDagBag representation and the JSON representation >> > > doesn't make sense to me at the moment: *"**Quoting from Airflow >> code, it >> > > is “a simplified representation of a DAG that contains all attributes >> > > required for instantiating and scheduling its associated tasks.”. It >> does >> > > not contain enough information required by the webserver.". *Why not >> > create >> > > a representation that can be used by both? This is going to be a big >> > > headache to both understand and work with in the codebase since it >> will >> > be >> > > another definition that we need to keep in sync. >> > >> > Honestly: because that makes the change bigger, harder to review and >> > harder to backport. If this is a serious blocker to you agreeing we can >> > update it to use the same structure as part of this PR, but I would >> rather >> > it was a separate change. >> > >> I think this is the kind of core change we should make from the get-go; my >> feeling is that we will be paying for this decision otherwise, and the >> cost >> will surpass the immediate value. Ideally we would fully understand the >> long-term plan before deciding on this serialization format, but I think >> at >> least having a single serialized representation for now would be a decent >> compromise for me. >> >> This is still a blocker to me agreeing. >> >> >> > >> > > Not sure if fileloc/fileloc_hash is the right solution; the long-term >> > > solution I am imagining has clients responsible for uploading DAGs >> rather >> > > than retrieving them from the filesystem so fileloc/fileloc_hash >> wouldn't >> > > even exist (dag_id would be used for indexing here). >> > >> > Then we can remove it later, when that code exists. For now, so much of >> the >> > rest of the code base assumes DAGs come from files on disk. 
>> > >> > >> > >> > > Versioning isn't really addressed either (e.g. if a DAG topology >> changes >> > > with some update you want to be able to show both the old and new >> ones, >> > or >> > > at least have a way to deal with them); there is an argument that >> this is >> > > acceptable since it isn't currently addressed now, but I'm worried >> that >> > > large schema changes should think through the long-term plan a bit >> more. >> > >> > Correct, and this is by design, to make the change smaller and easier to >> > review. Adding versioning using the existing serialization format is not >> > that much work once the groundwork is here and the webserver is not >> > relying on DAGs from disk. There's also a UI question of how to handle >> this >> > on the Tree view that I don't have an immediate answer for. But yes, >> > versioning of DAG run history was on our minds when working on this PR. >> > >> > As long as all of the changes are in the camp of "easy to migrate/undo >> in >> the future" then these are not blockers for me. I'm not 100% convinced >> that >> they are, but you have more context so I'll defer to your judgment. >> >> >> > > I feel like getting this wrong is going to make it very hard to >> migrate >> > > things in the future, and make the codebase worse (another >> representation >> > > of DAGs that we need to support/understand/keep parity for). If I'm >> wrong >> > > about this then I would be more willing to +1 this change. This doc >> is a >> > > 1-2 pager and I feel like it is not thorough or deep enough and >> doesn't >> > > give me enough confidence that the work in the PR is going to make it >> > > easier to complete the future milestones instead of harder. >> > >> > "But what if we get it wrong" is not something we can act on. 
So long as >> > you are happy with the choice of JSON as the serialization, then since >> we've >> > got a version in the record we can change this relatively easily in the >> > future - we'll know when we're dealing with an older record (we won't >> have >> > to guess) and can write a format migrator (either to upgrade in memory, or >> > to upgrade the record in place, as a db "migration" or as an upgrade >> > command). >> > >> I'm happy with JSON; my main point about "what if we get it wrong" is that >> I feel like more planning work could have been put into this project vs >> just a couple of rough high-level target "milestones". This significantly >> increases the risk of the project long term in my eyes. >> >> >> > >> > We've based our serialization on Uber's Piper work >> > <https://eng.uber.com/managing-data-workflows-at-scale/> (we had a call >> > with Alex at Uber before starting any of this work) so we've piggy-backed >> > off their hard work and months of running something very close to this >> > format in production. >> >> >> > >> This is great to hear! >> >> >> > Also let's not forget: right now we aren't storing versions, and we >> still >> > have DAGs on disk. If we do get it massively wrong (which I don't think >> we >> > have) we can just drop this entire table and start again. This is all an >> > internal implementation detail that is not exposed via an API. What >> better >> > way to see if this is right than by running it at scale in production >> > across 1000s of customers (Astronomer + Cloud Composer)? >> > >> If we have thought through future milestones and made sure that we can >> easily switch to the different formats that could be required for them, then I'm >> not a blocker on this front. >> >> >> > Can you be more specific about what you are worried about here? It's a bit >> hard >> > to put general "but what if we get it wrong" fears to rest. 
>> > >> > >> > >> > On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel <alex.guz...@airbnb.com >> > .invalid> >> > > We don't need to have the future plan implemented >> > > completely but it would be nice to see more detailed notes about how >> this >> > > will play out in the future. >> > >> > Fair point. Our rough long-term plan: >> > >> > - Add versioning to the UI, likely by storing a new version linked from >> > the DagRun (but only when it changes, not every time, to avoid DB bloat - >> > multiple dag_run rows will point at the same serialized blob). >> > >> > - Rework the scheduler to use the serialized version, not the Simple* >> > version, assuming we can do it and keep the scheduler performing well >> > (i.e. that ser/deser time isn't significant for the scheduler loop). >> > >> > >> > - Massively rework how the scheduler creates TaskInstances (right now this >> is >> > done in the DAG parsing process, not the main scheduler) and dag runs. We >> > have to keep an eye on scheduler performance as we do this. If we can >> pull >> > this back into the main scheduler then this leads us towards the next >> > points: >> > >> > - Split out the updating of serialized DAGs/parsing of DAGs from the >> > scheduler loop so it can be a separate component/subprocess (which is >> on by >> > default). This gets us towards being able to submit DAG definitions via >> an >> > API if we wanted. (There's a lot to work out to get that working, mostly >> > around running the DAG.) >> > >> > - Once the scheduler loop/logic is not tied to parsing the DAGs then >> this >> > also heads us towards being able to run multiple schedulers. (This >> isn't a >> > requirement for an HA scheduler, but it makes them easier to "scale out".) >> > >> > Hope this answers people's questions, and thanks for the feedback. >> > >> > Alex and Dan: does this give enough information for you to change your >> > vote? 
>> > >> I think as long as we combine the serialized DAG representations and have >> some confidence that the data format will be easy to switch to future >> potential formats we would want (kind of hard to do without a more >> detailed >> future plan...), then I would happily change my vote. >> >> >> > -ash >> > >> > > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl> >> > wrote: >> > > >> > > Big +1 from my side, looking forward to making this happen. >> > > >> > > Two sides that aren't completely clear to me: >> > > >> > > - Are we going to extend the existing data model, to allow the >> RDBMS to >> > > optimize queries on fields that we use a lot? >> > > - How are we going to do state evolution when we extend the JSON >> model? >> > > >> > > I have good confidence that we'll solve this along the way. >> > > >> > > Cheers, Fokko >> > > >> > > On Tue, Oct 15, 2019 at 21:29, Dan Davydov >> > > <ddavy...@twitter.com.invalid> wrote: >> > > >> > >> I have been following it from the beginning as well. I understand >> there >> > >> would be short-term wins for some users (I don't think a huge number >> of >> > >> users?), but I still feel like we are being a bit short-sighted here >> and >> > >> that we are creating more work for ourselves and potentially our >> users >> > in >> > >> the future. I also feel like there will be side effects for users as >> > well, >> > >> many of whom don't care about webserver scalability, such as >> bugs >> > >> caused by the addition of the new webserver representation. I think >> > without >> > >> a design that is much larger in scope I wouldn't feel comfortable >> moving >> > >> forward with this AIP. >> > >> >> > >> On Tue, Oct 15, 2019 at 3:21 PM Jarek Potiuk < >> jarek.pot...@polidea.com> >> > >> wrote: >> > >> >> > >>> Hello Dan, Alex, >> > >>> >> > >>> I believe all the points you make are super-valid ones. But maybe >> you >> > are >> > >>> missing the full context a bit. 
>> > >>> >> > >>> I followed the original discussion >> > >>> <https://lists.apache.org/thread.html/a2d426f93c0f4e5f0347627308638b59ca4072fd022a42af1163e34a@%3Cdev.airflow.apache.org%3E> >> > >>> from the very beginning and took part in the initial discussions >> when >> > >> this >> > >>> topic was raised. From the discussion it is quite clear to me that >> this is >> > >>> mostly a "tactical" approach to implement something that is >> > backportable >> > >> to >> > >>> 1.10 and rather quick to implement. This is targeted at making users >> more >> > >>> happy with their 1.10 version without the timing uncertainty and >> effort >> > >> of >> > >>> migration to 2.0. It solves the major pain point of the stability of >> the UI >> > >> in >> > >>> case there are complex DAGs for which parsing crashes the webserver. >> > >>> Like in "being nice to your users". >> > >>> >> > >>> There will be a separate effort to make pretty much all of the >> things >> > you >> > >>> mentioned in 2.0 in a non-backportable way, as it requires far too >> many >> > >>> changes in the way Airflow works internally. >> > >>> >> > >>> Maybe it needs some more explanation + a long-term plan that follows >> in >> > the >> > >>> AIP itself to explain it to those who have not followed the initial >> > >>> discussion, but I think it's a fully justified change. >> > >>> >> > >>> J. >> > >>> >> > >>> On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel <alex.guz...@airbnb.com >> > >>> .invalid> >> > >>> wrote: >> > >>> >> > >>>> -1 (binding) >> > >>>> Good points made by Dan. We don't need to have the future plan >> > >>> implemented >> > >>>> completely but it would be nice to see more detailed notes about >> how >> > >> this >> > >>>> will play out in the future. We shouldn't walk into a system that >> > >> causes >> > >>>> more pain in the future. (I can't say for sure that it does, but I >> > >> can't >> > >>>> say that it doesn't either). 
I don't think the proposal is >> necessarily >> > >>>> wrong or bad, but I think we need some more detailed planning >> around >> > >>> future >> > >>>> milestones. >> > >>>> >> > >>>> On Tue, Oct 15, 2019 at 12:04 PM Dan Davydov >> > >>> <ddavy...@twitter.com.invalid >> > >>>>> >> > >>>> wrote: >> > >>>> >> > >>>>> -1 (binding), this may sound a bit FUD-y but I don't feel this has >> > >> been >> > >>>>> thought through enough... >> > >>>>> >> > >>>>> Having both a SimpleDagBag representation and the JSON >> representation >> > >>>>> doesn't make sense to me at the moment: *"**Quoting from Airflow >> > >> code, >> > >>> it >> > >>>>> is “a simplified representation of a DAG that contains all >> attributes >> > >>>>> required for instantiating and scheduling its associated tasks.”. >> It >> > >>> does >> > >>>>> not contain enough information required by the webserver.". *Why >> not >> > >>>> create >> > >>>>> a representation that can be used by both? This is going to be a >> big >> > >>>>> headache to both understand and work with in the codebase since it >> > >> will >> > >>>> be >> > >>>>> another definition that we need to keep in sync. >> > >>>>> >> > >>>>> Not sure if fileloc/fileloc_hash is the right solution; the >> long-term >> > >>>>> solution I am imagining has clients responsible for uploading DAGs >> > >>> rather >> > >>>>> than retrieving them from the filesystem so fileloc/fileloc_hash >> > >>> wouldn't >> > >>>>> even exist (dag_id would be used for indexing here). >> > >>>>> >> > >>>>> Versioning isn't really addressed either (e.g. if a DAG topology >> > >>> changes >> > >>>>> with some update you want to be able to show both the old and new >> > >> ones, >> > >>>> or >> > >>>>> at least have a way to deal with them); there is an argument that >> > >> this >> > >>> is >> > >>>>> acceptable since it isn't currently addressed now, but I'm worried >> > >> that >> > >>>>> large schema changes should think through the long-term plan a bit >> > >>> more. 
>> > >>>>> >> > >>>>> I feel like getting this wrong is going to make it very hard to >> > >> migrate >> > >>>>> things in the future, and make the codebase worse (another >> > >>> representation >> > >>>>> of DAGs that we need to support/understand/keep parity for). If >> I'm >> > >>> wrong >> > >>>>> about this then I would be more willing to +1 this change. This >> doc >> > >> is >> > >>> a >> > >>>>> 1-2 pager and I feel like it is not thorough or deep enough and >> > >> doesn't >> > >>>>> give me enough confidence that the work in the PR is going to >> make it >> > >>>>> easier to complete the future milestones instead of harder. >> > >>>>> >> > >>>>> On Tue, Oct 15, 2019 at 11:26 AM Kamil Breguła < >> > >>>> kamil.breg...@polidea.com> >> > >>>>> wrote: >> > >>>>> >> > >>>>>> +1 (binding) >> > >>>>>> >> > >>>>>> On Tue, Oct 15, 2019 at 2:57 AM Kaxil Naik <kaxiln...@gmail.com> >> > >>>> wrote: >> > >>>>>> >> > >>>>>>> Hello, Airflow community, >> > >>>>>>> >> > >>>>>>> This email calls for a vote to add the DAG Serialization feature >> > >> at >> > >>>>>>> https://github.com/apache/airflow/pull/5743. >> > >>>>>>> >> > >>>>>>> *AIP*: >> > >>>>>>> >> > >>>>>> >> > >>>>> >> > >>>> >> > >>> >> > >> >> > >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler >> > >>>>>>> >> > >>>>>>> *Previous Mailing List discussion*: >> > >>>>>>> >> > >>>>>> >> > >>>>> >> > >>>> >> > >>> >> > >> >> > >> https://lists.apache.org/thread.html/65d282368e0a7c19815badb8b1c6c8d72b0975ce94f601e13af44f74@%3Cdev.airflow.apache.org%3E >> > >>>>>>> . 
>> > >>>>>>> >> > >>>>>>> *Authors*: Kaxil Naik, Zhou Fang, Ash Berlin-Taylor >> > >>>>>>> >> > >>>>>>> *Summary*: >> > >>>>>>> >> > >>>>>>> - DAGs are serialized using the JSON format and stored in a >> > >>>>> SerializedDag >> > >>>>>>> table >> > >>>>>>> - The Webserver now, instead of having to parse the DAG file >> > >>> again, >> > >>>>>>> reads the serialized DAGs in JSON, de-serializes them, >> > >>> creates >> > >>>>> the >> > >>>>>>> DagBag and uses it to render the UI. >> > >>>>>>> - Instead of loading an entire DagBag when the Webserver >> > >> starts, >> > >>> we >> > >>>>>>> only load each DAG on demand from the Serialized Dag table. >> > >> This >> > >>>>> helps >> > >>>>>>> reduce Webserver start-up time and memory. The reduction is >> > >>> notable >> > >>>>>> when you >> > >>>>>>> have a large number of DAGs. >> > >>>>>>> - A JSON Schema has been defined and we validate the >> > >> serialized >> > >>>> DAG >> > >>>>>>> before writing it to the database >> > >>>>>>> >> > >>>>>>> A PR (https://github.com/apache/airflow/pull/5743) is ready for >> > >>>> review >> > >>>>>>> by the committers and community. >> > >>>>>>> >> > >>>>>>> We also have a WIP PR ( >> https://github.com/apache/airflow/pull/5992 >> > >>> ) >> > >>>> to >> > >>>>>>> backport this feature to the 1.10.* branch. >> > >>>>>>> >> > >>>>>>> A big thank you to Zhou and Ash for their continuous help in >> > >>>> improving >> > >>>>>>> this feature/PR. >> > >>>>>>> >> > >>>>>>> This email is formally calling for a vote to accept the AIP and >> PR. >> > >>>>>> Please >> > >>>>>>> note that we will update the PR/feature to fix bugs if we find >> > >>> any. 
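The validate-before-write step in the summary above could look roughly like the following. This is a minimal stand-in only: the real feature uses a full JSON Schema, and the required keys checked here are assumptions chosen for illustration, not Airflow's actual schema.

```python
import json

# Illustrative sketch of validating a serialized DAG before writing it
# to the database. The required keys below are assumptions.
REQUIRED_DAG_KEYS = {"dag_id", "fileloc", "tasks"}

def is_valid_serialized_dag(doc):
    """Return True only if the document has a version field and a DAG
    object containing all required keys."""
    if not isinstance(doc, dict) or "__version" not in doc:
        return False
    dag = doc.get("dag")
    return isinstance(dag, dict) and REQUIRED_DAG_KEYS <= dag.keys()

blob = '{"__version": 1, "dag": {"dag_id": "d", "fileloc": "/dags/d.py", "tasks": []}}'
print(is_valid_serialized_dag(json.loads(blob)))  # True
```

In practice this gate would run in the DAG-processing path, so a malformed blob is rejected before it can ever reach the Webserver.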
>> > >>>>>>> >> > >>>>>>> Cheers, >> > >>>>>>> Kaxil >> > >>> >> > >>> -- >> > >>> >> > >>> Jarek Potiuk >> > >>> Polidea <https://www.polidea.com/> | Principal Software Engineer >> > >>> >> > >>> M: +48 660 796 129 >> >