I agree with what Jarek is saying. And if just for testing a user wants to see what his template field would look like they can used airflow render from the cli
On Thu, Oct 24, 2019 at 3:19 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote: > Great work! > > I also think it's a pragmatic and reasonable approach to add it to 1.10 and > work on improved version for 2.0 (and have it as the only option). > This way we do not really add unnecessary complexity - we will just improve > the serialized implementation in our 2.0 effort, we close all the gaps we > know now and get rid of SimpleDag. > > Storing rendered template seems to be a good idea. > > But one question here - maybe I do not understand it - why do we even > attempt to render the templates in UI in the tasks that have not run yet? > Does it really make sense? > We do not know if what we have in the UI will have the actual values that > will be used during the run. Rendering it in the UI when they have not been > rendered yet might give different values than those actually used during > task run. > > For me It would make perfect sense to keep the unrendered template before > the task runs (and mark it as such) and rendered one (from the DB as > suggested by Ash) after it runs. > > Or am I missing something? > > J. > > On Thu, Oct 24, 2019 at 4:01 PM Driesprong, Fokko <fo...@driesprong.frl> > wrote: > > > Thank you, > > > > I fine with adding it to 1.10 as the current PR where is it an option, > and > > for now, the current behavior would be the old one, until smooth out all > > the issues. > > > > For 2.0 I'd love to see it as the only option so we can remove the logic > > and code of the SimpleDag to keep the codebase nice and clean. > > > > I like your thinking out loud Ash, and I think that should be a first > > approach to storing the rendered templates. > > > > Cheers, Fokko > > > > Op do 24 okt. 2019 om 12:05 schreef Ash Berlin-Taylor <a...@apache.org>: > > > > > Speaking with my Astronomer hat on, we are going to include this in our > > > next 1.10.x image as this improves the experience when deploying > changes. > > > We could not backport it to the v-1-10-test for inclusion in next > > release, > > > but the work to backport it is already done. > > > > > > Yes, making this the default (or only?) option in 2.0 sounds good to > me. > > > > > > And to re-iterate, we aren't done on this bit of work (even if the AIP > > > might be done) when this PR is merged and we will carry on working on > the > > > next stage of this work. > > > > > > Next stages of work: > > > - Work out best way off addressing > > > https://github.com/apache/airflow/pull/5743/files#r335284375 < > > > https://github.com/apache/airflow/pull/5743/files#r335284375> > (Operator > > > links that expect an Operator instance to work.) > > > - Porting scheduler to use the serialised version instead of Simple* > > > (overlaps with my performance testing+improvement work) > > > - Tackling the Rendered templates (pre baking? something else?) > > > > > > > > > Thinking about the rendering: the only reason we can't render the > strings > > > on the webserver is that there _might_ be local user defined macros or > > > filters for the dag. But in the case where there isn't (most of them) > we > > > could possibly still render the template in the webserver. For tasks > that > > > have already run (those that have a TaskInstance row) maybe we could > > store > > > the rendered version as executed in there? Just thinking out loud here. > > > > > > -ash > > > > > > > > > > On 24 Oct 2019, at 10:56, Driesprong, Fokko <fo...@driesprong.frl> > > > wrote: > > > > > > > > Awesome work all, > > > > > > > > I would be more in the opposite direction. Maybe we should not > backport > > > > such a huge change to 1.10 and work on it for 2.0. Also, I'd like to > > see > > > it > > > > as being default in 2.0, having it as an option will introduce > another > > > > permutation in the configuration and this will make the whole thing > > more > > > > complex to maintain, and also te debug. > > > > > > > > I would personally not be worried too much with the pre-rendering. As > > > long > > > > as wel make sure that at execution time, the thing gets rendered > > > (including > > > > xcom at that time), and then save it again for reference. So you can > > > always > > > > trace what's being executed, but before execution, it might be > > incorrect > > > > (because the xcom etc is not properly being populated). > > > > > > > > Cheers, Fokko > > > > > > > > > > > > Op do 24 okt. 2019 om 11:51 schreef Kevin Yang <yrql...@gmail.com>: > > > > > > > >> Just saw the latest email from Kaxil and want to clarify. I was > > replying > > > >> the previous posts and making general comments instead of trying to > > > block > > > >> the PR from being merged. I'm fine with merging it and improve > > > >> incrementally. It is quite polished and already huge now, I can > > totally > > > >> relate to the rebase pain. > > > >> > > > >> > > > >> Cheers, > > > >> Kevin Y > > > >> > > > >> On Thu, Oct 24, 2019 at 2:40 AM Kevin Yang <yrql...@gmail.com> > wrote: > > > >> > > > >>> Oops, I guess I'm too slow reading the PRs to let me vote in time > 😛 > > > >>> Congrats Zhou, Kaxil and Ash for getting it accepted 🎉 > > > >>> > > > >>> I'm pretty excited about this long-anticipated change and thank you > > > guys > > > >>> for the great work. Though I do have a few comments. > > > >>> > > > >>> I agree with Dan's concern about maintaining two representations > > would > > > >>> potentially create a lot pain in the future. If the scheduler > rework > > > >>> mentioned in Ash's rough long term plan can be a more immediate > > action, > > > >>> doesn't matter if it is in the same PR, then I won't be so worried. > > > >>> > > > >>> With records in this new separate table, I agree the cost of > > > >>> rollback/upgrade won't be too much of a headache. We have those > > > >> approaches > > > >>> Ash mentioned or can even create another table and let them coexist > > > for a > > > >>> while if the change is dramastic. > > > >>> > > > >>> The added DB load seems pretty negligible. But just to be extra > safe, > > > >>> we'll try to test it within our cluster, where DB load is at the > > > boarder > > > >>> line, and report back. > > > >>> > > > >>> About the template rendering, I think we should still try to > > pre-render > > > >> it > > > >>> and persist it into the DB. Rendered is a very frequently accessed > > > >> endpoint > > > >>> and for large DAG files( we have DAG files taking multiple mins to > > > parse) > > > >>> it would take too long to load, block the webserver workers if we > got > > > too > > > >>> many rendered endpoint requests and create quite some confusions > > given > > > >> that > > > >>> we act differently on different endpoints. In AIP-17, we pre-render > > the > > > >>> template when creating the task instances where we have > > execution_date. > > > >> If > > > >>> XCOM is the blocker, we can maybe pre-render everything else > besides > > > the > > > >>> XCOM and render the XCOM part on the fly. Since people can access > the > > > >>> rendered endpoint before upstream tasks finished, we'll probly have > > to > > > >>> exclude XCOM in pre-rendering anyway. I would be very happy if this > > can > > > >> be > > > >>> a high priority follow up of the AIP. > > > >>> > > > >>> > > > >>> Cheers, > > > >>> Kevin Y > > > >>> > > > >>> On Wed, Oct 23, 2019 at 3:39 PM Kaxil Naik <kaxiln...@gmail.com> > > > wrote: > > > >>> > > > >>>> This vote passed (although not unanimous) and I'll mark this AIP > as > > > >>>> accepted. > > > >>>> > > > >>>> *Result*: > > > >>>> +1 votes: 7 (6 binding and 1 non-binding vote) > > > >>>> -1 votes: 2 (2 binding and 0 non-binding votes) > > > >>>> > > > >>>> *+1 (binding)*: > > > >>>> Kaxil Naik > > > >>>> Ash-Berlin Taylor > > > >>>> Jarek Potiuk > > > >>>> Kamil Breguła > > > >>>> Fokko Driesprong > > > >>>> Sumit Maheshwari > > > >>>> > > > >>>> *+1 (non-binding):* > > > >>>> Philippe Gagnon > > > >>>> > > > >>>> *-1 (binding)*: > > > >>>> Dan Davydov > > > >>>> Alex Guziel > > > >>>> > > > >>>> > > > >>>> > > > >>>> On Thu, Oct 17, 2019 at 5:29 PM Dan Davydov <ddavy...@twitter.com > > > > > >> wrote: > > > >>>> > > > >>>>> Not sure I'm convinced, I think my core concern of maintaining > two > > > >>>>> representations and not having a thought-out future plan of how > the > > > >>>> schema > > > >>>>> will evolve/how easy this migration will be still stands. It > feels > > > >> like > > > >>>> we > > > >>>>> are trying to chase some short-term wins here. But hey it's a > > > >> democracy, > > > >>>>> don't let my vote block you :)! > > > >>>>> > > > >>>>> On Thu, Oct 17, 2019 at 10:15 AM Kaxil Naik <kaxiln...@gmail.com > > > > > >>>> wrote: > > > >>>>> > > > >>>>>> Hi Dan, I understand your concern. Your +1 and suggestions are > > very > > > >>>>>> important for us so let me try to explain in more details > > > >>>>>> and see if I can convince you. > > > >>>>>> > > > >>>>>> Please check replies in-line > > > >>>>>> > > > >>>>>> I think this is the kind of core change we should make from the > > > >>>> get-go, my > > > >>>>>>> feeling is that we will be paying for this decision otherwise, > > and > > > >> the > > > >>>>>>> cost > > > >>>>>>> will surpass the immediate value. Ideally we would fully > > understand > > > >>>> the > > > >>>>>>> long-term plan before deciding on this serialization format, > but > > I > > > >>>> think > > > >>>>>>> at > > > >>>>>>> least having a single serialized representation for now would > be > > a > > > >>>> decent > > > >>>>>>> compromise for me. > > > >>>>>>> This is still a blocker to me agreeing. > > > >>>>>> > > > >>>>>> > > > >>>>>> Trying to do all (Serialization for Webserver, Scheduler, etc) > at > > > >> once > > > >>>>>> would be error prone as we would need to cover all cases as well > > and > > > >>>>>> it is very difficult to test and review such a large chunk of > work > > > >> even > > > >>>>>> if we have multiple people reviewing it. > > > >>>>>> > > > >>>>>> Doing it iteratively would allow us to roll out serialisation > > while > > > >>>>>> keeping backwards compatibility. Keeping backwards compatibility > > for > > > >>>> now > > > >>>>>> is very important atleast for Websever as the users would have > to > > > >> wait > > > >>>>>> for months or at-worse 1 more year (plus the upgrade path to > 1.10* > > > to > > > >>>> 2.0 is > > > >>>>>> already looking quite huge). > > > >>>>>> > > > >>>>>> Airflow 2.0 is still going to take months and is already going > to > > > >> have > > > >>>>>> number of breaking changes that is going to make > > > >>>>>> updating to it cumbersome task: > > > >>>>>> > > > >>>>>> - No non-RBAC UI > > > >>>>>> - No Py2 > > > >>>>>> - Import paths have (or in the process of being) changed from > > > >> all/most > > > >>>>>> contrib objects > > > >>>>>> - CLI reorganization > > > >>>>>> - and various others: > > > >>>>>> > > > >>>> > > > >> > > > > https://github.com/apache/airflow/blob/master/UPDATING.md#airflow-master > > > >>>>>> > > > >>>>>> I am listing the above for 1 reason - "How long would the users > > have > > > >> to > > > >>>>>> deal with Webserver scalability issue" before they upgrade to > 2.0 > > ! > > > >>>>>> > > > >>>>>> However, I agree with your argument that we should ideally > > > understand > > > >>>> the > > > >>>>>> long-term plan. > > > >>>>>> > > > >>>>>> The current serialized representation of DAG looks like the > > > >> following: > > > >>>>>> > > > >>>>>> > > > >>>> > > > >> > > > > > > https://github.com/apache/airflow/blob/a6f37cd0da0774dff6461d3fdcb94dee9d101fda/tests/serialization/test_dag_serialization.py#L40-L93 > > > >>>>>> > > > >>>>>> As you can see most of the information required by SimpleTask is > > > >>>> already > > > >>>>>> present. Currently, SimpleTask is used by the Scheduler. The > next > > > >> phase > > > >>>>>> of this AIP will be to use Serialization in the Scheduler and > > > >> removing > > > >>>>>> SimpleTask altogether. > > > >>>>>> > > > >>>>>> As you said that at least having a single serialized > > representation, > > > >>>> for > > > >>>>>> now, would be a decent compromise for me, how about you suggest > > > >>>>>> or let us know if you think we are missing anything in the > > > >> serialized > > > >>>>>> blob example above. > > > >>>>>> > > > >>>>>> ------------------------ > > > >>>>>> > > > >>>>>> As long as all of the changes are in the camp of "easy to > > > >> migrate/undo > > > >>>> in > > > >>>>>>> the future" then these are not blockers for me. I'm not 100% > > > >> convinced > > > >>>>>>> that > > > >>>>>>> they are, but you have more context so I'll defer to your > > judgment. > > > >>>>>> > > > >>>>>> > > > >>>>>> Currently, as the Serialization applies only to the webserver, > if > > > >>>>>> something goes wrong, changing `store_serialized_dags=False` > would > > > >>>> restore > > > >>>>>> no-serialization behavior. > > > >>>>>> > > > >>>>>> We are not touching Scheduler for 1.10* as it is an actual Heart > > of > > > >>>>>> Airflow and wouldn't feel comfortable in changing that core in a > > > >>>>>> minor/patch fix. > > > >>>>>> > > > >>>>>> ------------------------ > > > >>>>>> > > > >>>>>> I'm happy with JSON, my main point about "what if we get it > wrong" > > > is > > > >>>> that > > > >>>>>>> I feel like more planning work could have been put into this > > > project > > > >>>> vs > > > >>>>>>> just a couple of rough high-level target "milestones". This > > > >>>> significantly > > > >>>>>>> increases the risk of the project long term in my eyes. > > > >>>>>> > > > >>>>>> > > > >>>>>> Getting all the things right all at once would be harder than > > taking > > > >> it > > > >>>>>> piece by piece. > > > >>>>>> We have discovered many things in Webserver that we initially > did > > > not > > > >>>>>> take into account like "Operator Links". > > > >>>>>> Even Scheduler would have something similar which we won't know > > now > > > >> but > > > >>>>>> find out later which would then require re-working and going > > > >>>> back-and-forth. > > > >>>>>> > > > >>>>>> ------------------------ > > > >>>>>> > > > >>>>>>> I think as long as we combine the serialized DAG > representations > > > and > > > >>>> have > > > >>>>>>> some confidence that the data format will be easy to switch to > > > >> future > > > >>>>>>> potential formats we would want (kind of hard to do without a > > more > > > >>>>>>> detailed > > > >>>>>>> future plan...), then I would happily change my vote. > > > >>>>>> > > > >>>>>> > > > >>>>>> For storing versioned dags, this is just a rough idea and it can > > > >>>> change: > > > >>>>>> > > > >>>>>> We have a last_updated_time column in the Serialized Dag table. > > > >>>> Currently > > > >>>>>> we just update that dag row when we process the Dag again, > > > >>>>>> instead, we would just store it as a different row if the DAG > has > > > >>>> changed > > > >>>>>> and if it already had a DagRun with previous version i.e > > > >>>>>> we don't need to store it every time it changes if it hasn't run > > in > > > >> the > > > >>>>>> meantime. > > > >>>>>> > > > >>>>>> We can assign a unique number (version) to each row for a DAG. > > > >>>>>> Sorting by `last_updated_time` column who gives us how DAG > updated > > > >> over > > > >>>>>> time. > > > >>>>>> > > > >>>>>> Regards, > > > >>>>>> Kaxil > > > >>>>>> > > > >>>>>> On Wed, Oct 16, 2019 at 4:43 PM Dan Davydov > > > >>>> <ddavy...@twitter.com.invalid> > > > >>>>>> wrote: > > > >>>>>> > > > >>>>>>> Responses inline. > > > >>>>>>> > > > >>>>>>> On Wed, Oct 16, 2019 at 6:07 AM Ash Berlin-Taylor < > > a...@apache.org> > > > >>>>>>> wrote: > > > >>>>>>> > > > >>>>>>>> Thanks for the feedback and discussion everyone - it's nice to > > > >> know > > > >>>>>>> people > > > >>>>>>>> feel strongly about this. Let's make sure we build this right! > > > >>>>>>>> > > > >>>>>>>> As a reminder: the main goals of this AIP: > > > >>>>>>>> > > > >>>>>>>> - To speed up bootup/worker recycle time of the webserver > > > >>>>>>>> > > > >>>>>>>> If you have a large number of DAGs or highly dynamic DAGs it > is > > > >>>>>>> possible > > > >>>>>>>> that starting your webserver needs a Gunicorn worker timeout > of > > 5 > > > >>>> or 10 > > > >>>>>>>> minutes. (It is worse at start up because each of the gunicorn > > > >>>> workers > > > >>>>>>>> parses the dag itself, in exactly the same order, at the same > > time > > > >>>> as > > > >>>>>>> the > > > >>>>>>>> other workers, leading to disk contention, and CPU burst.) > > > >>>>>>>> > > > >>>>>>>> And just yesterday I helped two people diagnose a problem > that > > > >> was > > > >>>>>>>> caused by slow NFS: > time airflow list_dags` took 3:35mins to > > go > > > >>>> over > > > >>>>>>> 43 > > > >>>>>>>> dags. > > > >>>>>>>> > > > >>>>>>>> At Astronomer, and Google (Cloud Composer) we've seen this > slow > > > >>>>>>> loading > > > >>>>>>>> many times for our customer's installs. > > > >>>>>>>> > > > >>>>>>>> And not to mention this makes the webserver snappier > > > >>>>>>>> > > > >>>>>>>> - To make the webserver almost "stateless" (not the best term > > for > > > >>>> it: > > > >>>>>>>> Diskless perhaps? Dagbag-less?). > > > >>>>>>>> > > > >>>>>>>> This makes actions on the webserver quicker. We had a number > of > > > >>>> PRs > > > >>>>>>> in > > > >>>>>>>> the 1.10 series that made views and API endpoints in the > > webserver > > > >>>>>>> quicker > > > >>>>>>>> By making them only load one dag from disk. This takes it > > further > > > >>>> and > > > >>>>>>> means > > > >>>>>>>> that (as of the PR) only the Rendered tab needs to go to the > > code > > > >> on > > > >>>>>>> disk. > > > >>>>>>>> As a reminder the "trigger_dag" endpoint would previously load > > all > > > >>>>>>> dags. > > > >>>>>>>> Other's still do. Multiple people on have complained about > this > > > >>>> Slack. > > > >>>>>>>> > > > >>>>>>>> (The only remaining view that has to hit the code on disk is > > the > > > >>>>>>>> Rendered template tab, as right now a DAG can define custom > > macros > > > >>>> or > > > >>>>>>>> filters, and we don't want to have to deal with serializing > > code. > > > >>>> The > > > >>>>>>> only > > > >>>>>>>> "real" way of doing it is CloudPickle. No thank you) > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> To address the points . > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On 15 Oct 2019, at 21:14, Driesprong, Fokko > > <fo...@driesprong.frl > > > >>> > > > >>>>>>> wrote > > > >>>>>>>>> - Are we going to extend the existing data model, to allow > the > > > >>>> RDBMS > > > >>>>>>> to > > > >>>>>>>>> optimize queries on fields that we use a lot? > > > >>>>>>>> > > > >>>>>>>> Yes, as needed. Not really needed for this PR yet (it doesn't > > > >> change > > > >>>>>>> any > > > >>>>>>>> query patterns in the webserver, so they are using the > existing > > > >>>> tables) > > > >>>>>>>> > > > >>>>>>>>> - How are we going to do state evolution when we extend the > > JSON > > > >>>>>>> model > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> The top level object we're storing has a __version field (for > > > >>>> example > > > >>>>>>>> `{"__version": 1, "dag": { ... } }`) so we can detect older > > > >> versions > > > >>>>>>> and > > > >>>>>>>> either run a db "migration" on install, or upgrade at load > time. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On 15 Oct 2019, at 20:04, Dan Davydov > > > >> <ddavy...@twitter.com.invalid > > > >>>>> > > > >>>>>>>> wrote: > > > >>>>>>>>> Having both a SimpleDagBag representation and the JSON > > > >>>> representation > > > >>>>>>>>> doesn't make sense to me at the moment: *"**Quoting from > > Airflow > > > >>>>>>> code, it > > > >>>>>>>>> is “a simplified representation of a DAG that contains all > > > >>>> attributes > > > >>>>>>>>> required for instantiating and scheduling its associated > > > >> tasks.”. > > > >>>> It > > > >>>>>>> does > > > >>>>>>>>> not contain enough information required by the webserver.". > > *Why > > > >>>> not > > > >>>>>>>> create > > > >>>>>>>>> a representation that can be used by both? This is going to > be > > a > > > >>>> big > > > >>>>>>>>> headache to both understand and work with in the codebase > since > > > >> it > > > >>>>>>> will > > > >>>>>>>> be > > > >>>>>>>>> another definition that we need to keep in sync. > > > >>>>>>>> > > > >>>>>>>> Honestly: because that makes the change bigger, harder to > review > > > >> and > > > >>>>>>>> harder to back port. If this is a serious blocker to you > > agreeing > > > >> we > > > >>>>>>> can > > > >>>>>>>> update it to use the same structure as part of this PR, but I > > > >> would > > > >>>>>>> rather > > > >>>>>>>> it was a separate change. > > > >>>>>>>> > > > >>>>>>> I think this is the kind of core change we should make from the > > > >>>> get-go, > > > >>>>>>> my > > > >>>>>>> feeling is that we will be paying for this decision otherwise, > > and > > > >> the > > > >>>>>>> cost > > > >>>>>>> will surpass the immediate value. Ideally we would fully > > understand > > > >>>> the > > > >>>>>>> long-term plan before deciding on this serialization format, > but > > I > > > >>>> think > > > >>>>>>> at > > > >>>>>>> least having a single serialized representation for now would > be > > a > > > >>>> decent > > > >>>>>>> compromise for me. > > > >>>>>>> > > > >>>>>>> This is still a blocker to me agreeing. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>>> > > > >>>>>>>>> Not sure if fileloc/fileloc_hash is the right solution, the > > > >>>> longterm > > > >>>>>>>>> solution I am imagining has clients responsible for uploading > > > >> DAGs > > > >>>>>>> rather > > > >>>>>>>>> than retrieving them from the filesystem so > > fileloc/fileloc_hash > > > >>>>>>> wouldn't > > > >>>>>>>>> even exist (dag_id would be used for indexing here). > > > >>>>>>>> > > > >>>>>>>> Then we can remove it later, when that code exists. For now so > > > >> much > > > >>>> of > > > >>>>>>> the > > > >>>>>>>> rest of the code base assumes DAGs from files on disk so. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>>> Versioning isn't really addressed either (e.g. if a DAG > > topology > > > >>>>>>> changes > > > >>>>>>>>> with some update you want to be able to show both the old and > > > >> new > > > >>>>>>> ones, > > > >>>>>>>> or > > > >>>>>>>>> at least have a way to deal with them), there is an argument > > > >> that > > > >>>>>>> this is > > > >>>>>>>>> acceptable since it isn't currently addressed now, but I'm > > > >> worried > > > >>>>>>> that > > > >>>>>>>>> large schema changes should think through the long term plan > a > > > >> bit > > > >>>>>>> more. > > > >>>>>>>> > > > >>>>>>>> Correct, and this is by design to make the change smaller and > > > >>>> easier to > > > >>>>>>>> review. Adding versioning using the existing serialization > > format > > > >> is > > > >>>>>>> not > > > >>>>>>>> that much work once the ground work is here and the webserver > is > > > >> not > > > >>>>>>>> relying on DAGs from disk. There's also a UI question of how > to > > > >>>> handle > > > >>>>>>> this > > > >>>>>>>> on the Tree view that I don't have an immediate answer for. > But > > > >> yes, > > > >>>>>>>> versioning of dag run history was in our minds when working on > > > >> this > > > >>>> PR. > > > >>>>>>>> > > > >>>>>>>> As long as all of the changes are in the camp of "easy to > > > >>>> migrate/undo > > > >>>>>>> in > > > >>>>>>> the future" then these are not blockers for me. I'm not 100% > > > >> convinced > > > >>>>>>> that > > > >>>>>>> they are, but you have more context so I'll defer to your > > judgment. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>>>> I feel like getting this wrong is going to make it very hard > to > > > >>>>>>> migrate > > > >>>>>>>>> things in the future, and make the codebase worse (another > > > >>>>>>> representation > > > >>>>>>>>> of DAGs that we need to support/understand/keep parity for). > If > > > >>>> I'm > > > >>>>>>> wrong > > > >>>>>>>>> about this then I would be more willing to +1 this change. > This > > > >>>> doc > > > >>>>>>> is a > > > >>>>>>>>> 1-2 pager and I feel like it is not thorough or deep enough > and > > > >>>>>>> doesn't > > > >>>>>>>>> give me enough confidence that the work in the PR is going to > > > >>>> make it > > > >>>>>>>>> easier to complete the future milestones instead of harder. > > > >>>>>>>> > > > >>>>>>>> "But what if we get it wrong" is not something we can act on. > So > > > >>>> long > > > >>>>>>> as > > > >>>>>>>> you are happy with the choice of JSON as the serialization > then > > > >>>> since > > > >>>>>>> we've > > > >>>>>>>> got a version in the record we can change this relatively > easily > > > >> in > > > >>>> the > > > >>>>>>>> future - we'll know when we're dealing with an older record > (we > > > >>>> won't > > > >>>>>>> have > > > >>>>>>>> to guess) and can write a format migrator (either to upgrade > in > > > >>>>>>> memory, or > > > >>>>>>>> to upgrade the record in place. And as a db "migration" or as > a > > > >>>> upgrade > > > >>>>>>>> command) > > > >>>>>>>> > > > >>>>>>> I'm happy with JSON, my main point about "what if we get it > > wrong" > > > >> is > > > >>>>>>> that > > > >>>>>>> I feel like more planning work could have been put into this > > > project > > > >>>> vs > > > >>>>>>> just a couple of rough high-level target "milestones". This > > > >>>> significantly > > > >>>>>>> increases the risk of the project long term in my eyes. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>>> > > > >>>>>>>> We've based our serialization on Uber's Piper work < > > > >>>>>>>> https://eng.uber.com/managing-data-workflows-at-scale/ < > > > >>>>>>>> https://eng.uber.com/managing-data-workflows-at-scale/>> (we > > had > > > >> a > > > >>>>>>> call > > > >>>>>>>> with Alex at Uber before starting any of this work) so we've > > piggy > > > >>>>>>> backed > > > >>>>>>>> of their hard work and months of running something very close > to > > > >>>> this > > > >>>>>>>> format in production. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>>> > > > >>>>>>> This is great to hear! > > > >>>>>>> > > > >>>>>>> > > > >>>>>>>> Also let's not forget: right now we aren't storing versions, > and > > > >> we > > > >>>>>>> still > > > >>>>>>>> have DAGs on disk. If we do get it massively wrong (which I > > don't > > > >>>>>>> think we > > > >>>>>>>> have) we can just drop this entire table and start again. This > > is > > > >>>> all > > > >>>>>>> an > > > >>>>>>>> internal implementation detail that is not exposed via an API. > > > >> What > > > >>>>>>> better > > > >>>>>>>> way to see if this is right than by running it at scale in > > > >>>> production > > > >>>>>>>> across 1000s of customers (Astronomer+Cloud Composer) > > > >>>>>>>> > > > >>>>>>> If we have thought through future milestones and made sure that > > we > > > >> can > > > >>>>>>> switch to different formats that could required for them > easily, > > > >> then > > > >>>> I'm > > > >>>>>>> not a blocker on this front. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>>> Can you be more specific about you are worried about here? > It's > > a > > > >>>> bit > > > >>>>>>> hard > > > >>>>>>>> to put general "but what if we get it wrong" fears to rest. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel < > > > >> alex.guz...@airbnb.com > > > >>>>>>>> .invalid> > > > >>>>>>>>> We don't need to have the future plan implemented > > > >>>>>>>>> completely but it would be nice to see more detailed notes > > about > > > >>>> how > > > >>>>>>> this > > > >>>>>>>>> will play out in the future. > > > >>>>>>>> > > > >>>>>>>> Fair point. Our rough long-term plan: > > > >>>>>>>> > > > >>>>>>>> - Add versioning to the UI, likely by storing a new version > > linked > > > >>>> from > > > >>>>>>>> the DagRun, (but only when it changes, not every time to avoid > > DB > > > >>>>>>> bloat - > > > >>>>>>>> multiple dag_run rows will point at the same serialized blob). > > > >>>>>>>> > > > >>>>>>>> - Rework the scheduler use the serialized version, not the > > Simple* > > > >>>>>>>> version. assuming we can do it and keep the scheduler > performing > > > >>>> well. > > > >>>>>>>> (i.e. that ser/deser time isn't significant for the scheduler > > > >> loop) > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> - Massively rework how scheduler creates TaskInstances (right > > now > > > >>>> this > > > >>>>>>> is > > > >>>>>>>> done in the Dag parsing process, not the main scheduler) and > dag > > > >>>> runs. > > > >>>>>>> We > > > >>>>>>>> have to keep an eye on scheduler performance as we do this. If > > we > > > >>>> can > > > >>>>>>> pull > > > >>>>>>>> this back in to the main scheduler then this leads us towards > > the > > > >>>> next > > > >>>>>>>> points: > > > >>>>>>>> > > > >>>>>>>> - Split out the updating of serialized dag/parsing of DAGs > from > > > >> the > > > >>>>>>>> scheduler loop so it can be a separate component/subprocess > > (which > > > >>>> is > > > >>>>>>> on by > > > >>>>>>>> default). This gets us towards being able to submit DAG > > > >> definitions > > > >>>>>>> via an > > > >>>>>>>> API if we wanted. (There's a lot to work out to get that > > working, > > > >>>>>>> mostly > > > >>>>>>>> around running the DAG). > > > >>>>>>>> > > > >>>>>>>> - Once the scheduler loop/logic is not tied to parsing the > DAGs > > > >> then > > > >>>>>>> this > > > >>>>>>>> also heads us towards being able to run multiple schedulers. > > (This > > > >>>>>>> isn't a > > > >>>>>>>> requirement on HA scheduler, but it makes them easier to > "scale > > > >>>> out".) > > > >>>>>>>> > > > >>>>>>>> Hope this answers people's questions, and thanks for the > > feedback > > > >>>>>>>> > > > >>>>>>>> Alex and Dan: Does this give enough information for you to > > change > > > >>>> your > > > >>>>>>>> vote? > > > >>>>>>>> > > > >>>>>>> I think as long as we combine the serialized DAG > representations > > > and > > > >>>> have > > > >>>>>>> some confidence that the data format will be easy to switch to > > > >> future > > > >>>>>>> potential formats we would want (kind of hard to do without a > > more > > > >>>>>>> detailed > > > >>>>>>> future plan...), then I would happily change my vote. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>>> -ash > > > >>>>>>>> > > > >>>>>>>>> On 15 Oct 2019, at 21:14, Driesprong, Fokko > > > >> <fo...@driesprong.frl > > > >>>>> > > > >>>>>>>> wrote: > > > >>>>>>>>> > > > >>>>>>>>> Big +1 from my side, looking forward to make this happen. > > > >>>>>>>>> > > > >>>>>>>>> Two sides that aren't completely clear to me: > > > >>>>>>>>> > > > >>>>>>>>> - Are we going to extend the existing data model, to allow > the > > > >>>>>>> RDBMS to > > > >>>>>>>>> optimize queries on fields that we use a lot? > > > >>>>>>>>> - How are we going to do state evolution when we extend the > > > >> JSON > > > >>>>>>> model > > > >>>>>>>>> > > > >>>>>>>>> I have good confidence that we'll solve this along the way. > > > >>>>>>>>> > > > >>>>>>>>> Cheers, Fokko > > > >>>>>>>>> > > > >>>>>>>>> Op di 15 okt. 2019 om 21:29 schreef Dan Davydov > > > >>>>>>>>> <ddavy...@twitter.com.invalid>: > > > >>>>>>>>> > > > >>>>>>>>>> I have been following it from the beginning as well. I > > > >> understand > > > >>>>>>> there > > > >>>>>>>>>> would be short-term wins for some users (I don't think a > huge > > > >>>>>>> amount of > > > >>>>>>>>>> users?), but I still feel like we are being a bit > > short-sighted > > > >>>>>>> here and > > > >>>>>>>>>> that we are creating more work for ourselves and potentially > > > >> our > > > >>>>>>> users > > > >>>>>>>> in > > > >>>>>>>>>> the future. I also feel like there will be side effects to > > > >> users > > > >>>> as > > > >>>>>>>> well, > > > >>>>>>>>>> many of which don't care about the webserver scalability, > such > > > >> as > > > >>>>>>> bugs > > > >>>>>>>>>> caused by the addition of the new webserver representation. > I > > > >>>> think > > > >>>>>>>> without > > > >>>>>>>>>> a design that is much larger in scope I wouldn't feel > > > >> comfortable > > > >>>>>>> moving > > > >>>>>>>>>> forward with this AIP. > > > >>>>>>>>>> > > > >>>>>>>>>> On Tue, Oct 15, 2019 at 3:21 PM Jarek Potiuk < > > > >>>>>>> jarek.pot...@polidea.com> > > > >>>>>>>>>> wrote: > > > >>>>>>>>>> > > > >>>>>>>>>>> Hello Dan, Alex, > > > >>>>>>>>>>> > > > >>>>>>>>>>> I believe all the points you make are super-valid ones. But > > > >>>> maybe > > > >>>>>>> you > > > >>>>>>>> are > > > >>>>>>>>>>> missing the full context a bit. > > > >>>>>>>>>>> > > > >>>>>>>>>>> I followed the original discussion > > > >>>>>>>>>>> < > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>> > > > >> > > > > > > https://lists.apache.org/thread.html/a2d426f93c0f4e5f0347627308638b59ca4072fd022a42af1163e34a@%3Cdev.airflow.apache.org%3E > > > >>>>>>>>>>>> > > > >>>>>>>>>>> from the very beginning and took part in the initial > > > >> discussions > > > >>>>>>> when > > > >>>>>>>>>> this > > > >>>>>>>>>>> topic was raised. From the discussion it is quite clear to > me > > > >>>> that > > > >>>>>>> this > > > >>>>>>>>>>> mostly a "tactical" approach to implement something that is > > > >>>>>>>> backportable > > > >>>>>>>>>> to > > > >>>>>>>>>>> 1.10 and rather quick to implement. This is targeted to > make > > > >>>> users > > > >>>>>>> more > > > >>>>>>>>>>> happy with their 1.10 version without the timing > uncertainty > > > >> and > > > >>>>>>> effort > > > >>>>>>>>>> of > > > >>>>>>>>>>> migration to 2.0. It solves the major pain point of > stability > > > >> of > > > >>>>>>> the UI > > > >>>>>>>>>> in > > > >>>>>>>>>>> case there are complex DAGs for which parsing crashes the > > > >>>>>>> webserver. > > > >>>>>>>>>>> Like in "being nice to your users". > > > >>>>>>>>>>> > > > >>>>>>>>>>> There will be a separate effort to make pretty much all of > > the > > > >>>>>>> things > > > >>>>>>>> you > > > >>>>>>>>>>> mentioned in 2.0 in a non-backportable way as it requires > far > > > >>>> too > > > >>>>>>> many > > > >>>>>>>>>>> changes in the way how Airflow works internally. > > > >>>>>>>>>>> > > > >>>>>>>>>>> Maybe it needs some more explanation + long term plan that > > > >>>> follows > > > >>>>>>> in > > > >>>>>>>> the > > > >>>>>>>>>>> AIP itself to explain it to those who have not followed the > > > >>>> initial > > > >>>>>>>>>>> discussion, but I think it's fully justified change. > > > >>>>>>>>>>> > > > >>>>>>>>>>> J. > > > >>>>>>>>>>> > > > >>>>>>>>>>> On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel < > > > >>>>>>> alex.guz...@airbnb.com > > > >>>>>>>>>>> .invalid> > > > >>>>>>>>>>> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>>>> -1 (binding) > > > >>>>>>>>>>>> Good points made by Dan. We don't need to have the future > > > >> plan > > > >>>>>>>>>>> implemented > > > >>>>>>>>>>>> completely but it would be nice to see more detailed notes > > > >>>> about > > > >>>>>>> how > > > >>>>>>>>>> this > > > >>>>>>>>>>>> will play out in the future. We shouldn't walk into a > system > > > >>>> that > > > >>>>>>>>>> causes > > > >>>>>>>>>>>> more pain in the future. (I can't say for sure that it > does, > > > >>>> but I > > > >>>>>>>>>> can't > > > >>>>>>>>>>>> say that it doesn't either). I don't think the proposal is > > > >>>>>>> necessarily > > > >>>>>>>>>>>> wrong or bad, but I think we need some more detailed > > planning > > > >>>>>>> around > > > >>>>>>>>>>> future > > > >>>>>>>>>>>> milestones. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> On Tue, Oct 15, 2019 at 12:04 PM Dan Davydov > > > >>>>>>>>>>> <ddavy...@twitter.com.invalid > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>> wrote: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>>> -1 (binding), this may sound a bit FUD-y but I don't feel > > > >> this > > > >>>>>>> has > > > >>>>>>>>>> been > > > >>>>>>>>>>>>> thought through enough... > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Having both a SimpleDagBag representation and the JSON > > > >>>>>>> representation > > > >>>>>>>>>>>>> doesn't make sense to me at the moment: *"**Quoting from > > > >>>> Airflow > > > >>>>>>>>>> code, > > > >>>>>>>>>>> it > > > >>>>>>>>>>>>> is “a simplified representation of a DAG that contains > all > > > >>>>>>> attributes > > > >>>>>>>>>>>>> required for instantiating and scheduling its associated > > > >>>>>>> tasks.”. It > > > >>>>>>>>>>> does > > > >>>>>>>>>>>>> not contain enough information required by the > webserver.". > > > >>>> *Why > > > >>>>>>> not > > > >>>>>>>>>>>> create > > > >>>>>>>>>>>>> a representation that can be used by both? This is going > to > > > >>>> be a > > > >>>>>>> big > > > >>>>>>>>>>>>> headache to both understand and work with in the codebase > > > >>>> since > > > >>>>>>> it > > > >>>>>>>>>> will > > > >>>>>>>>>>>> be > > > >>>>>>>>>>>>> another definition that we need to keep in sync. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Not sure if fileloc/fileloc_hash is the right solution, > the > > > >>>>>>> longterm > > > >>>>>>>>>>>>> solution I am imagining has clients responsible for > > > >> uploading > > > >>>>>>> DAGs > > > >>>>>>>>>>> rather > > > >>>>>>>>>>>>> than retrieving them from the filesystem so > > > >>>> fileloc/fileloc_hash > > > >>>>>>>>>>> wouldn't > > > >>>>>>>>>>>>> even exist (dag_id would be used for indexing here). > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Versioning isn't really addressed either (e.g. if a DAG > > > >>>> topology > > > >>>>>>>>>>> changes > > > >>>>>>>>>>>>> with some update you want to be able to show both the old > > > >> and > > > >>>> new > > > >>>>>>>>>> ones, > > > >>>>>>>>>>>> or > > > >>>>>>>>>>>>> at least have a way to deal with them), there is an > > argument > > > >>>> that > > > >>>>>>>>>> this > > > >>>>>>>>>>> is > > > >>>>>>>>>>>>> acceptable since it isn't currently addressed now, but > I'm > > > >>>>>>> worried > > > >>>>>>>>>> that > > > >>>>>>>>>>>>> large schema changes should think through the long term > > > >> plan a > > > >>>>>>> bit > > > >>>>>>>>>>> more. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> I feel like getting this wrong is going to make it very > > hard > > > >>>> to > > > >>>>>>>>>> migrate > > > >>>>>>>>>>>>> things in the future, and make the codebase worse > (another > > > >>>>>>>>>>> representation > > > >>>>>>>>>>>>> of DAGs that we need to support/understand/keep parity > > for). > > > >>>> If > > > >>>>>>> I'm > > > >>>>>>>>>>> wrong > > > >>>>>>>>>>>>> about this then I would be more willing to +1 this > change. > > > >>>> This > > > >>>>>>> doc > > > >>>>>>>>>> is > > > >>>>>>>>>>> a > > > >>>>>>>>>>>>> 1-2 pager and I feel like it is not thorough or deep > enough > > > >>>> and > > > >>>>>>>>>> doesn't > > > >>>>>>>>>>>>> give me enough confidence that the work in the PR is > going > > > >> to > > > >>>>>>> make it > > > >>>>>>>>>>>>> easier to complete the future milestones instead of > harder. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> On Tue, Oct 15, 2019 at 11:26 AM Kamil Breguła < > > > >>>>>>>>>>>> kamil.breg...@polidea.com> > > > >>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>>> +1 (binding) > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> On Tue, Oct 15, 2019 at 2:57 AM Kaxil Naik < > > > >>>> kaxiln...@gmail.com > > > >>>>>>>> > > > >>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Hello, Airflow community, > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> This email calls for a vote to add the DAG > Serialization > > > >>>>>>> feature > > > >>>>>>>>>> at > > > >>>>>>>>>>>>>>> https://github.com/apache/airflow/pull/5743. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> *AIP*: > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>> > > > >> > > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> *Previous Mailing List discussion*: > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>> > > > >> > > > > > > https://lists.apache.org/thread.html/65d282368e0a7c19815badb8b1c6c8d72b0975ce94f601e13af44f74@%3Cdev.airflow.apache.org%3E > > > >>>>>>>>>>>>>>> . > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> *Authors*: Kaxil Naik, Zhou Fang, Ash-Berlin Taylor > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> *Summary*: > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> - DAGs are serialized using JSON format and stored in > a > > > >>>>>>>>>>>>> SerializedDag > > > >>>>>>>>>>>>>>> table > > > >>>>>>>>>>>>>>> - The Webserver now instead of having to parse the DAG > > > >>>> file > > > >>>>>>>>>>> again, > > > >>>>>>>>>>>>>>> reads the serialized DAGs in JSON, de-serializes them > > > >> and > > > >>>>>>>>>>> creates > > > >>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>> DagBag and uses it to show in the UI. > > > >>>>>>>>>>>>>>> - Instead of loading an entire DagBag when the > WebServer > > > >>>>>>>>>> starts > > > >>>>>>>>>>> we > > > >>>>>>>>>>>>>>> only load each DAG on demand from the Serialized Dag > > > >>>> table. > > > >>>>>>>>>> This > > > >>>>>>>>>>>>> helps > > > >>>>>>>>>>>>>>> reduce Webserver startup time and memory. The > reduction > > > >> is > > > >>>>>>>>>>> notable > > > >>>>>>>>>>>>>> when you > > > >>>>>>>>>>>>>>> have a large number of DAGs. > > > >>>>>>>>>>>>>>> - A JSON Schema has been defined and we validate the > > > >>>>>>>>>> serialized > > > >>>>>>>>>>>> dag > > > >>>>>>>>>>>>>>> before writing it to the database > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> [image: image.png] > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> A PR (https://github.com/apache/airflow/pull/5743) is > > > >> ready > > > >>>>>>> for > > > >>>>>>>>>>>> review > > > >>>>>>>>>>>>>>> from the committers and community. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> We also have a WIP PR ( > > > >>>>>>>>>> https://github.com/apache/airflow/pull/5992 > > > >>>>>>>>>>> ) > > > >>>>>>>>>>>> to > > > >>>>>>>>>>>>>>> backport this feature to 1.10.* branch. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> A big thank you to Zhou and Ash for their continuous > help > > > >> in > > > >>>>>>>>>>>> improving > > > >>>>>>>>>>>>>>> this feature/PR. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> This email is formally calling for a vote to accept the > > > >> AIP > > > >>>> and > > > >>>>>>>>>> PR. > > > >>>>>>>>>>>>>> Please > > > >>>>>>>>>>>>>>> note that we will update the PR / feature to fix bugs > if > > > >> we > > > >>>>>>> find > > > >>>>>>>>>>> any. > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Cheers, > > > >>>>>>>>>>>>>>> Kaxil > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> -- > > > >>>>>>>>>>> > > > >>>>>>>>>>> Jarek Potiuk > > > >>>>>>>>>>> Polidea <https://www.polidea.com/> | Principal Software > > > >>>> Engineer > > > >>>>>>>>>>> > > > >>>>>>>>>>> M: +48 660 796 129 <+48660796129> > > > >>>>>>>>>>> [image: Polidea] <https://www.polidea.com/> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>> > > > >>> > > > >> > > > > > > > > > > > -- > > Jarek Potiuk > Polidea <https://www.polidea.com/> | Principal Software Engineer > > M: +48 660 796 129 <+48660796129> > [image: Polidea] <https://www.polidea.com/> >