Also, AIP-44 - the DB isolation mode - is much more detailed now and ready for deeper discussion if needed. https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Database+API
On Tue, Dec 14, 2021 at 12:07 PM Jarek Potiuk <[email protected]> wrote:
>
> And just to add to that.
>
> Thanks again for the initial comments and for pushing us to provide more details. That allowed us to discuss and focus on many of the aspects that were raised, and we have many more answers now:
>
> * First of all, we focused on making sure the impact on the existing code and "behavior" of Airflow is minimal. In fact, there should be virtually no change vs. current behavior when DB isolation is disabled.
> * Secondly, we've done a full inventory of what the API needs to look like, and (not surprisingly) it turned out that the current REST-style API is good for part of it, but most of the "logic" of Airflow can only be done efficiently if we go to an RPC-style API. However, we propose that authorization/exposure of this API works the same way as in the current REST API, which will allow us to reuse a big part of the infrastructure we already have.
> * Thirdly, we took to heart the comments about having to maintain pretty much the same logic in a few different places. That's an obvious maintenance problem. The proposal we came up with addresses it: we are going to keep the logic of Airflow internals in one place only, and simply route smartly to where the logic will be executed.
> * Regarding the performance impact: we described the deployment options that our proposal makes available. We do not want to favor one deployment option over another, but we made sure the architecture is designed so that you can choose which deployment is good for you - "no isolation", "partial isolation", "full isolation" - each with different performance/resource characteristics, and all of them fully horizontally scalable and easily manageable.
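The "keep the logic in one place and route smartly" idea from the bullet list above could be sketched roughly as below. This is only an illustration of the routing concept; none of the names (the decorator, `DB_ISOLATION`, `_rpc_call`) come from the AIP itself.

```python
# Hypothetical sketch of "one implementation, smart routing".
# Names are invented for illustration; they are not from AIP-44.
import functools

DB_ISOLATION = False  # in a real deployment this would come from airflow.cfg

def internal_api_call(func):
    """Run func locally when direct DB access is allowed; otherwise
    forward the same call to the internal API component."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not DB_ISOLATION:
            return func(*args, **kwargs)               # direct DB path
        return _rpc_call(func.__name__, args, kwargs)  # isolated path
    return wrapper

def _rpc_call(method, args, kwargs):
    # Placeholder for the HTTP round-trip to the internal API, which
    # would execute the very same decorated function server-side.
    raise NotImplementedError("RPC transport not shown in this sketch")

@internal_api_call
def get_dag_run(dag_id, execution_date):
    # Single authoritative implementation of the DB logic (stubbed here).
    return {"dag_id": dag_id, "execution_date": execution_date}
```

The point of the sketch: callers never branch on the isolation setting themselves, so there is exactly one copy of the DB logic regardless of deployment mode.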
> We look forward to comments, also on AIP-43:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation
> AIP-43 is a prerequisite to AIP-44 and they both work together.
>
> We will also think about and discuss more follow-up AIPs once we get those approved (hopefully ;) ). Multi-tenancy is a "long haul", and while those two AIPs are foundational building blocks, there are at least a few more follow-up AIPs needed before we can say "we're done with multi-tenancy" :).
>
> J.
>
> On Tue, Dec 14, 2021 at 9:58 AM Mateusz Henc <[email protected]> wrote:
> >
> > Hi,
> >
> > As promised, we (credits to Jarek) updated the AIPs, added more details, did the inventory, and changed the way API endpoints are generated.
> >
> > We also renamed it to Airflow Internal API, so the URL has changed:
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API
> >
> > Please take a look; any comments are highly appreciated.
> >
> > Best regards,
> > Mateusz Henc
> >
> > On Mon, Dec 6, 2021 at 3:09 PM Mateusz Henc <[email protected]> wrote:
> >>
> >> Hi,
> >> Thank you, Ash, for your feedback (in both AIPs).
> >>
> >> We are working to address your concerns and will update the AIPs in a few days. I will let you know when it's done.
> >>
> >> Best regards,
> >> Mateusz Henc
> >>
> >> On Fri, Dec 3, 2021 at 7:27 PM Jarek Potiuk <[email protected]> wrote:
> >>>
> >>> Cool. Thanks for the guidance.
> >>>
> >>> On Fri, Dec 3, 2021 at 6:37 PM Ash Berlin-Taylor <[email protected]> wrote:
> >>> >
> >>> > - Make an inventory: it doesn't need to be exhaustive, but a representative sample.
> >>> > - More clearly define _what_ the API calls return: object type, methods on them, etc.
> >>> >
> >>> > From the AIP you have this example:
> >>> >
> >>> >     def get_dag_run(self, dag_id, execution_date):
> >>> >         return self.db_client.get_dag_run(dag_id, execution_date)
> >>> >
> >>> > What does that _actually_ return? What capabilities does it have?
> >>> >
> >>> > (I have other thoughts, but those are less fundamental and can be discussed later.)
> >>> >
> >>> > -ash
> >>> >
> >>> > On Fri, Dec 3 2021 at 18:20:21 +0100, Jarek Potiuk <[email protected]> wrote:
> >>> >
> >>> > Surely - if you think we need to do some more work to build confidence, that's fine. I am sure we can improve it to the level that we will not have to do full performance tests, and you will be confident in the direction. Just to clarify your concerns and make sure we are on the same page - as I understand it, we should:
> >>> >
> >>> > * make an inventory of the "actual changes" the proposal will involve in the database-low-level code of Airflow;
> >>> > * based on that, assess whether those changes are likely (or unlikely) to have a performance impact;
> >>> > * if we assess that an impact is likely, run at least rudimentary performance tests to prove that it is manageable.
> >>> >
> >>> > I think that might be a good exercise to do. Does it sound about right? Or do you have any concerns about certain architectural decisions taken? No problem with Friday, but if we get answers today, I think it will give us time to think about it over the weekend and address it next week.
> >>> >
> >>> > J.
> >>> >
> >>> > On Fri, Dec 3, 2021 at 5:56 PM Ash Berlin-Taylor <[email protected]> wrote:
> >>> >
> >>> > This is a fundamental change to the architecture, with significant possible impacts on performance, and it likely requires touching a large portion of the code base.
> >>> > Sorry, you're going to have to expand on the details first and work out what would actually be involved and what the impacts will be. Right now I have serious reservations about this approach, so I can't agree on the high-level proposal without an actual proposal. (The current document is, at best, an outline, not an actual proposal.) Sorry to be a grinch right before the weekend.
> >>> >
> >>> > Ash
> >>> >
> >>> > On Thu, Dec 2 2021 at 22:47:34 +0100, Jarek Potiuk <[email protected]> wrote:
> >>> >
> >>> > Oh yeah - good point, and we spoke about performance testing/implications. Performance is something we were planning to discuss as the next step, once we get a general "OK" on the direction - we just want to make sure there are no "huge" blockers in the way this is proposed, and to clear up any doubts first, so that the investment in the performance part makes sense. We do not want to spend a lot of time on getting the tests done and a detailed inventory of methods/API calls, only to find out that this is generally a "bad direction".
> >>> >
> >>> > Just to clarify again - we also considered an alternative option: automatically mapping all the DB methods into remote calls. But we dropped that idea, precisely for reasons of performance and transaction integrity. So we are NOT mapping DB calls into API calls; those will be "logical operations" on the database. Generally speaking, most of the API calls for the "airflow system-level but executed in worker" cases will be "coarse" rather than fine-grained. For example, the aforementioned "mini scheduler" - where we want to make a single API call and run the whole of it on the DB-API side. So there the performance impact is very limited, IMHO. And the same applies if we see any other "logic" like that in other parts of the code (zombie detection, as an example).
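The coarse-vs-fine-grained distinction described above can be illustrated with a small counting sketch. All names are invented for illustration (only `_run_mini_scheduler_on_child_tasks` echoes the real Airflow method mentioned in the thread); the stub simply counts remote round-trips to show why the rejected per-DB-call mapping is more expensive.

```python
class StubAPI:
    """Counts network round-trips; stands in for an internal-API client."""
    def __init__(self):
        self.round_trips = 0
    def get_task_instance(self, ti_id):
        self.round_trips += 1
        return ti_id
    def get_downstream_tasks(self, ti):
        self.round_trips += 1
        return ["task_a", "task_b"]
    def set_state(self, task, state):
        self.round_trips += 1
    def run_mini_scheduler_on_child_tasks(self, ti_id):
        # One call; the whole transaction would run on the DB-API side.
        self.round_trips += 1
        return "done"

def mini_scheduler_fine_grained(api, ti_id):
    """Rejected approach: every DB call becomes its own remote call,
    and the transaction spans multiple round-trips."""
    ti = api.get_task_instance(ti_id)
    for task in api.get_downstream_tasks(ti):
        api.set_state(task, "scheduled")

def mini_scheduler_coarse(api, ti_id):
    """Proposed approach: one logical RPC-style operation."""
    return api.run_mini_scheduler_on_child_tasks(ti_id)
```

With two downstream tasks, the fine-grained version makes four round-trips while the coarse version makes one, which is the performance and transaction-integrity argument made above.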
> >>> > We plan to make a detailed inventory of those once we get a general "looks good" for the direction. For now we did some "rough" checking, and it seems a plausible and quite doable approach.
> >>> >
> >>> > One more note - the "fine-grained" calls (variable update/retrieval, connection update/retrieval) via the REST API will still be used by the user's code (parsing DAGs, operators, workers and callbacks). We also plan to make sure that none of the "community" operators use "non-blessed" DB calls (we can do it in our CI). So at the end of the exercise, all operators, hooks, etc. from the community will be guaranteed to only use the DB APIs that are available in the "DB-API" module. But there I do not expect much of any performance penalty, as those are very fast and rare operations (and the good thing there is that we can cache the results in workers/DAG processing).
> >>> >
> >>> > J.
> >>> >
> >>> > On Thu, Dec 2, 2021 at 7:16 PM Andrew Godwin <[email protected]> wrote:
> >>> >
> >>> > Ah, my bad, I missed that. I'd still like to see discussion of the performance impacts, though.
> >>> >
> >>> > On Thu, Dec 2, 2021 at 11:14 AM Ash Berlin-Taylor <[email protected]> wrote:
> >>> >
> >>> > The scheduler was excluded from the components that would use the DB-API - the mini scheduler is the odd one out here, as it (currently) runs on the worker but shares much of the code from the scheduling path.
> >>> >
> >>> > -a
> >>> >
> >>> > On 2 December 2021 17:56:40 GMT, Andrew Godwin <[email protected]> wrote:
> >>> >
> >>> > I would also like to see some discussion in this AIP about how the data is going to be serialised to and from the database instances (obviously Connexion is involved, but I presume more transformation code is needed than that) and the potential slowdown this would cause.
> >>> > In my experience, a somewhat direct ORM mapping like this is going to result in considerably slower times for any complex operation that's touching a few hundred rows. Is there a reason this is being proposed for the scheduler code, too? In my mind, the best approach to multitenancy would be to remove all user-supplied code from the scheduler and leave it with direct DB access, rather than trying to indirect all scheduler access through another API layer.
> >>> >
> >>> > Andrew
> >>> >
> >>> > On Thu, Dec 2, 2021 at 10:29 AM Jarek Potiuk <[email protected]> wrote:
> >>> >
> >>> > Yeah - I think, Ash, you are completely right that we need some more "detailed" clarification. I believe I know what you are - rightfully - afraid of (re impact on the code), and maybe we have not done a good job of explaining some of the assumptions we had when we worked on it with Mateusz. Simply, it was not clear that our aim is to absolutely minimise the impact on the "internal DB transactions" done in schedulers and workers. The idea is that the change will at most result in moving execution of the transactions to another process, but not in changing what the DB transactions do internally. Actually, this was one of the reasons for the "alternative" approach we discussed (you can see it in the document) - hijacking the "sqlalchemy session". That is far too low-level, and the aim of the "DB-API" is NOT to replace direct DB calls (hence we need to figure out a better name). The API is there to provide a "scheduler logic" API and "REST access to Airflow primitives like dags/tasks/variables/connections", etc. As an example (which we briefly talked about in Slack), the "_run_mini_scheduler_on_child_tasks" case (https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L225-L274) is one that we would put in the doc.
> >>> > As we thought of it, this is a "single DB-API operation". Those are not pure REST calls, of course; they are more RPC-like calls. That is why even initially I thought of separating the API completely. But since there are a lot of common "primitive" calls that we can re-use, I think having a separate DB-API component which will re-use the Connexion implementation, replacing authentication with the custom worker <> DB-API authentication, is the way to go. And yes, if we agree on the general idea, we need to choose the best way to "connect" the REST API we have with the RPC-kind of API we need for some cases in workers. But we wanted to make sure we are on the same page with the direction. And yes, it means that the DB-API will potentially have to handle quite a number of DB operations (and that it has to be replicable and scalable as well) - but the DB-API will be "stateless", similarly to the webserver, so it will be scalable by definition. And yes, performance tests will be part of the POC - likely even before we finally ask for votes. So in short:
> >>> >
> >>> > * no modification or impact on current scheduler behaviour when DB isolation is disabled;
> >>> > * only higher-level methods will be moved out to the DB-API, and we will reuse existing "REST" APIs where it makes sense;
> >>> > * we aim to have "0" changes to the logic of processing - both in DAG processing logic and in the DB-API.
> >>> >
> >>> > We think that with the architecture we proposed it's perfectly doable. I hope this clarifies things a bit. Once we agree on the general direction, we will definitely work on adding more details and clarification (we actually already have a lot of that, but we just wanted to start with explaining the idea and go into more detail later, once we are sure there are no "high-level" blockers from the community).
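The split described above - reusing REST-style endpoints for common "primitives" while adding RPC-like endpoints for coarse scheduler logic - could be sketched as a simple routing table. All method names and URL paths here are invented for illustration; the AIP does not fix them.

```python
# Hypothetical routing table: which endpoint style serves which call.
# Method names and paths are illustrative only, not from the AIP.
REST_METHODS = {"get_variable", "get_connection", "get_dag_run"}
RPC_METHODS = {"run_mini_scheduler_on_child_tasks", "detect_zombies"}

def route(method_name: str) -> str:
    """Map a logical method to the endpoint style that would serve it."""
    if method_name in REST_METHODS:
        # Fine-grained primitives: reuse the existing REST-style API.
        return f"GET /api/v1/{method_name}"
    if method_name in RPC_METHODS:
        # Coarse scheduler logic: RPC-style call on the same component.
        return f"POST /internal_api/v1/rpc/{method_name}"
    raise ValueError(f"{method_name} is not exposed by the internal API")
```

Both kinds of endpoints would live on the same stateless component, which is what makes it horizontally scalable "by definition", as argued above.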
> >>> > J.
> >>> >
> >>> > On Thu, Dec 2, 2021 at 4:46 PM Ash Berlin-Taylor <[email protected]> wrote:
> >>> > >
> >>> > > > I just provided a general idea for the approach - but if you want me to put more examples then I am happy to do that
> >>> > >
> >>> > > Yes please. It is too general for me, and I can't work out what effect it would actually have on the code base, especially how it would look with the config option to enable/disable direct DB access.
> >>> > >
> >>> > > -ash
> >>> > >
> >>> > > On Thu, Dec 2 2021 at 16:36:57 +0100, Mateusz Henc <[email protected]> wrote:
> >>> > >
> >>> > > Hi,
> >>> > > I am sorry if it is not clear enough; let me try to explain it here, so maybe it sheds more light on the idea. See my comments below.
> >>> > >
> >>> > > On Thu, Dec 2, 2021 at 3:39 PM Ash Berlin-Taylor <[email protected]> wrote:
> >>> > >>
> >>> > >> I'm sorry to say it, but this proposal just doesn't contain enough detail to say what the actual changes to the code would be, and what the impact would be.
> >>> > >>
> >>> > >> To take the one example you have so far:
> >>> > >>
> >>> > >>     def get_dag_run(self, dag_id, execution_date):
> >>> > >>         return self.db_client.get_dag_run(dag_id, execution_date)
> >>> > >>
> >>> > >> So from this snippet I'm guessing it would be used like this:
> >>> > >>
> >>> > >>     dag_run = db_client.get_dag_run(dag_id, execution_date)
> >>> > >>
> >>> > >> What type of object is returned?
> >>> > >
> >>> > > As it replaces:
> >>> > >
> >>> > >     dag_run = session.query(DagRun)
> >>> > >         .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
> >>> > >         .first()
> >>> > >
> >>> > > then the type of the object will be exactly the same (DagRun).
> >>> > >
> >>> > >> Do we need one API method per individual query we have in the source?
> >>> > >
> >>> > > No, as explained by the sentence: "The method may be extended, accepting more optional parameters to avoid having too many similar implementations."
> >>> > >
> >>> > >> Which components would use this new mode when it's enabled?
> >>> > >
> >>> > > You may read: "Airflow Database API is a new independent component of Airflow. It allows isolating some components (Worker, DagProcessor and Triggerer) from direct access to the DB."
> >>> > >
> >>> > >> But what you haven't said the first thing about is what _other_ changes would be needed in the code. To take a fairly simple example:
> >>> > >>
> >>> > >>     dag_run = db_client.get_dag_run(dag_id, execution_date)
> >>> > >>     dag_run.queued_at = timezone.now()
> >>> > >>     # How do I save this?
> >>> > >>
> >>> > >> In short, you need to put a lot more detail into this before we can even have an idea of the full scope of the change this proposal would involve, and what code changes would be needed for components to work with and without this setting enabled.
> >>> > >
> >>> > > For this particular example, it depends on the intention of the code author:
> >>> > > - If this should be in a transaction, then I would actually introduce a new method like enqueue_dag_run(...) that would run these two steps on the Airflow DB API side.
> >>> > > - If not, then maybe just an "update_dag_run" method accepting the whole "dag_run" object and saving it to the DB.
> >>> > >
> >>> > > In general, we could take a naive approach, e.g. replace the code:
> >>> > >
> >>> > >     dag_run = session.query(DagRun)
> >>> > >         .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
> >>> > >         .first()
> >>> > >
> >>> > > with:
> >>> > >
> >>> > >     if self.db_isolation:
> >>> > >         dag_run = db_client.get_dag_run(dag_id, execution_date)
> >>> > >     else:
> >>> > >         dag_run = session.query(DagRun)
> >>> > >             .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
> >>> > >             .first()
> >>> > >
> >>> > > The problem is that the Airflow DB API would need to have the same implementation for the query - so, duplicated code. That's why we propose moving this code to the DBClient, which is also used by the Airflow DB API (in DB direct mode).
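The shared-DBClient idea described above can be sketched as follows. The class names echo the thread (`DBClient`, the `update_dag_run` suggestion), but everything else - the list-of-dicts "store" standing in for a SQLAlchemy session, the server class name - is invented to keep the sketch self-contained.

```python
# Sketch of the shared-DBClient idea: the query logic lives in exactly
# one place, so call sites never duplicate it behind an if/else.
# The "store" is a plain list standing in for a SQLAlchemy session.

class DBClient:
    """Owns the query logic. Used directly in DB-direct mode, and
    reused server-side by the Airflow DB API in isolation mode."""
    def __init__(self, store):
        self.store = store
    def get_dag_run(self, dag_id, execution_date):
        return next((row for row in self.store
                     if row["dag_id"] == dag_id
                     and row["execution_date"] == execution_date), None)
    def update_dag_run(self, dag_run):
        # Whole-object save, as suggested for the queued_at example.
        for i, row in enumerate(self.store):
            if row["dag_id"] == dag_run["dag_id"]:
                self.store[i] = dag_run

class AirflowDBAPIServer:
    """In isolation mode the server wraps the very same DBClient, so
    the query exists once rather than being duplicated per mode."""
    def __init__(self, store):
        self.db = DBClient(store)
    def get_dag_run(self, dag_id, execution_date):
        return self.db.get_dag_run(dag_id, execution_date)
```

Either path (direct `DBClient` use, or the server wrapping it) returns the same object for the same query, which is the "no duplicated implementation" property the thread is after.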
> >>> > >
> >>> > > I know there are many places where the code is much more complicated than a single query, but they must be handled one-by-one during the implementation; otherwise this AIP would be way too big.
> >>> > >
> >>> > > I just provided a general idea for the approach - but if you want me to put more examples then I am happy to do that.
> >>> > >
> >>> > > Best regards,
> >>> > > Mateusz Henc
> >>> > >
> >>> > >> On Thu, Dec 2 2021 at 14:23:56 +0100, Mateusz Henc <[email protected]> wrote:
> >>> > >>
> >>> > >> Hi,
> >>> > >> I just added a new AIP for running some Airflow components in DB-isolation mode, without direct access to the Airflow database; they will use a new API for this purpose.
> >>> > >>
> >>> > >> PTAL:
> >>> > >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Database+API
> >>> > >>
> >>> > >> Open question:
> >>> > >> I called it "Airflow Database API" - however, I feel it could be more than just an access layer for the database. So if you have a better name, please let me know; I am happy to change it.
> >>> > >>
> >>> > >> Best regards,
> >>> > >> Mateusz Henc
