Also, AIP-44 - the DB isolation mode - is much more detailed now and ready for deeper discussion if needed. https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Database+API
On Tue, Dec 14, 2021 at 12:07 PM Jarek Potiuk <[email protected]> wrote:
>
> And just to add to that.
>
> Thanks again for the initial comments and for pushing us to provide more details. That allowed us to discuss and focus on many of the aspects that were raised, and we have many more answers now:
>
> * First of all, we focused on making sure the impact on the existing code and "behavior" of Airflow is minimal. In fact, there should be virtually no change vs. current behavior when DB isolation is disabled.
> * Secondly, we've done a full inventory of what the API needs to look like, and (not surprisingly) it turned out that the current REST-style API is good for part of it, but most of the "logic" of Airflow can only be done efficiently if we go to an RPC-style API. However, we propose that authorization/exposure of this API works the same way as in the current REST API, which will allow us to reuse a big part of the infrastructure we already have.
> * Thirdly, we took to heart the comments about having to maintain pretty much the same logic in a few different places. That's an obvious maintenance problem. The proposal we came up with addresses it: we are going to keep the logic of Airflow internals in one place only, and simply route smartly to where the logic will be executed.
> * Regarding the performance impact: we described the deployment options that our proposal makes available. We do not want to favor one deployment option over another, but we made sure the architecture is designed so that you can choose which deployment is good for you - "no isolation", "partial isolation", "full isolation" - each with different performance/resource characteristics, and all of them fully horizontally scalable and easily manageable.
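The "keep the logic in one place and route smartly" idea from the bullet list above could be sketched roughly as below. This is only an illustration of the routing concept; none of the names (the decorator, `DB_ISOLATION`, `_rpc_call`) come from the AIP itself.

```python
# Hypothetical sketch of "one implementation, smart routing".
# Names are invented for illustration; they are not from AIP-44.
import functools

DB_ISOLATION = False  # in a real deployment this would come from airflow.cfg

def internal_api_call(func):
    """Run func locally when direct DB access is allowed; otherwise
    forward the same call to the internal API component."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not DB_ISOLATION:
            return func(*args, **kwargs)               # direct DB path
        return _rpc_call(func.__name__, args, kwargs)  # isolated path
    return wrapper

def _rpc_call(method, args, kwargs):
    # Placeholder for the HTTP round-trip to the internal API, which
    # would execute the very same decorated function server-side.
    raise NotImplementedError("RPC transport not shown in this sketch")

@internal_api_call
def get_dag_run(dag_id, execution_date):
    # Single authoritative implementation of the DB logic (stubbed here).
    return {"dag_id": dag_id, "execution_date": execution_date}
```

The point of the sketch: callers never branch on the isolation setting themselves, so there is exactly one copy of the DB logic regardless of deployment mode.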
> We look forward to comments, also on AIP-43:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation
> AIP-43 is a prerequisite to AIP-44 and they both work together.
>
> We will also think about and discuss more follow-up AIPs once we get those approved (hopefully ;) ). Multi-tenancy is a "long haul", and while those two AIPs are foundational building blocks, there are at least a few more follow-up AIPs needed before we can say "we're done with multi-tenancy" :).
>
> J.
>
> On Tue, Dec 14, 2021 at 9:58 AM Mateusz Henc <[email protected]> wrote:
> >
> > Hi,
> >
> > As promised, we (credits to Jarek) updated the AIPs, added more details, did the inventory, and changed the way API endpoints are generated.
> >
> > We also renamed it to Airflow Internal API, so the URL has changed:
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API
> >
> > Please take a look; any comments are highly appreciated.
> >
> > Best regards,
> > Mateusz Henc
> >
> > On Mon, Dec 6, 2021 at 3:09 PM Mateusz Henc <[email protected]> wrote:
> >>
> >> Hi,
> >> Thank you, Ash, for your feedback (in both AIPs).
> >>
> >> We are working to address your concerns and will update the AIPs in a few days. I will let you know when it's done.
> >>
> >> Best regards,
> >> Mateusz Henc
> >>
> >> On Fri, Dec 3, 2021 at 7:27 PM Jarek Potiuk <[email protected]> wrote:
> >>>
> >>> Cool. Thanks for the guidance.
> >>>
> >>> On Fri, Dec 3, 2021 at 6:37 PM Ash Berlin-Taylor <[email protected]> wrote:
> >>> >
> >>> > - Make an inventory: it doesn't need to be exhaustive, but a representative sample.
> >>> > - More clearly define _what_ the API calls return: object type, methods on them, etc.
> >>> >
> >>> > From the AIP you have this example:
> >>> >
> >>> >     def get_dag_run(self, dag_id, execution_date):
> >>> >         return self.db_client.get_dag_run(dag_id, execution_date)
> >>> >
> >>> > What does that _actually_ return? What capabilities does it have?
> >>> >
> >>> > (I have other thoughts, but those are less fundamental and can be discussed later.)
> >>> >
> >>> > -ash
> >>> >
> >>> > On Fri, Dec 3 2021 at 18:20:21 +0100, Jarek Potiuk <[email protected]> wrote:
> >>> >
> >>> > Surely - if you think we need to do some more work to build confidence, that's fine. I am sure we can improve it to the level that we will not have to do full performance tests, and you will be confident in the direction. Just to clarify your concerns and make sure we are on the same page - as I understand it, we should:
> >>> >
> >>> > * make an inventory of the "actual changes" the proposal will involve in the database-low-level code of Airflow;
> >>> > * based on that, assess whether those changes are likely (or unlikely) to have a performance impact;
> >>> > * if we assess that an impact is likely, run at least rudimentary performance tests to prove that it is manageable.
> >>> >
> >>> > I think that might be a good exercise to do. Does it sound about right? Or do you have any concerns about certain architectural decisions taken? No problem with Friday, but if we get answers today, I think it will give us time to think about it over the weekend and address it next week.
> >>> >
> >>> > J.
> >>> >
> >>> > On Fri, Dec 3, 2021 at 5:56 PM Ash Berlin-Taylor <[email protected]> wrote:
> >>> >
> >>> > This is a fundamental change to the architecture, with significant possible impacts on performance, and it likely requires touching a large portion of the code base.
> >>> > Sorry, you're going to have to expand on the details first and work out what would actually be involved and what the impacts will be. Right now I have serious reservations about this approach, so I can't agree on the high-level proposal without an actual proposal. (The current document is, at best, an outline, not an actual proposal.) Sorry to be a grinch right before the weekend.
> >>> >
> >>> > Ash
> >>> >
> >>> > On Thu, Dec 2 2021 at 22:47:34 +0100, Jarek Potiuk <[email protected]> wrote:
> >>> >
> >>> > Oh yeah - good point, and we spoke about performance testing/implications. Performance is something we were planning to discuss as the next step, once we get a general "OK" on the direction - we just want to make sure there are no "huge" blockers in the way this is proposed, and to clear up any doubts first, so that the investment in the performance part makes sense. We do not want to spend a lot of time on getting the tests done and a detailed inventory of methods/API calls, only to find out that this is generally a "bad direction".
> >>> >
> >>> > Just to clarify again - we also considered an alternative option: automatically mapping all the DB methods into remote calls. But we dropped that idea, precisely for reasons of performance and transaction integrity. So we are NOT mapping DB calls into API calls; those will be "logical operations" on the database. Generally speaking, most of the API calls for the "airflow system-level but executed in worker" cases will be "coarse" rather than fine-grained. For example, the aforementioned "mini scheduler" - where we want to make a single API call and run the whole of it on the DB-API side. So there the performance impact is very limited, IMHO. And the same applies if we see any other "logic" like that in other parts of the code (zombie detection, as an example).
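The coarse-vs-fine-grained distinction described above can be illustrated with a small counting sketch. All names are invented for illustration (only `_run_mini_scheduler_on_child_tasks` echoes the real Airflow method mentioned in the thread); the stub simply counts remote round-trips to show why the rejected per-DB-call mapping is more expensive.

```python
class StubAPI:
    """Counts network round-trips; stands in for an internal-API client."""
    def __init__(self):
        self.round_trips = 0
    def get_task_instance(self, ti_id):
        self.round_trips += 1
        return ti_id
    def get_downstream_tasks(self, ti):
        self.round_trips += 1
        return ["task_a", "task_b"]
    def set_state(self, task, state):
        self.round_trips += 1
    def run_mini_scheduler_on_child_tasks(self, ti_id):
        # One call; the whole transaction would run on the DB-API side.
        self.round_trips += 1
        return "done"

def mini_scheduler_fine_grained(api, ti_id):
    """Rejected approach: every DB call becomes its own remote call,
    and the transaction spans multiple round-trips."""
    ti = api.get_task_instance(ti_id)
    for task in api.get_downstream_tasks(ti):
        api.set_state(task, "scheduled")

def mini_scheduler_coarse(api, ti_id):
    """Proposed approach: one logical RPC-style operation."""
    return api.run_mini_scheduler_on_child_tasks(ti_id)
```

With two downstream tasks, the fine-grained version makes four round-trips while the coarse version makes one, which is the performance and transaction-integrity argument made above.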
> >>> > We plan to make a detailed inventory of those once we get a general "looks good" for the direction. For now we did some "rough" checking, and it seems a plausible and quite doable approach.
> >>> >
> >>> > One more note - the "fine-grained" calls (variable update/retrieval, connection update/retrieval) via the REST API will still be used by the user's code (parsing DAGs, operators, workers and callbacks). We also plan to make sure that none of the "community" operators use "non-blessed" DB calls (we can do it in our CI). So at the end of the exercise, all operators, hooks, etc. from the community will be guaranteed to only use the DB APIs that are available in the "DB-API" module. But there I do not expect much of any performance penalty, as those are very fast and rare operations (and the good thing there is that we can cache the results in workers/DAG processing).
> >>> >
> >>> > J.
> >>> >
> >>> > On Thu, Dec 2, 2021 at 7:16 PM Andrew Godwin <[email protected]> wrote:
> >>> >
> >>> > Ah, my bad, I missed that. I'd still like to see discussion of the performance impacts, though.
> >>> >
> >>> > On Thu, Dec 2, 2021 at 11:14 AM Ash Berlin-Taylor <[email protected]> wrote:
> >>> >
> >>> > The scheduler was excluded from the components that would use the DB-API - the mini scheduler is the odd one out here, as it (currently) runs on the worker but shares much of the code from the scheduling path.
> >>> >
> >>> > -a
> >>> >
> >>> > On 2 December 2021 17:56:40 GMT, Andrew Godwin <[email protected]> wrote:
> >>> >
> >>> > I would also like to see some discussion in this AIP about how the data is going to be serialised to and from the database instances (obviously Connexion is involved, but I presume more transformation code is needed than that) and the potential slowdown this would cause.
> >>> > In my experience, a somewhat direct ORM mapping like this is going to result in considerably slower times for any complex operation that's touching a few hundred rows. Is there a reason this is being proposed for the scheduler code, too? In my mind, the best approach to multitenancy would be to remove all user-supplied code from the scheduler and leave it with direct DB access, rather than trying to indirect all scheduler access through another API layer.
> >>> >
> >>> > Andrew
> >>> >
> >>> > On Thu, Dec 2, 2021 at 10:29 AM Jarek Potiuk <[email protected]> wrote:
> >>> >
> >>> > Yeah - I think, Ash, you are completely right that we need some more "detailed" clarification. I believe I know what you are - rightfully - afraid of (re impact on the code), and maybe we have not done a good job of explaining some of the assumptions we had when we worked on it with Mateusz. Simply, it was not clear that our aim is to absolutely minimise the impact on the "internal DB transactions" done in schedulers and workers. The idea is that the change will at most result in moving execution of the transactions to another process, but not in changing what the DB transactions do internally. Actually, this was one of the reasons for the "alternative" approach we discussed (you can see it in the document) - hijacking the "sqlalchemy session". That is far too low-level, and the aim of the "DB-API" is NOT to replace direct DB calls (hence we need to figure out a better name). The API is there to provide a "scheduler logic" API and "REST access to Airflow primitives like dags/tasks/variables/connections", etc. As an example (which we briefly talked about in Slack), the "_run_mini_scheduler_on_child_tasks" case (https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L225-L274) is one that we would put in the doc.
> >>> > As we thought of it, this is a "single DB-API operation". Those are not pure REST calls, of course; they are more RPC-like calls. That is why even initially I thought of separating the API completely. But since there are a lot of common "primitive" calls that we can re-use, I think having a separate DB-API component which will re-use the Connexion implementation, replacing authentication with the custom worker <> DB-API authentication, is the way to go. And yes, if we agree on the general idea, we need to choose the best way to "connect" the REST API we have with the RPC-kind of API we need for some cases in workers. But we wanted to make sure we are on the same page with the direction. And yes, it means that the DB-API will potentially have to handle quite a number of DB operations (and that it has to be replicable and scalable as well) - but the DB-API will be "stateless", similarly to the webserver, so it will be scalable by definition. And yes, performance tests will be part of the POC - likely even before we finally ask for votes. So in short:
> >>> >
> >>> > * no modification or impact on current scheduler behaviour when DB isolation is disabled;
> >>> > * only higher-level methods will be moved out to the DB-API, and we will reuse existing "REST" APIs where it makes sense;
> >>> > * we aim to have "0" changes to the logic of processing - both in DAG processing logic and in the DB-API.
> >>> >
> >>> > We think that with the architecture we proposed it's perfectly doable. I hope this clarifies things a bit. Once we agree on the general direction, we will definitely work on adding more details and clarification (we actually already have a lot of that, but we just wanted to start with explaining the idea and go into more detail later, once we are sure there are no "high-level" blockers from the community).
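The split described above - reusing REST-style endpoints for common "primitives" while adding RPC-like endpoints for coarse scheduler logic - could be sketched as a simple routing table. All method names and URL paths here are invented for illustration; the AIP does not fix them.

```python
# Hypothetical routing table: which endpoint style serves which call.
# Method names and paths are illustrative only, not from the AIP.
REST_METHODS = {"get_variable", "get_connection", "get_dag_run"}
RPC_METHODS = {"run_mini_scheduler_on_child_tasks", "detect_zombies"}

def route(method_name: str) -> str:
    """Map a logical method to the endpoint style that would serve it."""
    if method_name in REST_METHODS:
        # Fine-grained primitives: reuse the existing REST-style API.
        return f"GET /api/v1/{method_name}"
    if method_name in RPC_METHODS:
        # Coarse scheduler logic: RPC-style call on the same component.
        return f"POST /internal_api/v1/rpc/{method_name}"
    raise ValueError(f"{method_name} is not exposed by the internal API")
```

Both kinds of endpoints would live on the same stateless component, which is what makes it horizontally scalable "by definition", as argued above.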
> >>> > J.
> >>> >
> >>> > On Thu, Dec 2, 2021 at 4:46 PM Ash Berlin-Taylor <[email protected]> wrote:
> >>> > >
> >>> > > > I just provided a general idea for the approach - but if you want me to put more examples then I am happy to do that
> >>> > >
> >>> > > Yes please. It is too general for me, and I can't work out what effect it would actually have on the code base, especially how it would look with the config option to enable/disable direct DB access.
> >>> > >
> >>> > > -ash
> >>> > >
> >>> > > On Thu, Dec 2 2021 at 16:36:57 +0100, Mateusz Henc <[email protected]> wrote:
> >>> > >
> >>> > > Hi,
> >>> > > I am sorry if it is not clear enough; let me try to explain it here, so maybe it sheds more light on the idea. See my comments below.
> >>> > >
> >>> > > On Thu, Dec 2, 2021 at 3:39 PM Ash Berlin-Taylor <[email protected]> wrote:
> >>> > >>
> >>> > >> I'm sorry to say it, but this proposal just doesn't contain enough detail to say what the actual changes to the code would be, and what the impact would be.
> >>> > >>
> >>> > >> To take the one example you have so far:
> >>> > >>
> >>> > >>     def get_dag_run(self, dag_id, execution_date):
> >>> > >>         return self.db_client.get_dag_run(dag_id, execution_date)
> >>> > >>
> >>> > >> So from this snippet I'm guessing it would be used like this:
> >>> > >>
> >>> > >>     dag_run = db_client.get_dag_run(dag_id, execution_date)
> >>> > >>
> >>> > >> What type of object is returned?
> >>> > >
> >>> > > As it replaces:
> >>> > >
> >>> > >     dag_run = session.query(DagRun)
> >>> > >         .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
> >>> > >         .first()
> >>> > >
> >>> > > then the type of the object will be exactly the same (DagRun).
> >>> > >
> >>> > >> Do we need one API method per individual query we have in the source?
> >>> > >
> >>> > > No, as explained by the sentence: "The method may be extended, accepting more optional parameters to avoid having too many similar implementations."
> >>> > >
> >>> > >> Which components would use this new mode when it's enabled?
> >>> > >
> >>> > > You may read: "Airflow Database API is a new independent component of Airflow. It allows isolating some components (Worker, DagProcessor and Triggerer) from direct access to the DB."
> >>> > >
> >>> > >> But what you haven't said the first thing about is what _other_ changes would be needed in the code. To take a fairly simple example:
> >>> > >>
> >>> > >>     dag_run = db_client.get_dag_run(dag_id, execution_date)
> >>> > >>     dag_run.queued_at = timezone.now()
> >>> > >>     # How do I save this?
> >>> > >>
> >>> > >> In short, you need to put a lot more detail into this before we can even have an idea of the full scope of the change this proposal would involve, and what code changes would be needed for components to work with and without this setting enabled.
> >>> > >
> >>> > > For this particular example, it depends on the intention of the code author:
> >>> > > - If this should be in a transaction, then I would actually introduce a new method like enqueue_dag_run(...) that would run these two steps on the Airflow DB API side.
> >>> > > - If not, then maybe just an "update_dag_run" method accepting the whole "dag_run" object and saving it to the DB.
> >>> > >
> >>> > > In general, we could take a naive approach, e.g. replace the code:
> >>> > >
> >>> > >     dag_run = session.query(DagRun)
> >>> > >         .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
> >>> > >         .first()
> >>> > >
> >>> > > with:
> >>> > >
> >>> > >     if self.db_isolation:
> >>> > >         dag_run = db_client.get_dag_run(dag_id, execution_date)
> >>> > >     else:
> >>> > >         dag_run = session.query(DagRun)
> >>> > >             .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
> >>> > >             .first()
> >>> > >
> >>> > > The problem is that the Airflow DB API would need to have the same implementation for the query - so, duplicated code. That's why we propose moving this code to the DBClient, which is also used by the Airflow DB API (in DB direct mode).
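The shared-DBClient idea described above can be sketched as follows. The class names echo the thread (`DBClient`, the `update_dag_run` suggestion), but everything else - the list-of-dicts "store" standing in for a SQLAlchemy session, the server class name - is invented to keep the sketch self-contained.

```python
# Sketch of the shared-DBClient idea: the query logic lives in exactly
# one place, so call sites never duplicate it behind an if/else.
# The "store" is a plain list standing in for a SQLAlchemy session.

class DBClient:
    """Owns the query logic. Used directly in DB-direct mode, and
    reused server-side by the Airflow DB API in isolation mode."""
    def __init__(self, store):
        self.store = store
    def get_dag_run(self, dag_id, execution_date):
        return next((row for row in self.store
                     if row["dag_id"] == dag_id
                     and row["execution_date"] == execution_date), None)
    def update_dag_run(self, dag_run):
        # Whole-object save, as suggested for the queued_at example.
        for i, row in enumerate(self.store):
            if row["dag_id"] == dag_run["dag_id"]:
                self.store[i] = dag_run

class AirflowDBAPIServer:
    """In isolation mode the server wraps the very same DBClient, so
    the query exists once rather than being duplicated per mode."""
    def __init__(self, store):
        self.db = DBClient(store)
    def get_dag_run(self, dag_id, execution_date):
        return self.db.get_dag_run(dag_id, execution_date)
```

Either path (direct `DBClient` use, or the server wrapping it) returns the same object for the same query, which is the "no duplicated implementation" property the thread is after.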
> >>> > >
> >>> > > I know there are many places where the code is much more complicated than a single query, but they must be handled one-by-one during the implementation; otherwise this AIP would be way too big.
> >>> > >
> >>> > > I just provided a general idea for the approach - but if you want me to put more examples then I am happy to do that.
> >>> > >
> >>> > > Best regards,
> >>> > > Mateusz Henc
> >>> > >
> >>> > >> On Thu, Dec 2 2021 at 14:23:56 +0100, Mateusz Henc <[email protected]> wrote:
> >>> > >>
> >>> > >> Hi,
> >>> > >> I just added a new AIP for running some Airflow components in DB-isolation mode, without direct access to the Airflow database; they will use a new API for this purpose.
> >>> > >>
> >>> > >> PTAL:
> >>> > >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Database+API
> >>> > >>
> >>> > >> Open question:
> >>> > >> I called it "Airflow Database API" - however, I feel it could be more than just an access layer for the database. So if you have a better name, please let me know; I am happy to change it.
> >>> > >>
> >>> > >> Best regards,
> >>> > >> Mateusz Henc
