- Make an inventory: it doesn't need to be exhaustive, but it should be a
representative sample.
- More clearly define _what_ the API calls return -- object type,
methods on them etc.
From the AIP you have this example:
    def get_dag_run(self, dag_id, execution_date):
        return self.db_client.get_dag_run(dag_id, execution_date)
What does that _actually_ return? What capabilities does it have?
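For example: is it a live SQLAlchemy DagRun bound to a session, or a
detached snapshot? A purely hypothetical sketch of the kind of definition
I'm asking for (names made up):

    # Hypothetical illustration only - not a proposed interface.
    # A detached, serialisable snapshot rather than a live ORM object:
    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class DagRunSnapshot:
        dag_id: str
        run_id: str
        execution_date: datetime
        state: str
        queued_at: Optional[datetime] = None
        # No session, no lazy-loaded relationships; any mutation has
        # to go back through an explicit API call.

That is the sort of contract the AIP needs to spell out for each call.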
(I have other thoughts but those are less fundamental and can be
discussed later)
-ash
On Fri, Dec 3 2021 at 18:20:21 +0100, Jarek Potiuk <ja...@potiuk.com>
wrote:
Sure - if you think we need to do some more work to build confidence,
that's fine. I am sure we can improve it to the level where we will not
need full performance tests and you will be confident in the direction.
Just to clarify your concerns and make sure we are on the same page -
as I understand it, we should:
* make an inventory of the "actual changes" the proposal will involve
in the low-level database code of Airflow
* based on that, assess whether those changes are likely (or unlikely)
to have a performance impact
* if we assess that an impact is likely, run at least rudimentary
performance tests to prove that it is manageable
I think that might be a good exercise to do. Does that sound about
right? Or do you have any concerns about certain architectural
decisions we have taken?
No problem with Friday, but if we get answers today, I think it will
give us time to think about it over the weekend and address it next
week.
J.
On Fri, Dec 3, 2021 at 5:56 PM Ash Berlin-Taylor <a...@apache.org> wrote:
This is a fundamental change to the architecture with significant
possible impacts on performance, and likely requires touching a
large portion of the code base.
Sorry, you're going to have to expand on the details first and work
out what would actually be involved and what the impacts would be.
Right now I have serious reservations about this approach, so I can't
agree on the high-level proposal without an actual proposal. (The
current document is, at best, an outline, not an actual proposal.)
Sorry to be a grinch right before the weekend.
Ash
On Thu, Dec 2 2021 at 22:47:34 +0100, Jarek Potiuk <ja...@potiuk.com> wrote:
Oh yeah - good point, and we spoke about performance
testing/implications. Performance is something we were discussing as
the next step, once we get a general "OK" on the direction - we just
want to make sure that there are no "huge" blockers in the way this is
proposed, and to explain any doubts first, so that the investment in
the performance part makes sense. We do not want to spend a lot of
time on getting the tests done and on a detailed inventory of
methods/API calls - only to find out that this is generally a "bad
direction".

Just to clarify again - we also considered an alternative option:
automatically mapping all the DB methods to remote calls. But we
dropped that idea, precisely for reasons of performance and
transaction integrity. So we are NOT mapping DB calls one-to-one into
API calls; those will be "logical operations" on the database.
Generally speaking, most of the API calls for the "airflow
system-level but executed in worker" code will be coarse- rather than
fine-grained. For example, the aforementioned "mini scheduler" - there
we want to make a single API call and run the whole of it on the
DB-API side, so the performance impact is very limited IMHO. We plan
to do the same wherever we see similar "logic" in other parts of the
code (zombie detection, for example), and to make a detailed inventory
of those once we get a general "looks good" for the direction. For now
we did some rough checking and the approach seems plausible and quite
doable.
One more note - the fine-grained calls ("variable" update/retrieval,
"connection" update/retrieval) via the REST API will still be used by
the user's code (parsing DAGs, operators, workers and callbacks). We
also plan to make sure that none of the "community" operators use
"non-blessed" DB calls (we can check that in our CI). So at the end of
the exercise, all operators, hooks, etc. from the community will be
guaranteed to use only the DB APIs that are available in the "DB API"
module. But there I do not expect much of a performance penalty, as
those are very fast and rare operations (and the good thing is that we
can cache the results of those in workers/DAG processing).
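As a sketch of what such worker-side caching could look like (a
hypothetical helper, for illustration only):

    import time

    # Hypothetical cache for rare, fine-grained lookups such as
    # variables/connections fetched over the REST API:
    class CachedLookup:
        def __init__(self, fetch, ttl_seconds=60):
            self._fetch = fetch      # e.g. rest_client.get_variable
            self._ttl = ttl_seconds
            self._cache = {}         # key -> (expires_at, value)

        def get(self, key):
            entry = self._cache.get(key)
            if entry and entry[0] > time.monotonic():
                return entry[1]
            value = self._fetch(key)
            self._cache[key] = (time.monotonic() + self._ttl, value)
            return value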
J.

On Thu, Dec 2, 2021 at 7:16 PM Andrew Godwin
<andrew.god...@astronomer.io.invalid> wrote:
Ah, my bad, I missed that. I'd still like to see discussion of the
performance impacts, though.

On Thu, Dec 2, 2021 at 11:14 AM Ash Berlin-Taylor <a...@apache.org> wrote:
The scheduler was excluded from the components that would use the
DB-API - the mini scheduler is the odd one out here, as it (currently)
runs on the worker but shares much of the code from the scheduling
path.

-a

On 2 December 2021 17:56:40 GMT, Andrew Godwin
<andrew.god...@astronomer.io.INVALID> wrote:
I would also like to see some discussion in this AIP about how the
data is going to be serialised to and from the database instances
(obviously Connexion is involved, but I presume more transformation
code is needed than that) and the potential slowdown this would cause.
In my experience, a somewhat direct ORM mapping like this is going to
result in considerably slower times for any complex operation that's
touching a few hundred rows.

Is there a reason this is being proposed for the scheduler code, too?
In my mind, the best approach to multitenancy would be to remove all
user-supplied code from the scheduler and leave it with direct DB
access, rather than trying to indirect all scheduler access through
another API layer.

Andrew

On Thu, Dec 2, 2021 at 10:29 AM Jarek Potiuk <ja...@potiuk.com> wrote:
Yeah - I think, Ash, you are completely right that we need some more
"detailed" clarification. I believe I know what you are - rightfully -
afraid of (re impact on the code), and maybe we have not done a good
job of explaining some of the assumptions we had when we worked on it
with Mateusz. Simply put: our aim is to absolutely minimise the impact
on the "internal DB transactions" done in schedulers and workers. The
idea is that the change will at most move the execution of a
transaction to another process, not change what the DB transaction
does internally.

Actually, this was one of the reasons for the "alternative" approach
we discussed (you can see it in the document) - hijacking the
"sqlalchemy session" - but that is far too low-level, and the aim of
the "DB-API" is NOT to replace direct DB calls (hence we need to
figure out a better name). The API is there to provide a "scheduler
logic" API and "REST access to Airflow primitives like
dags/tasks/variables/connections" etc.

As an example (which we briefly talked about in slack), the
"_run_mini_scheduler_on_child_tasks" case
(https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L225-L274)
is one we would put in the doc. As we thought of it, this is a single
DB-API operation. Those are not pure REST calls, of course; they are
more RPC-like calls. That is why initially I even thought of
separating the API completely. But since there are a lot of common
"primitive" calls that we can re-use, I think the way to go is a
separate DB-API component which re-uses the connexion implementation,
replacing authentication with the custom worker <> DB-API
authentication. And yes, if we agree on the general idea, we need to
choose the best way to "connect" the REST API we have with the
RPC-kind of API we need for some cases in workers. But we wanted to
make sure we are on the same page with the direction first.

And yes, it means that the DB-API will potentially have to handle
quite a number of DB operations (and that it has to be replicable and
scalable as well) - but the DB-API will be "stateless", similarly to
the webserver, so it will be scalable by definition. And yes,
performance tests will be part of the POC - likely even before we
finally ask for votes.

So in short:
* no modification or impact on current scheduler behaviour when DB
isolation is disabled
* only higher-level methods will be moved out to the DB-API, and we
will reuse existing "REST" APIs where it makes sense
* we aim to have "0" changes to the logic of processing - both in DAG
processing logic and in the DB API; we think that with the
architecture we proposed this is perfectly doable

I hope this clarifies things a bit. Once we agree on the general
direction, we will definitely work on adding more details and
clarification (we actually already have a lot of that, but we wanted
to start by explaining the idea and go into more detail later, once we
are sure there are no "high-level" blockers from the community).

J,

On Thu, Dec 2, 2021 at 4:46 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> > I just provided a general idea for the approach - but if you want
> > me to put more examples then I am happy to do that
>
> Yes please.
>
> It is too general for me and I can't work out what effect it would
> actually have on the code base, especially how it would look with the
> config option to enable/disable direct DB access.
>
> -ash
>
> On Thu, Dec 2 2021 at 16:36:57 +0100, Mateusz Henc
> <mh...@google.com.INVALID> wrote:
>
> Hi,
> I am sorry if it is not clear enough, let me try to explain it here,
> so maybe it sheds more light on the idea. See my comments below.
>
> On Thu, Dec 2, 2021 at 3:39 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>>
>> I'm sorry to say it, but this proposal just doesn't contain enough
>> detail to say what the actual changes to the code would be, and what
>> the impact would be.
>>
>> To take the one example you have so far:
>>
>>     def get_dag_run(self, dag_id, execution_date):
>>         return self.db_client.get_dag_run(dag_id, execution_date)
>>
>> So from this snippet I'm guessing it would be used like this:
>>
>>     dag_run = db_client.get_dag_run(dag_id, execution_date)
>>
>> What type of object is returned?
>
> As it replaces:
>
>     dag_run = session.query(DagRun)
>         .filter(DagRun.dag_id == dag_id,
>                 DagRun.execution_date == execution_date)
>         .first()
>
> then the type of the object will be exactly the same (DagRun).
>
>> Do we need one API method per individual query we have in the source?
>
> No, as explained by the sentence:
>
> The method may be extended, accepting more optional parameters to
> avoid having too many similar implementations.
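>
> For illustration, such an extended method could look roughly like
> this (a sketch, not a final signature):
>
>     # One method covering several similar queries via optional
>     # filters, instead of one API method per query (illustrative):
>     def get_dag_run(self, dag_id, execution_date=None, run_id=None, state=None):
>         query = self.session.query(DagRun).filter(DagRun.dag_id == dag_id)
>         if execution_date is not None:
>             query = query.filter(DagRun.execution_date == execution_date)
>         if run_id is not None:
>             query = query.filter(DagRun.run_id == run_id)
>         if state is not None:
>             query = query.filter(DagRun.state == state)
>         return query.first()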
>> Which components would use this new mode when it's enabled?
>
> You may read:
>
> Airflow Database API is a new independent component of Airflow. It
> allows isolating some components (Worker, DagProcessor and Triggerer)
> from direct access to the DB.
>
>> But what you haven't said the first thing about is what _other_
>> changes would be needed in the code. To take a fairly simple example:
>>
>>     dag_run = db_client.get_dag_run(dag_id, execution_date)
>>     dag_run.queued_at = timezone.now()
>>     # How do I save this?
>>
>> In short, you need to put a lot more detail into this before we can
>> even have an idea of the full scope of the change this proposal
>> would involve, and what code changes would be needed for components
>> to work with and without this setting enabled.
>
> For this particular example - it depends on the intention of the code
> author:
> - If this should be in a transaction - then I would actually
> introduce a new method like enqueue_dag_run(...) that would run these
> two steps on the Airflow DB API side.
> - If not, then maybe just an "update_dag_run" method accepting the
> whole "dag_run" object and saving it to the DB.
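>
> As a rough sketch of both options (hypothetical names, not final
> signatures):
>
>     # Option 1: the intent is transactional - both steps become one
>     # logical call executed on the Airflow DB API side:
>     db_client.enqueue_dag_run(dag_id, execution_date)
>
>     # Option 2: plain read-modify-write - fetch, mutate, send back:
>     dag_run = db_client.get_dag_run(dag_id, execution_date)
>     dag_run.queued_at = timezone.now()
>     db_client.update_dag_run(dag_run)  # serialises and saves the whole object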
>
> In general - we could take a naive approach, e.g. replace code:
>
>     dag_run = session.query(DagRun)
>         .filter(DagRun.dag_id == dag_id,
>                 DagRun.execution_date == execution_date)
>         .first()
>
> with:
>
>     if self.db_isolation:
>         dag_run = db_client.get_dag_run(dag_id, execution_date)
>     else:
>         dag_run = session.query(DagRun)
>             .filter(DagRun.dag_id == dag_id,
>                     DagRun.execution_date == execution_date)
>             .first()
>
> The problem is that the Airflow DB API would need to have the same
> implementation for the query - so duplicated code. That's why we
> propose moving this code to the DBClient, which is also used by the
> Airflow DB API (in DB-direct mode).
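>
> A sketch of that structure (shapes are illustrative only):
>
>     # The query lives in one place - the DBClient:
>     class DBClient:
>         def get_dag_run(self, dag_id, execution_date):
>             return self.session.query(DagRun) \
>                 .filter(DagRun.dag_id == dag_id,
>                         DagRun.execution_date == execution_date) \
>                 .first()
>
>     # Components use DBClient directly in DB-direct mode; the Airflow
>     # DB API server uses the same DBClient to serve the remote call
>     # (serialize() stands in for whatever Connexion layer we use):
>     def get_dag_run_endpoint(dag_id, execution_date):
>         return serialize(db_client.get_dag_run(dag_id, execution_date))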
>
> I know there are many places where the code is much more complicated
> than a single query, but they must be handled one by one during the
> implementation, otherwise this AIP would be way too big.
>
> I just provided a general idea for the approach - but if you want me
> to put more examples then I am happy to do that.
>
> Best regards,
> Mateusz Henc
>
>> On Thu, Dec 2 2021 at 14:23:56 +0100, Mateusz Henc
>> <mh...@google.com.INVALID> wrote:
>>
>> Hi,
>> I just added a new AIP for running some Airflow components in
>> DB-isolation mode, without direct access to the Airflow Database -
>> they will use a new API for this purpose.
>>
>> PTAL:
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Database+API
>>
>> Open question:
>> I called it "Airflow Database API" - however, I feel it could be
>> more than just an access layer for the database. So if you have a
>> better name, please let me know, I am happy to change it.
>>
>> Best regards,
>> Mateusz Henc