Cool. Thanks for the guidance.
On Fri, Dec 3, 2021 at 6:37 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> - Make an inventory: it doesn't need to be exhaustive, but a
>   representative sample.
> - More clearly define _what_ the API calls return -- object type, methods
>   on them, etc.
>
> From the AIP you have this example:
>
>     def get_dag_run(self, dag_id, execution_date):
>         return self.db_client.get_dag_run(dag_id, execution_date)
>
> What does that _actually_ return? What capabilities does it have?
>
> (I have other thoughts, but those are less fundamental and can be
> discussed later.)
>
> -ash
>
> On Fri, Dec 3 2021 at 18:20:21 +0100, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Surely - if you think that we need to do some more work to gain
> confidence, that's fine. I am sure we can improve it to the level where we
> will not have to do full performance tests and you are confident in the
> direction. Just to clarify your concerns and make sure we are on the same
> page - as I understand it, we should:
>
> * make an inventory of the "actual changes" the proposal will involve in
>   the low-level database code of Airflow
> * based on that, assess whether those changes are likely (or unlikely) to
>   have a performance impact
> * if we assess that an impact is likely, run at least rudimentary
>   performance tests to prove that it is manageable
>
> I think that might be a good exercise to do. Does that sound about right?
> Or do you have any concerns about certain architectural decisions taken?
> No problem with Friday, but if we get answers today, I think it will give
> us time to think about it over the weekend and address it next week.
>
> J.
>
> On Fri, Dec 3, 2021 at 5:56 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> This is a fundamental change to the architecture, with significant
> possible impacts on performance, and it likely requires touching a large
> portion of the code base. Sorry, but you're going to have to expand on the
> details first and work out what would actually be involved and what the
> impacts would be. Right now I have serious reservations about this
> approach, so I can't agree to the high-level proposal without an actual
> proposal. (The current document is, at best, an outline, not an actual
> proposal.)
>
> Sorry to be a grinch right before the weekend.
>
> Ash
>
> On Thu, Dec 2 2021 at 22:47:34 +0100, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Oh yeah - good point, and we spoke about performance
> testing/implications. Performance is something we were going to discuss as
> the next step, once we get a general "OK" on the direction - we just want
> to make sure that there are no "huge" blockers in the way this is
> proposed, and to clear up any doubts first, so that the investment in the
> performance part makes sense. We do not want to spend a lot of time
> getting the tests done and making a detailed inventory of methods/API
> calls, only to find out that this is generally a "bad direction".
>
> Just to clarify again - we also considered an alternative option: to
> automatically map all the DB methods to remote calls. But we dropped that
> idea, precisely for reasons of performance and transaction integrity. So
> we are NOT mapping DB calls onto API calls; those will be "logical
> operations" on the database. Generally speaking, most of the API calls in
> the "Airflow system-level but executed in the worker" category will be
> "coarse" rather than fine-grained. For example, the aforementioned "mini
> scheduler" - where we want to make a single API call and run the whole of
> it on the DB-API side.
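[For illustration, a minimal sketch of the coarse-vs-fine-grained split
Jarek describes above. All names here (DbApiClient, the /rpc/... route, the
/variables/... route) are hypothetical, not taken from the AIP:

    import requests

    # Hypothetical worker-side client (names illustrative, not from the
    # AIP). Note the contrast: one coarse RPC-style "logical operation"
    # versus one fine-grained primitive lookup.
    class DbApiClient:
        def __init__(self, base_url: str):
            self.base_url = base_url
            self.http = requests.Session()

        def run_mini_scheduler(self, dag_id: str, run_id: str) -> None:
            # Coarse call: a single HTTP round-trip; the whole
            # mini-scheduler transaction executes on the DB-API server side.
            self.http.post(
                f"{self.base_url}/rpc/run_mini_scheduler",
                json={"dag_id": dag_id, "run_id": run_id},
            ).raise_for_status()

        def get_variable(self, key: str) -> str:
            # Fine-grained call: maps onto an existing REST-style resource.
            resp = self.http.get(f"{self.base_url}/variables/{key}")
            resp.raise_for_status()
            return resp.json()["value"]

The point of the coarse call is that the many queries inside the mini
scheduler never cross the network individually - only the one request does.]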
> So there - the performance impact is very limited, IMHO. And if we see any
> other "logic" like that in other parts of the code (zombie detection as an
> example), we plan to make a detailed inventory of those once we get a
> general "looks good" for the direction. For now we did some "rough"
> checking, and it seems a plausible approach and quite doable.
>
> One more note - the "fine-grained" calls (variable update/retrieval,
> connection update/retrieval) via the REST API will still be used by the
> user's code, though (parsing DAGs, operators, workers and callbacks). We
> also plan to make sure that none of the "community" operators are using
> "non-blessed" DB calls (we can do it in our CI). So at the end of the
> exercise, all operators, hooks, etc. from the community will be guaranteed
> to only use the DB APIs that are available in the "DB API" module. But
> there I do not expect much of any performance penalty, as those are very
> fast and rare operations (and the good thing there is that we can cache
> the results of those in workers/DAG processing - a sketch of this appears
> below).
>
> J.
>
> On Thu, Dec 2, 2021 at 7:16 PM Andrew Godwin
> <andrew.god...@astronomer.io.invalid> wrote:
>
> Ah, my bad, I missed that. I'd still like to see discussion of the
> performance impacts, though.
>
> On Thu, Dec 2, 2021 at 11:14 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> The scheduler was excluded from the components that would use the dbapi -
> the mini scheduler is the odd one out here, as it (currently) runs on the
> worker but shares much of the code from the scheduling path.
>
> -a
>
> On 2 December 2021 17:56:40 GMT, Andrew Godwin
> <andrew.god...@astronomer.io.INVALID> wrote:
>
> I would also like to see some discussion in this AIP about how the data is
> going to be serialised to and from the database instances (obviously
> Connexion is involved, but I presume more transformation code is needed
> than that) and the potential slowdown this would cause. In my experience,
> a somewhat direct ORM mapping like this is going to result in considerably
> slower times for any complex operation that's touching a few hundred rows.
>
> Is there a reason this is being proposed for the scheduler code, too? In
> my mind, the best approach to multitenancy would be to remove all
> user-supplied code from the scheduler and leave it with direct DB access,
> rather than trying to indirect all scheduler access through another API
> layer.
>
> Andrew
>
> On Thu, Dec 2, 2021 at 10:29 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Yeah - I think, Ash, you are completely right that we need some more
> "detailed" clarification. I believe I know what you are - rightfully -
> afraid of (re impact on the code), and maybe we have not done a good job
> of explaining it, with some of the assumptions we had when we worked on it
> with Mateusz. Simply put, it was not clear that our aim is to absolutely
> minimise the impact on the "internal DB transactions" done in schedulers
> and workers. The idea is that the change will at most result in moving the
> execution of the transactions to another process, but not changing what
> the DB transactions do internally. Actually, this was one of the reasons
> for the "alternative" approach (you can see it in the document) we
> discussed - hijacking the "sqlalchemy session" - but that is far too
> low-level, and the aim of the "DB-API" is NOT to replace direct DB calls
> (hence we need to figure out a better name). The API is there to provide a
> "scheduler logic" API and "REST access to Airflow primitives like
> dags/tasks/variables/connections", etc.
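[The worker-side caching of fine-grained calls mentioned earlier could be
as simple as memoising lookups per worker process. A minimal sketch - the
DbApiClient and its get_variable helper are hypothetical carryovers from
the earlier sketch, and the endpoint URL is assumed:

    import functools

    # Hypothetical sketch: memoise fine-grained Variable lookups for the
    # lifetime of a worker process, so repeated reads of the same key cost
    # one HTTP round-trip in total instead of one per read.
    db_api_client = DbApiClient("http://db-api.internal:9080")  # assumed

    @functools.lru_cache(maxsize=1024)
    def get_variable_cached(key: str) -> str:
        return db_api_client.get_variable(key)

The trade-off is staleness: a cached value does not reflect Variable
updates made after the first read within that process.]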
> As an example (which we briefly talked about on Slack), take the
> "_run_mini_scheduler_on_child_tasks" case
> (https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L225-L274)
> - one that we would put in the doc. As we thought of it, this is a single
> "DB-API operation". Those are not pure REST calls, of course; they are
> more RPC-like calls. That is why initially I even thought of separating
> the API completely. But since there are a lot of common "primitive" calls
> that we can re-use, I think having a separate DB-API component which will
> re-use the Connexion implementation, replacing authentication with the
> custom worker <> DB-API authentication, is the way to go. And yes, if we
> agree on the general idea, we need to choose the best way to "connect" the
> REST API we have with the RPC-kind of API we need for some cases in
> workers. But we wanted to make sure we are on the same page with the
> direction. And yes, it means that the DB-API will potentially have to
> handle quite a number of DB operations (and that it has to be replicable
> and scalable as well) - but the DB-API will be "stateless", similarly to
> the webserver, so it will be scalable by definition. And yes, performance
> tests will be part of the POC - likely even before we finally ask for
> votes.
>
> So in short:
> * no modification or impact on current scheduler behaviour when DB
>   isolation is disabled
> * only higher-level methods will be moved out to the DB-API, and we will
>   reuse the existing "REST" APIs where it makes sense
> * we aim to have "0" changes to the logic of processing - both in the DAG
>   processing logic and the DB API; we think that with the architecture we
>   proposed this is perfectly doable
>
> I hope this clarifies things a bit. Once we agree on the general
> direction, we will definitely work on adding more details and
> clarification (we actually already have a lot of that, but we wanted to
> start with explaining the idea and go into more detail later, when we are
> sure there are no "high-level" blockers from the community).
>
> J.
>
> On Thu, Dec 2, 2021 at 4:46 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> > I just provided a general idea for the approach - but if you want me to
> > put more examples then I am happy to do that
>
> Yes please. It is too general for me, and I can't work out what effect it
> would actually have on the code base, especially how it would look with
> the config option to enable/disable direct DB access.
>
> -ash
>
> On Thu, Dec 2 2021 at 16:36:57 +0100, Mateusz Henc
> <mh...@google.com.INVALID> wrote:
>
> Hi,
> I am sorry if it is not clear enough; let me try to explain it here, so
> maybe it sheds more light on the idea. See my comments below.
>
> On Thu, Dec 2, 2021 at 3:39 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>>
>> I'm sorry to say it, but this proposal just doesn't contain enough detail
>> to say what the actual changes to the code would be, and what the impact
>> would be.
>>
>> To take the one example you have so far:
>>
>>     def get_dag_run(self, dag_id, execution_date):
>>         return self.db_client.get_dag_run(dag_id, execution_date)
>>
>> So from this snippet I'm guessing it would be used like this:
>>
>>     dag_run = db_client.get_dag_run(dag_id, execution_date)
>>
>> What type of object is returned?
>
> As it replaces:
>
>     dag_run = session.query(DagRun)
>         .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
>         .first()
>
> then the type of the object will be exactly the same (DagRun).
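[For the answer above to hold, the DB-API server would have to execute the
real query and serialise the result, with the client rebuilding a DagRun
from the payload so callers see the same type in both modes. A minimal
sketch of the server side, assuming a hypothetical handler name and
hand-rolled field-by-field serialisation - this is exactly the
transformation code Andrew asks about:

    # Hypothetical DB-API server-side handler (not from the AIP): runs the
    # real query and returns a JSON-serialisable dict; the client side
    # would rebuild a DagRun object from this payload.
    from airflow.models import DagRun

    def get_dag_run_handler(session, dag_id, execution_date):
        dag_run = (
            session.query(DagRun)
            .filter(
                DagRun.dag_id == dag_id,
                DagRun.execution_date == execution_date,
            )
            .first()
        )
        # Hand-rolled serialisation, for illustration only - a real
        # implementation would need a schema per returned model.
        return {
            "dag_id": dag_run.dag_id,
            "run_id": dag_run.run_id,
            "state": dag_run.state,
            "queued_at": dag_run.queued_at.isoformat()
            if dag_run.queued_at else None,
        }
]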
>> Do we need one API method per individual query we have in the source?
>
> No, as explained by this sentence:
>
> > The method may be extended, accepting more optional parameters, to
> > avoid having too many similar implementations.
>
>> Which components would use this new mode when it's enabled?
>
> You may read:
>
> > Airflow Database API is a new independent component of Airflow. It
> > allows isolating some components (Worker, DagProcessor and Triggerer)
> > from direct access to the DB.
>
>> But what you haven't said the first thing about is what _other_ changes
>> would be needed in the code. To take a fairly simple example:
>>
>>     dag_run = db_client.get_dag_run(dag_id, execution_date)
>>     dag_run.queued_at = timezone.now()
>>     # How do I save this?
>>
>> In short, you need to put a lot more detail into this before we can even
>> have an idea of the full scope of the change this proposal would
>> involve, and what code changes would be needed for components to work
>> with and without this setting enabled.
>
> For this particular example - it depends on the intention of the code
> author:
> - If this should be in a transaction - then I would actually introduce a
>   new method like enqueue_dag_run(...) that would run these two steps on
>   the Airflow DB API side.
> - If not, then maybe just an "update_dag_run" method, accepting the whole
>   "dag_run" object and saving it to the DB.
>
> In general - we could take a naive approach, e.g. replace this code:
>
>     dag_run = session.query(DagRun)
>         .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
>         .first()
>
> with:
>
>     if self.db_isolation:
>         dag_run = db_client.get_dag_run(dag_id, execution_date)
>     else:
>         dag_run = session.query(DagRun)
>             .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
>             .first()
>
> The problem is that the Airflow DB API would then need to have the same
> implementation for the query - so, duplicated code. That's why we propose
> moving this code to the DBClient, which is also used by the Airflow DB
> API (in DB-direct mode) - see the sketch at the end of this message.
>
> I know there are many places where the code is much more complicated than
> a single query, but they must be handled one-by-one during the
> implementation, otherwise this AIP would be way too big.
>
> I just provided a general idea for the approach - but if you want me to
> put more examples then I am happy to do that.
>
> Best regards,
> Mateusz Henc
>
>> On Thu, Dec 2 2021 at 14:23:56 +0100, Mateusz Henc
>> <mh...@google.com.INVALID> wrote:
>>
>> Hi,
>> I just added a new AIP for running some Airflow components in
>> DB-isolation mode, without direct access to the Airflow database - they
>> will use a new API for this purpose.
>>
>> PTAL:
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Database+API
>>
>> Open question:
>> I called it "Airflow Database API" - however, I feel it could be more
>> than just an access layer for the database. So if you have a better
>> name, please let me know - I am happy to change it.
>>
>> Best regards,
>> Mateusz Henc
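[A minimal sketch of the shared DBClient layering Mateusz describes above -
class and method names are hypothetical, following the thread rather than
the AIP text. The query logic lives in one place: workers call DBClient
directly when DB isolation is off, while the Airflow DB API server wraps
the same DBClient when isolation is on, so nothing is implemented twice:

    from airflow.models import DagRun
    from airflow.utils import timezone

    # Hypothetical sketch (names illustrative, not from the AIP).
    class DBClient:
        def __init__(self, session):
            self.session = session

        def get_dag_run(self, dag_id, execution_date):
            # The single query implementation shared by both modes.
            return (
                self.session.query(DagRun)
                .filter(
                    DagRun.dag_id == dag_id,
                    DagRun.execution_date == execution_date,
                )
                .first()
            )

        def enqueue_dag_run(self, dag_id, execution_date):
            # Ash's two-step example as one coarse, transactional
            # operation, executed server-side in isolation mode.
            dag_run = self.get_dag_run(dag_id, execution_date)
            dag_run.queued_at = timezone.utcnow()
            self.session.commit()
]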