Cool. Thanks for the guidance.
On Fri, Dec 3, 2021 at 6:37 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> - Make an inventory: it doesn't need to be exhaustive, but a
>   representative sample.
> - More clearly define _what_ the API calls return -- object type, methods
>   on them, etc.
>
> From the AIP you have this example:
>
>     def get_dag_run(self, dag_id, execution_date):
>         return self.db_client.get_dag_run(dag_id, execution_date)
>
> What does that _actually_ return? What capabilities does it have?
>
> (I have other thoughts, but those are less fundamental and can be
> discussed later.)
>
> -ash
>
> On Fri, Dec 3 2021 at 18:20:21 +0100, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Surely - if you think that we need to do some more work to gain
> confidence, that's fine. I am sure we can improve it to the level where we
> will not have to do full performance tests and you are confident in the
> direction. Just to clarify your concerns and make sure we are on the same
> page - as I understand it, we should:
>
> * make an inventory of the "actual changes" the proposal will involve in
>   the low-level database code of Airflow
> * based on that, assess whether those changes are likely (or unlikely) to
>   have a performance impact
> * if we assess that an impact is likely, run at least rudimentary
>   performance tests to prove that it is manageable
>
> I think that might be a good exercise to do. Does that sound about right?
> Or do you have any concerns about certain architectural decisions taken?
> No problem with Friday, but if we get answers today, I think it will give
> us time to think about it over the weekend and address it next week.
>
> J.
>
> On Fri, Dec 3, 2021 at 5:56 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> This is a fundamental change to the architecture, with significant
> possible impacts on performance, and it likely requires touching a large
> portion of the code base. Sorry, but you're going to have to expand on the
> details first and work out what would actually be involved and what the
> impacts would be. Right now I have serious reservations about this
> approach, so I can't agree to the high-level proposal without an actual
> proposal. (The current document is, at best, an outline, not an actual
> proposal.)
>
> Sorry to be a grinch right before the weekend.
>
> Ash
>
> On Thu, Dec 2 2021 at 22:47:34 +0100, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Oh yeah - good point, and we spoke about performance
> testing/implications. Performance is something we were going to discuss as
> the next step, once we get a general "OK" on the direction - we just want
> to make sure that there are no "huge" blockers in the way this is
> proposed, and to clear up any doubts first, so that the investment in the
> performance part makes sense. We do not want to spend a lot of time
> getting the tests done and making a detailed inventory of methods/API
> calls, only to find out that this is generally a "bad direction".
>
> Just to clarify again - we also considered an alternative option: to
> automatically map all the DB methods to remote calls. But we dropped that
> idea, precisely for reasons of performance and transaction integrity. So
> we are NOT mapping DB calls onto API calls; those will be "logical
> operations" on the database. Generally speaking, most of the API calls in
> the "Airflow system-level but executed in the worker" category will be
> "coarse" rather than fine-grained. For example, the aforementioned "mini
> scheduler" - where we want to make a single API call and run the whole of
> it on the DB-API side.
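[For illustration, a minimal sketch of the coarse-vs-fine-grained split
Jarek describes above. All names here (DbApiClient, the /rpc/... route, the
/variables/... route) are hypothetical, not taken from the AIP:

    import requests

    # Hypothetical worker-side client (names illustrative, not from the
    # AIP). Note the contrast: one coarse RPC-style "logical operation"
    # versus one fine-grained primitive lookup.
    class DbApiClient:
        def __init__(self, base_url: str):
            self.base_url = base_url
            self.http = requests.Session()

        def run_mini_scheduler(self, dag_id: str, run_id: str) -> None:
            # Coarse call: a single HTTP round-trip; the whole
            # mini-scheduler transaction executes on the DB-API server side.
            self.http.post(
                f"{self.base_url}/rpc/run_mini_scheduler",
                json={"dag_id": dag_id, "run_id": run_id},
            ).raise_for_status()

        def get_variable(self, key: str) -> str:
            # Fine-grained call: maps onto an existing REST-style resource.
            resp = self.http.get(f"{self.base_url}/variables/{key}")
            resp.raise_for_status()
            return resp.json()["value"]

The point of the coarse call is that the many queries inside the mini
scheduler never cross the network individually - only the one request does.]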
> So there - the performance impact is very limited, IMHO. And if we see any
> other "logic" like that in other parts of the code (zombie detection as an
> example), we plan to make a detailed inventory of those once we get a
> general "looks good" for the direction. For now we did some "rough"
> checking, and it seems a plausible approach and quite doable.
>
> One more note - the "fine-grained" calls (variable update/retrieval,
> connection update/retrieval) via the REST API will still be used by the
> user's code, though (parsing DAGs, operators, workers and callbacks). We
> also plan to make sure that none of the "community" operators are using
> "non-blessed" DB calls (we can do it in our CI). So at the end of the
> exercise, all operators, hooks, etc. from the community will be guaranteed
> to only use the DB APIs that are available in the "DB API" module. But
> there I do not expect much of any performance penalty, as those are very
> fast and rare operations (and the good thing there is that we can cache
> the results of those in workers/DAG processing - a sketch of this appears
> below).
>
> J.
>
> On Thu, Dec 2, 2021 at 7:16 PM Andrew Godwin
> <andrew.god...@astronomer.io.invalid> wrote:
>
> Ah, my bad, I missed that. I'd still like to see discussion of the
> performance impacts, though.
>
> On Thu, Dec 2, 2021 at 11:14 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> The scheduler was excluded from the components that would use the dbapi -
> the mini scheduler is the odd one out here, as it (currently) runs on the
> worker but shares much of the code from the scheduling path.
>
> -a
>
> On 2 December 2021 17:56:40 GMT, Andrew Godwin
> <andrew.god...@astronomer.io.INVALID> wrote:
>
> I would also like to see some discussion in this AIP about how the data is
> going to be serialised to and from the database instances (obviously
> Connexion is involved, but I presume more transformation code is needed
> than that) and the potential slowdown this would cause. In my experience,
> a somewhat direct ORM mapping like this is going to result in considerably
> slower times for any complex operation that's touching a few hundred rows.
>
> Is there a reason this is being proposed for the scheduler code, too? In
> my mind, the best approach to multitenancy would be to remove all
> user-supplied code from the scheduler and leave it with direct DB access,
> rather than trying to indirect all scheduler access through another API
> layer.
>
> Andrew
>
> On Thu, Dec 2, 2021 at 10:29 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Yeah - I think, Ash, you are completely right that we need some more
> "detailed" clarification. I believe I know what you are - rightfully -
> afraid of (re impact on the code), and maybe we have not done a good job
> of explaining it, with some of the assumptions we had when we worked on it
> with Mateusz. Simply put, it was not clear that our aim is to absolutely
> minimise the impact on the "internal DB transactions" done in schedulers
> and workers. The idea is that the change will at most result in moving the
> execution of the transactions to another process, but not changing what
> the DB transactions do internally. Actually, this was one of the reasons
> for the "alternative" approach (you can see it in the document) we
> discussed - hijacking the "sqlalchemy session" - but that is far too
> low-level, and the aim of the "DB-API" is NOT to replace direct DB calls
> (hence we need to figure out a better name). The API is there to provide a
> "scheduler logic" API and "REST access to Airflow primitives like
> dags/tasks/variables/connections", etc.
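[The worker-side caching of fine-grained calls mentioned earlier could be
as simple as memoising lookups per worker process. A minimal sketch - the
DbApiClient and its get_variable helper are hypothetical carryovers from
the earlier sketch, and the endpoint URL is assumed:

    import functools

    # Hypothetical sketch: memoise fine-grained Variable lookups for the
    # lifetime of a worker process, so repeated reads of the same key cost
    # one HTTP round-trip in total instead of one per read.
    db_api_client = DbApiClient("http://db-api.internal:9080")  # assumed

    @functools.lru_cache(maxsize=1024)
    def get_variable_cached(key: str) -> str:
        return db_api_client.get_variable(key)

The trade-off is staleness: a cached value does not reflect Variable
updates made after the first read within that process.]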
> As an example (which we briefly talked about on Slack), take the
> "_run_mini_scheduler_on_child_tasks" case
> (https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L225-L274)
> - one that we would put in the doc. As we thought of it, this is a single
> "DB-API operation". Those are not pure REST calls, of course; they are
> more RPC-like calls. That is why initially I even thought of separating
> the API completely. But since there are a lot of common "primitive" calls
> that we can re-use, I think having a separate DB-API component which will
> re-use the Connexion implementation, replacing authentication with the
> custom worker <> DB-API authentication, is the way to go. And yes, if we
> agree on the general idea, we need to choose the best way to "connect" the
> REST API we have with the RPC-kind of API we need for some cases in
> workers. But we wanted to make sure we are on the same page with the
> direction. And yes, it means that the DB-API will potentially have to
> handle quite a number of DB operations (and that it has to be replicable
> and scalable as well) - but the DB-API will be "stateless", similarly to
> the webserver, so it will be scalable by definition. And yes, performance
> tests will be part of the POC - likely even before we finally ask for
> votes.
>
> So in short:
> * no modification or impact on current scheduler behaviour when DB
>   isolation is disabled
> * only higher-level methods will be moved out to the DB-API, and we will
>   reuse the existing "REST" APIs where it makes sense
> * we aim to have "0" changes to the logic of processing - both in the DAG
>   processing logic and the DB API; we think that with the architecture we
>   proposed this is perfectly doable
>
> I hope this clarifies things a bit. Once we agree on the general
> direction, we will definitely work on adding more details and
> clarification (we actually already have a lot of that, but we wanted to
> start with explaining the idea and go into more detail later, when we are
> sure there are no "high-level" blockers from the community).
>
> J.
>
> On Thu, Dec 2, 2021 at 4:46 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> > I just provided a general idea for the approach - but if you want me to
> > put more examples then I am happy to do that
>
> Yes please. It is too general for me, and I can't work out what effect it
> would actually have on the code base, especially how it would look with
> the config option to enable/disable direct DB access.
>
> -ash
>
> On Thu, Dec 2 2021 at 16:36:57 +0100, Mateusz Henc
> <mh...@google.com.INVALID> wrote:
>
> Hi,
> I am sorry if it is not clear enough; let me try to explain it here, so
> maybe it sheds more light on the idea. See my comments below.
>
> On Thu, Dec 2, 2021 at 3:39 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>>
>> I'm sorry to say it, but this proposal just doesn't contain enough detail
>> to say what the actual changes to the code would be, and what the impact
>> would be.
>>
>> To take the one example you have so far:
>>
>>     def get_dag_run(self, dag_id, execution_date):
>>         return self.db_client.get_dag_run(dag_id, execution_date)
>>
>> So from this snippet I'm guessing it would be used like this:
>>
>>     dag_run = db_client.get_dag_run(dag_id, execution_date)
>>
>> What type of object is returned?
>
> As it replaces:
>
>     dag_run = session.query(DagRun)
>         .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
>         .first()
>
> then the type of the object will be exactly the same (DagRun).
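[For the answer above to hold, the DB-API server would have to execute the
real query and serialise the result, with the client rebuilding a DagRun
from the payload so callers see the same type in both modes. A minimal
sketch of the server side, assuming a hypothetical handler name and
hand-rolled field-by-field serialisation - this is exactly the
transformation code Andrew asks about:

    # Hypothetical DB-API server-side handler (not from the AIP): runs the
    # real query and returns a JSON-serialisable dict; the client side
    # would rebuild a DagRun object from this payload.
    from airflow.models import DagRun

    def get_dag_run_handler(session, dag_id, execution_date):
        dag_run = (
            session.query(DagRun)
            .filter(
                DagRun.dag_id == dag_id,
                DagRun.execution_date == execution_date,
            )
            .first()
        )
        # Hand-rolled serialisation, for illustration only - a real
        # implementation would need a schema per returned model.
        return {
            "dag_id": dag_run.dag_id,
            "run_id": dag_run.run_id,
            "state": dag_run.state,
            "queued_at": dag_run.queued_at.isoformat()
            if dag_run.queued_at else None,
        }
]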
>> Do we need one API method per individual query we have in the source?
>
> No, as explained by this sentence:
>
> > The method may be extended, accepting more optional parameters, to
> > avoid having too many similar implementations.
>
>> Which components would use this new mode when it's enabled?
>
> You may read:
>
> > Airflow Database API is a new independent component of Airflow. It
> > allows isolating some components (Worker, DagProcessor and Triggerer)
> > from direct access to the DB.
>
>> But what you haven't said the first thing about is what _other_ changes
>> would be needed in the code. To take a fairly simple example:
>>
>>     dag_run = db_client.get_dag_run(dag_id, execution_date)
>>     dag_run.queued_at = timezone.now()
>>     # How do I save this?
>>
>> In short, you need to put a lot more detail into this before we can even
>> have an idea of the full scope of the change this proposal would
>> involve, and what code changes would be needed for components to work
>> with and without this setting enabled.
>
> For this particular example - it depends on the intention of the code
> author:
> - If this should be in a transaction - then I would actually introduce a
>   new method like enqueue_dag_run(...) that would run these two steps on
>   the Airflow DB API side.
> - If not, then maybe just an "update_dag_run" method, accepting the whole
>   "dag_run" object and saving it to the DB.
>
> In general - we could take a naive approach, e.g. replace this code:
>
>     dag_run = session.query(DagRun)
>         .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
>         .first()
>
> with:
>
>     if self.db_isolation:
>         dag_run = db_client.get_dag_run(dag_id, execution_date)
>     else:
>         dag_run = session.query(DagRun)
>             .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date)
>             .first()
>
> The problem is that the Airflow DB API would then need to have the same
> implementation for the query - so, duplicated code. That's why we propose
> moving this code to the DBClient, which is also used by the Airflow DB
> API (in DB-direct mode) - see the sketch at the end of this message.
>
> I know there are many places where the code is much more complicated than
> a single query, but they must be handled one-by-one during the
> implementation, otherwise this AIP would be way too big.
>
> I just provided a general idea for the approach - but if you want me to
> put more examples then I am happy to do that.
>
> Best regards,
> Mateusz Henc
>
>> On Thu, Dec 2 2021 at 14:23:56 +0100, Mateusz Henc
>> <mh...@google.com.INVALID> wrote:
>>
>> Hi,
>> I just added a new AIP for running some Airflow components in
>> DB-isolation mode, without direct access to the Airflow database - they
>> will use a new API for this purpose.
>>
>> PTAL:
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Database+API
>>
>> Open question:
>> I called it "Airflow Database API" - however, I feel it could be more
>> than just an access layer for the database. So if you have a better
>> name, please let me know - I am happy to change it.
>>
>> Best regards,
>> Mateusz Henc
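[A minimal sketch of the shared DBClient layering Mateusz describes above -
class and method names are hypothetical, following the thread rather than
the AIP text. The query logic lives in one place: workers call DBClient
directly when DB isolation is off, while the Airflow DB API server wraps
the same DBClient when isolation is on, so nothing is implemented twice:

    from airflow.models import DagRun
    from airflow.utils import timezone

    # Hypothetical sketch (names illustrative, not from the AIP).
    class DBClient:
        def __init__(self, session):
            self.session = session

        def get_dag_run(self, dag_id, execution_date):
            # The single query implementation shared by both modes.
            return (
                self.session.query(DagRun)
                .filter(
                    DagRun.dag_id == dag_id,
                    DagRun.execution_date == execution_date,
                )
                .first()
            )

        def enqueue_dag_run(self, dag_id, execution_date):
            # Ash's two-step example as one coarse, transactional
            # operation, executed server-side in isolation mode.
            dag_run = self.get_dag_run(dag_id, execution_date)
            dag_run.queued_at = timezone.utcnow()
            self.session.commit()
]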