Re: [Discuss] AIP-44 Airflow Database API

Jarek Potiuk Tue, 01 Feb 2022 06:12:21 -0800

Since we have AIP-43 already approved, I think I would love to have more
questions and discussions about AIP-44 - The "Airflow Internal API"
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API
(as the new name is).


For those who would like to get more context - recording of the meeting
where both AIP-43 and AIP-44 scope and proposals were discussed can be
found here:
https://drive.google.com/file/d/1SMFzazuY1kg4B4r11wNt8EQ_PmTDRKq6/view

In the AIP I just made a small clarification regarding some "future"
changes  - specifically the token security might be nicely handled together
with AIP-46 "Add support for docker runtime isolation" that is proposed by
Ping.

My goal is to gather comments till the end of the week, and if there will
be no big concerns, I would love to start voting next week.

J.


On Mon, Jan 3, 2022 at 2:48 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Also AIP-44 - which is the DB isolation mode is much more detailed and
> ready for deeper discussion if needed.
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Database+API
>
> On Tue, Dec 14, 2021 at 12:07 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > And just to add to that.
> >
> > Thanks again for the initial comments and pushing us to provide more
> > details. That allowed us to discuss and focus on many of the aspects
> > that were raised and we have many more answers now:
> >
> > * first of all - we focused on making sure impact on the existing code
> > and "behavior" of Airflow is minimal. In fact there should be
> > virtually no change vs. current behavior when db isolation is
> > disabled.
> > * secondly - we've done a full inventory of how the API needed should
> > look like and (not unsurprisingly) it turned out that the current
> > REST-style API is good for part of it but most of the 'logic" of
> > airflow can be done efficiently when we go to RPC-style API. However
> > we propose that the authorization/exposure of the API is the same as
> > we use currently in the REST API, this will allow us to reuse a big
> > part of the infrastructure we already have
> > * thirdly - we took deep into our hearts the comments about having to
> > maintain pretty much the same logic in a few different places. That's
> > an obvious maintenance problem. The proposal we came up with addresses
> > it - we are going to keep the logic of Airflow internals in one place
> > only and we will simply smartly route on where the logic will be
> > executed
> > * regarding the performance impact - we described the deployment
> > options that our proposal makes available - we do not want to favor
> > one deployment option over another, but we made sure the architecture
> > is done in the way, that you can choose which deployment is good for
> > you: "no isolation", "partial isolation", "full isolation" - each with
> > different performance/resource characteristics - all of them fully
> > horizontally scalable and nicely manageable.
> >
> > We look forward to comments, also for the API 43 -
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation
> > - AIP-43 is a prerequiste to AIP-44 and they both work together.
> >
> > We also will think and discuss more follow-up AIPs once we get those
> > approved (hopefully ;) ). The multi-tenancy is a 'long haul" and while
> > those two AIPs are foundational building blocks, there are at least
> > few more follow-up AIPs to be able to say "we're done with
> > Multi-tenancy" :).
> >
> > J.
> >
> >
> >
> > On Tue, Dec 14, 2021 at 9:58 AM Mateusz Henc <mh...@google.com.invalid>
> wrote:
> > >
> > > Hi,
> > >
> > > As promised we (credits to Jarek) updated the AIPs, added more
> details, did inventory and changed the way API endpoints are generated.
> > >
> > > We also renamed it to Airflow Internal API - so the url has changed:
> > >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API
> > >
> > > Please take a look, any comments are highly appreciated.
> > >
> > > Best regards,
> > > Mateusz Henc
> > >
> > >
> > > On Mon, Dec 6, 2021 at 3:09 PM Mateusz Henc <mh...@google.com> wrote:
> > >>
> > >> Hi,
> > >> Thank you Ash for your feedback (in both AIPs)
> > >>
> > >> We are working to address your concerns. We will update the AIPs in a
> few days.
> > >> I will let you know when it's done.
> > >>
> > >> Best regards,
> > >> Mateusz Henc
> > >>
> > >>
> > >> On Fri, Dec 3, 2021 at 7:27 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > >>>
> > >>> Cool. Thanks for the guidance.
> > >>>
> > >>> On Fri, Dec 3, 2021 at 6:37 PM Ash Berlin-Taylor <a...@apache.org>
> wrote:
> > >>> >
> > >>> > - make an inventory: It doesn't need to be exhaustive, but a
> representative sample.
> > >>> > - More clearly define _what_ the API calls return -- object type,
> methods on them etc.
> > >>> >
> > >>> > From the AIP you have this example:
> > >>> >
> > >>> >   def get_dag_run(self, dag_id, execution_date):
> > >>> >     return self.db_client.get_dag_run(dag_id,execution_date)
> > >>> >
> > >>> > What does that _actually_ return? What capabilities does it have?
> > >>> >
> > >>> > (I have other thoughts but those are less fundamental and can be
> discussed later)
> > >>> >
> > >>> > -ash
> > >>> >
> > >>> > On Fri, Dec 3 2021 at 18:20:21 +0100, Jarek Potiuk <
> ja...@potiuk.com> wrote:
> > >>> >
> > >>> > Surely - if you think that we need to do some more work to get
> confidence, that's fine. I am sure we can improve it to the level that we
> will not have to do full performance tests, and you are confident in the
> direction. Just to clarify your concerns and make sure we are on the same
> page - as I understand it should: * make an inventory of "actual changes"
> the proposal will involve in the database-low-level code of Airflow * based
> on that either assessment that those changes are unlikely (or likely make a
> performance impact) * if we asses that it is likely to have an impact, some
> at least rudimentary performance tests to prove that this is manageable I
> think that might be a good exercise to do. Does it sound about right? Or do
> you have any concerns about certain architectural decisions taken? No
> problem with Friday, but if we get answers today. I think it will give us
> time to think about it over the weekend and address it next week. J. On
> Fri, Dec 3, 2021 at 5:56 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> > >>> >
> > >>> > This is a fundamental change to the architecture with significant
> possible impacts on performance, and likely requires touching a large
> portion of the code base. Sorry, you're going to have to do expand on the
> details first and work out what would actually be involved and what the
> impacts will be: Right now I have serious reservations to this approach, so
> I can't agree on the high level proposal without an actual proposal (The
> current document is, at best, an outline, not an actual proposal.) Sorry to
> be a grinch right before the weekend. Ash On Thu, Dec 2 2021 at 22:47:34
> +0100, Jarek Potiuk <ja...@potiuk.com> wrote: Oh yeah - good point and we
> spoke about performance testing/implications. Performance is something we
> were discussing as the next step when we get general "OK" in the direction
> - we just want to make sure that there are no "huge" blockers in the way
> this is proposed and explain any doubts first, so that the investment in
> performance part makes sense. We do not want to spend a lot of time on
> getting the tests done and detailed inventory of methods/ API calls to get
> - only to find out that this is generally "bad direction". Just to clarify
> again - we also considered (alternative option) to automatically map all
> the DB methods in the remote calls. But we dropped that idea - precisely
> for the reason of performance, and transaction integrity. So we are NOT
> mapping DB calls into API calls. those will be "logical operations" on the
> database. Generally speaking, most of the API calls for the "airflow
> system-level but executed in worker" calls will be rather "coarse" than
> fine-grained. For example, the aforementioned "mini scheduler" - where we
> want to make a single API call and run the whole of it on the DBAPI side.
> So there - performance impact is very limited IMHO. And If we see any other
> "logic" like that in other parts of the code (zombie detection as an
> example). We plan to make a detailed inventory of those once we get general
> "Looks good" for the direction. For now we did some "rough" checking and it
> seems a plausible approach and quite doable. One more note - the
> "fine-grained" ( "variable" update/retrieval, "connection update
> retrieval") - via REST API will still be used by the user's code though
> (Parsing DAGs, operators, workers and callbacks). We also plan to make sure
> that none of the "Community" operators are using "non-blessed" DB calls (we
> can do it in our CI). So at the end of the exercise, all operators, hooks,
> etc. from the community will be guaranteed to only use the DB APIs that are
> available in the "DB API" module. But there I do not expect pretty much any
> performance penalty as those are very fast and rare operations (and good
> thing there is that we can cache results of those in workers/DAG
> processing). J. On Thu, Dec 2, 2021 at 7:16 PM Andrew Godwin <
> andrew.god...@astronomer.io.invalid> wrote: Ah, my bad, I missed that.
> I'd still like to see discussion of the performance impacts, though. On
> Thu, Dec 2, 2021 at 11:14 AM Ash Berlin-Taylor <a...@apache.org> wrote:
> The scheduler was excluded from the components that would use the dbapi -
> the mini scheduler is the odd one out here is it (currently) runs on the
> work but shares much of the code from the scheduling path. -a On 2 December
> 2021 17:56:40 GMT, Andrew Godwin <andrew.god...@astronomer.io.INVALID>
> wrote: I would also like to see some discussion in this AIP about how the
> data is going to be serialised to and from the database instances
> (obviously Connexion is involved, but I presume more transformation code is
> needed than that) and the potential slowdown this would cause. In my
> experience, a somewhat direct ORM mapping like this is going to result in
> considerably slower times for any complex operation that's touching a few
> hundred rows. Is there a reason this is being proposed for the scheduler
> code, too? In my mind, the best approach to multitenancy would be to remove
> all user-supplied code from the scheduler and leave it with direct DB
> access, rather than trying to indirect all scheduler access through another
> API layer. Andrew On Thu, Dec 2, 2021 at 10:29 AM Jarek Potiuk <
> ja...@potiuk.com> wrote: Yeah - I thik Ash you are completely right we
> need some more "detailed" clarification. I believe, I know what you are -
> rightfully - afraid of (re impact on the code), and maybe we have not done
> a good job on explaining it with some of our assumptions we had when we
> worked on it with Mateusz. Simply it was not clear that our aim is to
> absolutely minimise the impact on the "internal DB transactions" done in
> schedulers and workers. The idea is that change will at most result in
> moving an execution of the transactions to another process but not changing
> what the DB transactions do internally. Actually this was one of the reason
> for the "alternative" approach (you can see it in the document) we
> discussed about - hijack "sqlalchemy session" - this is far too low level
> and the aim of the "DB-API" is NOT to replace direct DB calls (Hence we
> need to figure out a better name). The API is there to provide "scheduler
> logic" API and "REST access to Airflow primitives like
> dags/tasks/variables/connections" etc.. As an example (which we briefly
> talked about in slack) the "_run_mini_scheduler_on_child_tasks" case (
> https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L225-L274)
> is an example (that we would put in the doc). As we thought of it - this is
> a "single DB-API operation". Those are not Pure REST calls of course, they
> are more RPC-like calls. That is why even initially I thought of separating
> the API completely. But since there are a lot of common "primitive" calls
> that we can re-use, I think having a separate DB-API component which will
> re-use connexion implementation, replacing authentication with the custom
> worker <> DB-API authentication is the way to go. And yes if we agree on
> the general idea, we need to choose the best way on how to best "connect"
> the REST API we have with the RPC-kind of API we need for some cases in
> workers. But we wanted to make sure we are on the same page with the
> direction. And yes it means that DB-API will potentially have to handle
> quite a number of DB operations (and that it has to be replicable and
> scalable as well) - but DB-API will be "stateless" similarly as the
> webserver is, so it will be scalable by definition. And yest performance
> tests will be part of POC - likely even before we finally ask for votes
> there. So in short: * no modification or impact on current scheduler
> behaviour when DB Isolation is disabled * only higher level methods will be
> moved out to DB-API and we will reuse existing "REST" APIS where it makes
> sense * we aim to have "0" changes to the logic of processing - both in Dag
> Processing logic and DB API. We think with this architecture we proposed
> it's perfectly doable I hope this clarifies a bit, and once we agree on
> general direction, we will definitely work on adding more details and
> clarification (we actually already have a lot of that but we just wanted to
> start with explaining the idea and going into more details later when we
> are sure there are no "high-level" blockers from the community. J, On Thu,
> Dec 2, 2021 at 4:46 PM Ash Berlin-Taylor <a...@apache.org> wrote: > > I
> just provided a general idea for the approach - but if you want me to put
> more examples then I am happy to do that > > > Yes please. > > It is too
> general for me and I can't work out what effect it would actually have on
> the code base, especially how it would look with the config option to
> enable/disable direct db access. > > -ash > > On Thu, Dec 2 2021 at
> 16:36:57 +0100, Mateusz Henc <mh...@google.com.INVALID> wrote: > > Hi, >
> I am sorry if it is not clear enough, let me try to explain it here, so
> maybe it gives more light on the idea. > See my comments below > > On Thu,
> Dec 2, 2021 at 3:39 PM Ash Berlin-Taylor <a...@apache.org> wrote: >> >>
> I'm sorry to say it, but this proposal right just doesn't contain enough
> detail to say what the actual changes to the code would be, and what the
> impact would be >> >> To take the one example you have so far: >> >> >> def
> get_dag_run(self, dag_id, execution_date): >> return
> self.db_client.get_dag_run(dag_id,execution_date) >> >> So form this
> snippet I'm guessing it would be used like this: >> >> dag_run =
> db_client.get_dag_run(dag_id, execution_date) >> >> What type of object is
> returned? > > > As it replaces: > dag_run = session.query(DagRun) >
> .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date) >
> .first() > > then the type of the object will be exactly the same (DagRun)
> . > >> >> >> Do we need one API method per individual query we have in the
> source? > > > No, as explained by the sentence: > > The method may be
> extended, accepting more optional parameters to avoid having too many
> similar implementations. > > >> >> >> Which components would use this new
> mode when it's enabled? > > > You may read: > Airflow Database APi is a new
> independent component of Airflow. It allows isolating some components
> (Worker, DagProcessor and Triggerer) from direct access to DB. > >> >> But
> what you haven't said the first thing about is what _other_ changes would
> be needed in the code. To take a fairly simple example: >> >> dag_run =
> db_client.get_dag_run(dag_id, execution_date) >> dag_run.queued_at =
> timezone.now() >> # How do I save this? >> >> In short, you need to put a
> lot more detail into this before we can even have an idea of the full scope
> of the change this proposal would involve, and what code changes would be
> needed for compnents to work with and without this setting enabled. > > >
> For this particular example - it depends on the intention of the code
> author > - If this should be in transaction - then I would actually
> introduce new method like enqueue_dag_run(...) that would run these two
> steps on Airflow DB API side > - if not then, maybe just the
> "update_dag_run" method accepting the whole "dag_run" object and saving it
> to the DB. > > In general - we could take naive approach, eg replace code:
> > dag_run = session.query(DagRun) > .filter(DagRun.dag_id == dag_id,
> DagRun.execution_date == execution_date) > .first() > with: > if
> self.db_isolation: > dag_run = session.query(DagRun) >
> .filter(DagRun.dag_id == dag_id, DagRun.execution_date == execution_date) >
> .first() > else: > dag_run = db_client.get_dag_run(self, dag_id,
> execution_date) > > The problem is that Airflow DB API would need to have
> the same implementation for the query - so duplicated code. That's why we
> propose moving this code to the DBClient which is also used by the Airflow
> DB API(in DB direct mode). > > I know there are many places where the code
> is much more complicated than a single query, but they must be handled
> one-by-one, during the implementation, otherwise this AIP would be way too
> big. > > I just provided a general idea for the approach - but if you want
> me to put more examples then I am happy to do that > > Best regards, >
> Mateusz Henc > >> >> On Thu, Dec 2 2021 at 14:23:56 +0100, Mateusz Henc
> <mh...@google.com.INVALID> wrote: >> >> Hi, >> I just added a new AIP for
> running some Airflow components in DB-isolation mode, without direct access
> to the Airflow Database, but they will use a new API for thi purpose. >> >>
> PTAL: >>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Database+API
> >> >> Open question: >> I called it "Airflow Database API" - however I feel
> it could be more than just an access layer for the database. So if you have
> a better name, please let me know, I am happy to change it. >> >> Best
> regards, >> Mateusz Henc
>

Re: [Discuss] AIP-44 Airflow Database API

Reply via email to