Re: [Discuss] AIP-44 Airflow Database API

Ash Berlin-Taylor Fri, 03 Dec 2021 08:56:28 -0800

This is a fundamental change to the architecture with significantpossible impacts on performance, and likely requires touching a largeportion of the code base.

Sorry, you're going to have to do expand on the details first and workout what would actually be involved and what the impacts will be: Rightnow I have serious reservations to this approach, so I can't agree onthe high level proposal without an /actual/ proposal (The currentdocument is, at best, an outline, not an actual proposal.)


Sorry to be a grinch right before the weekend.

Ash

On Thu, Dec 2 2021 at 22:47:34 +0100, Jarek Potiuk <ja...@potiuk.com>wrote:

Oh yeah - good point and we spoke about performancetesting/implications.
Performance is something we were discussing as the next step when we
get general "OK" in the direction - we just want to make sure that
there are no "huge" blockers in the way this is proposed and explain
any doubts first, so that the investment in performance part makes
sense. We do not want to spend a lot of time on getting the tests done
and detailed inventory of methods/ API calls to get - only to find out
that this is generally "bad direction".

Just to clarify again - we also considered (alternative option) to
automatically map all the DB methods in the remote calls. But we
dropped that idea - precisely for the reason of performance, and
transaction integrity. So we are NOT mapping DB calls into API calls.
those will be "logical operations" on the database. Generally
speaking, most of the API calls for the "airflow system-level but
executed in worker" calls will be rather "coarse" than fine-grained.
For example, the aforementioned "mini scheduler"  - where we want to
make a single API call and run the whole of it on the DBAPI side. So
there - performance impact is very limited IMHO. And If we see any
other "logic" like that in other parts of the code (zombie detection
as an example). We plan to make a detailed inventory of those once we
get general "Looks good" for the direction. For now we did some
"rough" checking and it seems a plausible approach and quite doable.

One more note - the "fine-grained" ( "variable" update/retrieval,
"connection update retrieval") - via REST API will still be used by
the user's code though (Parsing DAGs, operators, workers and
callbacks). We also plan to make sure that none of the "Community"
operators are using "non-blessed" DB calls (we can do it in our CI).
So at the end of the exercise, all operators, hooks, etc. from the
community will be guaranteed to only use the DB APIs that are
available in the "DB API" module. But there I do not expect pretty
much any performance penalty as those are very fast and rare
operations (and good thing there is that we can cache results of those
in workers/DAG processing).

J.

On Thu, Dec 2, 2021 at 7:16 PM Andrew Godwin
<andrew.god...@astronomer.io.invalid<mailto:andrew.god...@astronomer.io.invalid>> wrote:
Ah, my bad, I missed that. I'd still like to see discussion of theperformance impacts, though.
On Thu, Dec 2, 2021 at 11:14 AM Ash Berlin-Taylor <a...@apache.org<mailto:a...@apache.org>> wrote:
The scheduler was excluded from the components that would use thedbapi - the mini scheduler is the odd one out here is it(currently) runs on the work but shares much of the code from thescheduling path.
 -a
On 2 December 2021 17:56:40 GMT, Andrew Godwin<andrew.god...@astronomer.io.INVALID<mailto:andrew.god...@astronomer.io.INVALID>> wrote:
I would also like to see some discussion in this AIP about howthe data is going to be serialised to and from the databaseinstances (obviously Connexion is involved, but I presume moretransformation code is needed than that) and the potentialslowdown this would cause.
In my experience, a somewhat direct ORM mapping like this isgoing to result in considerably slower times for any complexoperation that's touching a few hundred rows.
Is there a reason this is being proposed for the scheduler code,too? In my mind, the best approach to multitenancy would be toremove all user-supplied code from the scheduler and leave it withdirect DB access, rather than trying to indirect all scheduleraccess through another API layer.
 Andrew
On Thu, Dec 2, 2021 at 10:29 AM Jarek Potiuk <ja...@potiuk.com<mailto:ja...@potiuk.com>> wrote:
 Yeah - I thik Ash you are completely right we need some more
 "detailed" clarification.
I believe, I know what you are - rightfully - afraid of (reimpact onthe code), and maybe we have not done a good job on explainingit with
 some of our assumptions we had when we worked on it with Mateusz.
Simply it was not clear that our aim is to absolutely minimisethe
 impact on the "internal DB transactions" done in schedulers and
 workers. The idea is that change will at most result in moving an
execution of the transactions to another process but notchanging whatthe DB transactions do internally. Actually this was one of thereasonfor the "alternative" approach (you can see it in the document)wediscussed about - hijack "sqlalchemy session" - this is far toolowlevel and the aim of the "DB-API" is NOT to replace direct DBcalls
 (Hence we need to figure out a better name). The API is there to
provide "scheduler logic" API and "REST access to Airflowprimitives
 like dags/tasks/variables/connections" etc..

 As an example (which we briefly talked about in slack) the
 "_run_mini_scheduler_on_child_tasks" case
(<https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L225-L274>)is an example (that we would put in the doc). As we thought ofit -this is a "single DB-API operation". Those are not Pure RESTcalls ofcourse, they are more RPC-like calls. That is why even initiallyIthought of separating the API completely. But since there are alot ofcommon "primitive" calls that we can re-use, I think having aseparateDB-API component which will re-use connexion implementation,replacingauthentication with the custom worker <> DB-API authenticationis theway to go. And yes if we agree on the general idea, we need tochoosethe best way on how to best "connect" the REST API we have withtheRPC-kind of API we need for some cases in workers. But we wantedtomake sure we are on the same page with the direction. And yes itmeans
 that DB-API will potentially have to handle quite a number of DB
operations (and that it has to be replicable and scalable aswell) -but DB-API will be "stateless" similarly as the webserver is, soitwill be scalable by definition. And yest performance tests willbe
 part of POC - likely even before we finally ask for votes there.

 So in short:
* no modification or impact on current scheduler behaviour whenDB
 Isolation is disabled
* only higher level methods will be moved out to DB-API and wewill
 reuse existing "REST" APIS where it makes sense
* we aim to have "0" changes to the logic of processing - bothin Dag
 Processing logic and DB API. We think with this architecture we
 proposed it's perfectly doable
I hope this clarifies a bit, and once we agree on generaldirection,we will definitely work on adding more details and clarification(weactually already have a lot of that but we just wanted to startwithexplaining the idea and going into more details later when weare sure
 there are no "high-level" blockers from the community.

 J,
On Thu, Dec 2, 2021 at 4:46 PM Ash Berlin-Taylor <a...@apache.org<mailto:a...@apache.org>> wrote:
 >
> I just provided a general idea for the approach - but if youwant me to put more examples then I am happy to do that
 >
 >
 > Yes please.
 >
> It is too general for me and I can't work out what effect itwould actually have on the code base, especially how it wouldlook with the config option to enable/disable direct db access.
 >
 > -ash
 >
> On Thu, Dec 2 2021 at 16:36:57 +0100, Mateusz Henc<mh...@google.com.INVALID <mailto:mh...@google.com.INVALID>>wrote:
 >
 > Hi,
> I am sorry if it is not clear enough, let me try to explain ithere, so maybe it gives more light on the idea.
 > See my comments below
 >
> On Thu, Dec 2, 2021 at 3:39 PM Ash Berlin-Taylor<a...@apache.org <mailto:a...@apache.org>> wrote:
 >>
>> I'm sorry to say it, but this proposal right just doesn'tcontain enough detail to say what the actual changes to the codewould be, and what the impact would be
 >>
 >> To take the one example you have so far:
 >>
 >>
 >>   def get_dag_run(self, dag_id, execution_date):
 >>     return self.db_client.get_dag_run(dag_id,execution_date)
 >>
 >> So form this snippet I'm guessing it would be used like this:
 >>
 >>     dag_run = db_client.get_dag_run(dag_id, execution_date)
 >>
 >> What type of object is returned?
 >
 >
 > As it replaces:
 > dag_run = session.query(DagRun)
> .filter(DagRun.dag_id == dag_id, DagRun.execution_date ==execution_date)
 >   .first()
 >
 > then the type of the object will be exactly the same (DagRun) .
 >
 >>
 >>
>> Do we need one API method per individual query we have in thesource?
 >
 >
 > No, as explained by the sentence:
 >
> The method may be extended, accepting more optional parametersto avoid having too many similar implementations.
 >
 >
 >>
 >>
 >> Which components would use this new mode when it's enabled?
 >
 >
 > You may read:
> Airflow Database APi is a new independent component ofAirflow. It allows isolating some components (Worker,DagProcessor and Triggerer) from direct access to DB.
 >
 >>
>> But what you haven't said the first thing about is what_other_ changes would be needed in the code. To take a fairlysimple example:
 >>
 >>     dag_run = db_client.get_dag_run(dag_id, execution_date)
 >>     dag_run.queued_at = timezone.now()
 >>     # How do I save this?
 >>
>> In short, you need to put a lot more detail into this beforewe can even have an idea of the full scope of the change thisproposal would involve, and what code changes would be needed forcompnents to work with and without this setting enabled.
 >
 >
> For this particular example - it depends on the intention ofthe code author> - If this should be in transaction - then I would actuallyintroduce new method like enqueue_dag_run(...) that would runthese two steps on Airflow DB API side> - if not then, maybe just the "update_dag_run" methodaccepting the whole "dag_run" object and saving it to the DB.
 >
 > In general - we could take naive approach, eg replace code:
 > dag_run = session.query(DagRun)
> .filter(DagRun.dag_id == dag_id, DagRun.execution_date ==execution_date)
 >   .first()
 > with:
 > if self.db_isolation:
 >   dag_run = session.query(DagRun)
> .filter(DagRun.dag_id == dag_id, DagRun.execution_date ==execution_date)
 >     .first()
 > else:
 >   dag_run = db_client.get_dag_run(self, dag_id, execution_date)
 >
> The problem is that Airflow DB API would need to have the sameimplementation for the query - so duplicated code. That's why wepropose moving this code to the DBClient which is also used bythe Airflow DB API(in DB direct mode).
 >
> I know there are many places where the code is much morecomplicated than a single query, but they must be handledone-by-one, during the implementation, otherwise this AIP wouldbe way too big.
 >
> I just provided a general idea for the approach - but if youwant me to put more examples then I am happy to do that
 >
 > Best regards,
 > Mateusz Henc
 >
 >>
>> On Thu, Dec 2 2021 at 14:23:56 +0100, Mateusz Henc<mh...@google.com.INVALID <mailto:mh...@google.com.INVALID>>wrote:
 >>
 >> Hi,
>> I just added a new AIP for running some Airflow components inDB-isolation mode, without direct access to the Airflow Database,but they will use a new API for thi purpose.
 >>
 >> PTAL:
>><https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Database+API>
 >>
 >> Open question:
>> I called it "Airflow Database API" - however I feel it couldbe more than just an access layer for the database. So if youhave a better name, please let me know, I am happy to change it.
 >>
 >> Best regards,
 >> Mateusz Henc

Re: [Discuss] AIP-44 Airflow Database API

Reply via email to