+1 (binding) now with that change. Thank you very much.
I'm also okay with proceeding with only gRPC for now -- with this
architectural change it's much easier to replace it in the future if
we want/need to, and for others (such as service providers like
Google/Amazon/Astronomer) to experiment with custom API
implementations. I'm not able to point at any examples, as the ones I
know of personally aren't open source; they were all written for
companies' projects/products.
I am sorry that we missed it -- speaking personally, I've had _a lot_
going on at home and I've barely had time to "look around" at what
else is going on in Airflow over the last few months. I'm only now
getting the head space to actually think in detail about things other
than what is right in front of me, hence why it came now.
-ash
On Thu, Aug 11 2022 at 09:27:39 +02:00:00, Jarek Potiuk
<ja...@potiuk.com> wrote:
I made the updates in
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API>
to reflect the above. As I suspected, very few changes were needed to
make it "gRPC/JSON OpenAPI" agnostic. It does not change the "gist"
and "purpose" of the AIP, and we can easily turn it into an
implementation detail after the extended POC is complete.
On Thu, Aug 11, 2022 at 9:08 AM Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
Hey Ash, Andrew,
TL;DR: I slept on it to try to understand what just happened, and I
have a proposal.
* I am happy to extend my POC with the pure JSON/OpenAPI approach -
provided that Andrew/Ash point me to some good examples where an
RPC-style API has been implemented that way. If you could point me in
the right direction, I am a maverick kind of person and will just
find my way there.
* I propose we conclude the vote (with the added reservation that the
final choice of the transport will be based on the extended POC) as
planned.
* I am rather curious to try the things that both of you mentioned
you are concerned about (threading/multiprocessing/load balancing)
and see how they compare in both approaches - before making the
final decision and doing the first, founding PR.
More comments (personal feeling):
I really do take your comments seriously - now that they have
finally come. And I value your expert knowledge and experience. And
I personally expect the same in return. I always respond to
AIPs/discussions when I am given enough time and opportunity to
comment - in due time. This time it did not happen, and I am not
sure why. I personally feel this is a little disrespectful to my
work, but I do not hold any grudges. I am happy to put that
completely aside - now that I have your attention and can use your
experience. The goal is that we do not make any personal complaints
or escalations here; we are all here to make Airflow a better
product. Community is the most important thing, and respect for each
other's work is an important part of it. I hope this can improve in
future AIP discussions.
But going back to the merit of the discussion:
Just to let everyone know, I am not too "vested" in gRPC. I spent a
little time on implementing the POC with it, and after doing the POC
and reading people's testimonies I feel it is the right choice for
this AIP. But I made deliberate decisions so that the transport can
be swapped out if we find reasons we should.
I am reasonably happy with the way proposed by Ash (in fact I am
going to update the AIP with it in a moment). For me, how we
actually implement the "if" is an implementation detail that will be
worked out in the first PR. The way proposed is good for me, though
I would rather experiment a bit with decorators and see if we can
make it a bit nicer - but this is not a dealbreaker at all for me.
One concern I have is that we will have another (possibly needless)
abstraction layer and that we will again have to repeat all the
method signatures and parameters and keep them in sync (they will
now be repeated in 4 or 5 places). But again - this is something
that can likely be done in the first "founding" PR, where we will
iterate and work out the best balance between duplication and
flexibility/maintainability. And we can always update the AIP with
this implementation detail later - very much like it happened with a
number of AIPs before, including AIP-39, AIP-40 and AIP-42 - all of
which went through similar rounds of updates and clarifications as
implementation progressed. It's hard to work out all the details "on
paper" or even in a POC. Also, I would REALLY love to tap into the
experience of people like the both of you, but it seems that the
only way to get some good and serious feedback is to call a vote (or
make a PR with the intention of merging it).
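To make the decorator idea mentioned above concrete, here is a
minimal sketch of how a single decorator could route each method
either to the local DB implementation or to a remote Internal API
client, keyed off one global setting. All names here (`settings`,
`InternalAPIClient`, `internal_api_call`, `process_file`) are
illustrative stand-ins, not the actual AIP-44 implementation:

```python
import functools


class settings:
    # Stand-in for the real airflow.settings module; the flag name
    # follows Ash's suggestion but this whole class is illustrative.
    DATABASE_ACCESS_ISOLATION = False


class InternalAPIClient:
    # Hypothetical remote client; a real implementation would issue
    # the gRPC (or JSON/HTTP) call here.
    @staticmethod
    def process_file(file_path):
        return f"remote:{file_path}"


def internal_api_call(remote_method):
    """Route a DB-backed method to its remote equivalent when DB
    isolation mode is enabled, otherwise run it locally."""
    def decorator(local_method):
        @functools.wraps(local_method)
        def wrapper(*args, **kwargs):
            if settings.DATABASE_ACCESS_ISOLATION:
                return remote_method(*args, **kwargs)
            return local_method(*args, **kwargs)
        return wrapper
    return decorator


@internal_api_call(InternalAPIClient.process_file)
def process_file(file_path):
    # Local, direct-DB-access implementation.
    return f"local:{file_path}"
```

With something like this, the "if" lives in exactly one place, and
swapping the transport means changing only the client passed to the
decorator - which is the swappability property discussed above.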
But I can even go further than that. I think that, independently of
voting on whether the whole AIP-44 is a good or bad idea, there is
no doubt we need it, and the "general scope" and approach seem to
have already reached general consensus - so let's just continue and
complete the vote. I am happy to continue running the POC using the
OpenAPI spec and gRPC in parallel and see how they compare. I am
always eager to learn and try other things, and if you have valid
concerns I am happy to address them by trying things out. I would
personally like to compare both from the development "friction"
point of view and performance, as well as do some tests to address
the operational (process/thread/load balancing) concerns you both
have, see how they can be solved in each, and compare. I think there
is nothing like "show me the code" and an actual working POC.
And I am super happy to continue with the POC and extend it with a
pure JSON/OpenAPI-based proposal after voting completes, and make
the final decision during the first founding PR. And we can even
arrange extra votes or lazy consensus before the first PR lands -
after seeing all the "ins/outs".
The founding PR is still quite a bit away - I do not want to make
any commits before we branch off 2.4 - and I don't even want to take
too much of your time for it, other than discussion, raising
concerns and commenting on my findings. I am happy to do all the
ground work here.
That's all I ask for. Just treating the work I do seriously.
So, Ash, Andrew:
Can you please point me to some examples where an RPC-style API like
ours has been implemented with OpenAPI/JSON? I am curious to learn
from those and turn them into a POC.
And I propose - let's just continue the vote as planned. We already
have 3 binding votes, and more +1s than -1s, and the time has
already passed, but I am happy to let it run till the end of day
tomorrow to see if my proposal above lets us conclude the vote with
more consent from all the people involved.
J.
On Thu, Aug 11, 2022 at 7:49 AM Eugen Kosteev <eu...@kosteev.com
<mailto:eu...@kosteev.com>> wrote:
+1 (non-binding)
On Thu, Aug 11, 2022 at 12:08 AM Ash Berlin-Taylor <a...@apache.org
<mailto:a...@apache.org>> wrote:
So my concerns with gRPC (in addition to the ones Andrew already
pointed out): it uses threads! Threads + Python always make me
nervous. Doubly so when you then couple that with DB access via
SQLAlchemy - we're heading down paths less well travelled if we do
this.
From <https://github.com/grpc/grpc/blob/master/doc/fork_support.md>
> gRPC Python wraps gRPC core, which uses multithreading for
performance, and hence doesn't support fork(). Historically, we
didn't support forking in gRPC, but some users seemed to be doing
fine until their code started to break on version 1.6. This was
likely caused by the addition of background c-threads and a
background Python thread
And there's <https://github.com/grpc/grpc/issues/16001> which may
or may not be a problem for us, I'm unclear:
> The main constraint that must be satisfied is that your process
must only invoke the fork() syscall /before/ you create your gRPC
server(s).
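If that constraint does apply to us, one way to honour it is to fork
worker processes first and only create the server inside each child.
The sketch below shows just that ordering; `create_server()` is a
hypothetical placeholder for the real `grpc.server(...)` call (grpc
itself is deliberately not imported here):

```python
import multiprocessing

# Use an explicit "fork" context so the ordering being demonstrated
# (fork first, server second) is actually what happens.
ctx = multiprocessing.get_context("fork")


def create_server():
    # Placeholder for building and starting a gRPC server; in real
    # code this is where grpc.server(...) would be called.
    return "server-created-after-fork"


def worker(result_queue):
    # We are now in the child: the fork() has already happened, so
    # creating the server here satisfies the documented constraint.
    result_queue.put(create_server())


def main(num_workers=2):
    queue = ctx.Queue()
    workers = [
        ctx.Process(target=worker, args=(queue,))
        for _ in range(num_workers)
    ]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    return [queue.get() for _ in range(num_workers)]
```

Whether this pattern fits the DAG processor's actual process model is
exactly the kind of thing the extended POC would need to verify.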
But anyway, the only change I'd like to see is to make the
internal API more pluggable.
So instead of something like this:
def process_file(
    self,
    file_path: str,
    callback_requests: List[CallbackRequest],
    pickle_dags: bool = False,
) -> Tuple[int, int]:
    if self.use_grpc:
        return self.process_file_grpc(
            file_path=file_path,
            callback_requests=callback_requests,
            pickle_dags=pickle_dags,
        )
    return self.process_file_db(
        file_path=file_path,
        callback_requests=callback_requests,
        pickle_dags=pickle_dags,
    )
I'd very much like us to have it be something like this:
def process_file(
    self,
    file_path: str,
    callback_requests: List[CallbackRequest],
    pickle_dags: bool = False,
) -> Tuple[int, int]:
    if settings.DATABASE_ACCESS_ISOLATION:
        return InternalAPIClient.dagbag_process_file(
            file_path=file_path,
            callback_requests=callback_requests,
            pickle_dags=pickle_dags,
        )
    return self.process_file_db(
        file_path=file_path,
        callback_requests=callback_requests,
        pickle_dags=pickle_dags,
    )
i.e. all API access is "marshalled" (perhaps the wrong word) via a
single pluggable (and eventually configurable/replaceable) API
client. Additionally (though less important), `self.use_grpc` is
changed to a property on the `airflow.settings` module (as it is a
global setting, not an attribute/property of any single instance).
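As an aside, Python supports exactly this kind of module-level
"property" via PEP 562: a module-level `__getattr__` is consulted for
any attribute not found in the module's namespace, so the flag can be
computed on every read rather than stored on an instance. The sketch
below demonstrates the mechanism with a throwaway module; the
environment variable name is made up for the example:

```python
import os
import types


def _settings_getattr(name):
    if name == "DATABASE_ACCESS_ISOLATION":
        # Hypothetical: derive the flag from configuration each time
        # it is read, instead of caching it anywhere.
        return os.environ.get("DB_ISOLATION_DEMO", "False") == "True"
    raise AttributeError(name)


# Build a throwaway module to demonstrate; in Airflow the __getattr__
# hook would simply live at the top level of airflow/settings.py.
settings = types.ModuleType("settings_demo")
settings.__getattr__ = _settings_getattr
```

Accessing `settings.DATABASE_ACCESS_ISOLATION` then reflects the
current configuration at call time, which is the behaviour a global
setting (as opposed to an instance attribute) should have.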
(In my mind the API client would include the
from_protobuff/to_protobuff methods that you added to TaskInstance
in your POC PR. Or the from_protobuff could live in the
FileProcessorServiceServicer class in your example, but that
probably doesn't scale when multiple services would take a
TI/SimpleTI. I guess they could still also live on TaskInstance et
al. and just not be used - but that doesn't feel as clean to me is
all.)
Thoughts? It's not too big a change to encapsulate things like this,
I hope?
Sorry again that we didn't look at the recent work on this AIP
sooner.
-ash
On Wed, Aug 10 2022 at 13:51:02 -06:00:00, Andrew Godwin
<andrew.god...@astronomer.io.INVALID> wrote:
I also wish we'd had this discussion before!
I've converted several million lines of code into API-driven
services over the past decade so I have my ways and I'm set in
them, I guess :)
Notice that I didn't say "use REST" - I don't think REST maps
well to RPC style codebases, and agree the whole method thing is
confusing. I just mean "ship JSON in a POST and receive JSON in
the response".
As you said before though, the line where you draw the
abstraction is what matters more than the transport layer, and
"fat endpoints" (doing transactions and multiple calls on the API
server etc.) is, I agree, the way this has to go, so it's not
like this is something I think is totally wrong, just that I've
repeatedly tried gRPC on projects like this and been disappointed
with what it actually takes to ship changes and prototype things
quickly. Nothing beats the ability to just throw curl at a
problem - or maybe that's just me.
Anyway, I'll leave you to it - I have my own ideas in this area
I'll be noodling on, but it's a bit of a different take and aimed
more at execution as a whole, so I'll come back and discuss them
if they're successful.
Andrew
On Wed, Aug 10, 2022 at 1:36 PM Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
1st of all - I wish we had this discussion before :)
> The autogenerated code also is available from OpenAPI specs,
of course, and the request/response semantics in the same thread
are precisely what make load balancing these calls a little
harder, and horizontal scaling to multiple threads comes with
every HTTP server, but I digress - clearly you've made a choice
here to write an RPC layer rather than a lightweight API layer,
and go with the more RPC-style features, and I get that.
OpenAPI is REST, not RPC, and it does not map well to RPC-style
calls. I tried to explain this in detail in the AIP (and in earlier
discussions). And the choice is basically made for us because of our
expectation to have minimal impact on the existing code (and the
ability to switch off the remote part). We really do not want to
introduce new API calls. We want to make sure that existing "DB
transactions" (i.e. coarse-grained calls) are remotely executed. So
we are not really talking about a lightweight API, almost by
definition. Indeed, OpenAPI also maps a declaratively described
definition to Python code, but it has the awkward property that
OpenAPI/REST is supposed to operate on resources. We have no
resources to run it on - every single method we call is doing
something. Even within REST/OpenAPI semantics I'd have a really hard
time deciding whether a call should be GET, POST, PUT or DELETE. In
most cases it would be a combination of those four on several
different resources. All the "nice parts" of OpenAPI (Swagger UI
etc.) become next to useless if you try to map remote procedures
like ours to REST calls.
> I still think it's choosing something that's more complex to
maintain over a simpler, more accessible option, but since I
won't be the one writing it, that's not really for me to worry
about. I am very curious to see how this evolves as the work
towards multitenancy really kicks in and all the API schemas
need to change to add namespacing!
The gRPC (proto) is designed from the ground up with maintainability
in mind. The ability to evolve the API, add new parameters etc. is
built into the protobuf definition. From the user perspective it's
actually easier to use than OpenAPI when it comes to remote method
calls, because you basically - call a method.
And also, coming back to monitoring - literally every monitoring
platform supports monitoring gRPC: Grafana, NewRelic, CloudWatch,
Datadog, you name it.
Also, load balancing:
Regardless of the choice, we are talking about an HTTP
request/response happening for one call. This is the boundary. Each
of the calls we have will be a separate transaction, a separate
call, not connected to any other call. The server handling it will
be stateless (state will be stored in the DB when each call
completes). I deliberately put the "boundary" of each of the
remotely callable methods at a complete DB transaction to achieve
this.
So it really does not change whether we use gRPC or REST/JSON.
REST/JSON vs. gRPC is just the content of the message - it is the
very same HTTP call, with the same authentication added on top and
the same headers; only how the message is encoded differs. The same
tools for load balancing work in the same way regardless of whether
we use gRPC or REST/JSON. This is really a higher layer than the one
involved in load balancing.
On Wed, Aug 10, 2022 at 9:10 PM Andrew Godwin
<andrew.god...@astronomer.io.invalid> wrote:
Well, binary representation serialization performance is worse
for protobuf than for JSON in my experience (in Python
specifically) - unless you mean size-on-the-wire, which I can
agree with but tend to not worry about very much since it's
rarely a bottleneck.
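Claims in either direction here are easy to check with a
micro-benchmark. The sketch below shows only the JSON half using the
stdlib (a protobuf comparison would require generated message
classes, so it is omitted); the payload shape is purely illustrative,
loosely modelled on task-instance metadata:

```python
import json
import timeit

# Illustrative payload; not a real Airflow message.
payload = {
    "task_id": "example_task",
    "dag_id": "example_dag",
    "try_number": 3,
    "state": "running",
    "note": "x" * 256,
    "indexes": list(range(100)),
}

encoded = json.dumps(payload)
encode_time = timeit.timeit(lambda: json.dumps(payload), number=1000)
decode_time = timeit.timeit(lambda: json.loads(encoded), number=1000)
print(f"json encode: {encode_time:.4f}s, "
      f"decode: {decode_time:.4f}s (1000 rounds)")
```

Running the equivalent loop over protobuf-generated classes on the
same payload would settle the serialization-speed question for the
message shapes AIP-44 actually uses.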
The autogenerated code also is available from OpenAPI specs, of
course, and the request/response semantics in the same thread
are precisely what make load balancing these calls a little
harder, and horizontal scaling to multiple threads comes with
every HTTP server, but I digress - clearly you've made a choice
here to write an RPC layer rather than a lightweight API layer,
and go with the more RPC-style features, and I get that.
I still think it's choosing something that's more complex to
maintain over a simpler, more accessible option, but since I
won't be the one writing it, that's not really for me to worry
about. I am very curious to see how this evolves as the work
towards multitenancy really kicks in and all the API schemas
need to change to add namespacing!
Andrew
On Wed, Aug 10, 2022 at 12:52 PM Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
Sure, I can explain - the main reasons are:
- 1) binary representation performance - the impact of this is
rather limited because our API calls do rather heavy processing
compared to the amount of data being transmitted, but I believe it's
not negligible.
- 2) automated tools to generate strongly typed Python code (that's
the ProtoBuf part). The strongly typed Python code is what convinced
me (see my POC). The tooling we got for that is excellent - far
superior to dealing with JSON-encoded data, even with a schema.
- 3) a built-in "remote procedure" layer - where we get
request/response semantics optimisations (for multiple calls over
the same channel) and exception handling done for us (this is
basically what a "Remote Procedure" interface provides).
- 4) a built-in server that efficiently distributes the methods
called from multiple clients into multi-threaded execution (all
individual calls are stateless, so multi-processing can be added on
top regardless of the "transport" chosen).
BTW, if you look at my POC code, there is nothing that strongly
"binds" us to gRPC. The nice thing is that once it is implemented,
it can be swapped out very easily. The only Proto/gRPC code that
"leaks" into "generic" Airflow code is the mapping of some of the
most complex parameters to Proto - literally only a few of our types
require custom serialisation; most of the mapping is handled
automatically by the generated protobuf code. And we can easily put
the "complex" mapping in a separate package. Plus there is an "if"
statement for each of the ~30 or so methods that we will have to
turn into remotely-callable ones. We can even (as I proposed as an
option) add a little Python magic - a simple decorator to handle the
"ifs". Then the decorator's "if" can be swapped with some other
"remote call" implementation.
The actual bulk of the implementation is making sure that all the
places are covered (that's the testing harness).
On Wed, Aug 10, 2022 at 8:25 PM Andrew Godwin
<andrew.god...@astronomer.io.invalid> wrote:
Hi Jarek,
Apologies - I was not as involved as I wanted to be in the AIP-44
process, and obviously my vote is non-binding anyway - but having
done a lot of Python API development over the past decade or so, I
wanted to know why the decision was made to go with gRPC over just
plain HTTP+JSON (with a schema, of course).
The AIP covers why XMLRPC and Thrift lost out to gRPC, which I agree
with - but it does not go into the option of using a standard Python
HTTP server with JSON schema enforcement, such as FastAPI. In my
experience, the tools for load balancing, debugging, testing and
monitoring JSON/HTTP are superior and more numerous than those for
gRPC; in addition, the asynchronous support for gRPC servers is
lacking compared to their plain HTTP counterparts, and with plain
HTTP you can interact and play with the APIs in prototyping stages
without having to obtain the correct protobuf versions for the
Airflow version you're using.
I wouldn't go so far as to suggest a veto, but I do want to
see the AIP address why gRPC would win over this option.
Apologies again for the late realisation that gRPC got chosen
and was being voted on - it's been a very busy summer.
Thanks,
Andrew
On Wed, Aug 10, 2022 at 12:12 PM Jarek Potiuk
<ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
Just let me express my rather strong dissatisfaction with the way
this "last minute" concern is raised.
It is very late to come up with such a statement - the problem is
not that it comes at all, but that it comes after everyone has had a
chance to take a look and comment, including looking at the POC and
the results of checks. This was never raised, even 4 months ago when
the only choices were Thrift and gRPC.
I REALLY hope the arguments are very strong and backed by real
examples and data showing why it is a bad choice, rather than
opinions.
J.
On Wed, Aug 10, 2022 at 7:50 PM Ash Berlin-Taylor
<a...@apache.org <mailto:a...@apache.org>> wrote:
Sorry to weigh in at the last minute, but I'm wary of gRPC
over just JSON, so -1 to that specific choice. Everything
else I'm happy with.
I (or Andrew G) will follow up with more details shortly.
-ash
On Wed, Aug 10 2022 at 19:38:59 +02:00:00, Jarek Potiuk
<ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
Oh yeah :)
On Wed, Aug 10, 2022 at 7:23 PM Ping Zhang
<pin...@umich.edu <mailto:pin...@umich.edu>> wrote:
ah, good call.
I guess the email template can be updated:
Only votes from PMC members are binding, but members of
the community are encouraged to check the AIP and vote
with "(non-binding)".
+1 (binding)
Thanks,
Ping
On Wed, Aug 10, 2022 at 10:20 AM Jarek Potiuk
<ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
Thank you. And BTW, it's binding, Ping :) For AIPs, committers'
votes are binding. See
<https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#commit-policy>
:D
On Wed, Aug 10, 2022 at 7:16 PM Ping Zhang
<pin...@umich.edu <mailto:pin...@umich.edu>> wrote:
+1 (non-binding)
Thanks,
Ping
On Thu, Aug 4, 2022 at 1:42 AM Jarek Potiuk
<ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
Hey everyone,
I would like to cast a vote for "AIP-44 - Airflow
Internal API".
The AIP-44 is here:
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API>
Discussion thread:
<https://lists.apache.org/thread/nsmo339m618kjzsdkwq83z8omrt08zh3>
The voting will last for 5 days (until 9th of August
2022 11:00 CEST), and until at least 3 binding votes
have been cast.
Please vote accordingly:
[ ] + 1 approve
[ ] + 0 no opinion
[ ] - 1 disapprove with the reason
Only votes from PMC members are binding, but members
of the community are encouraged to check the AIP and
vote with "(non-binding)".
----
Just a summary of where we are:
It's been long in the making, but I think it might be a great step
forward towards our long-term multi-tenancy goal. I believe the
proposal I have is quite well thought out and was discussed a lot in
the past:
* we have a working POC implementation used for performance testing:
<https://github.com/apache/airflow/pull/25094>
* it is based on the industry-standard open-source gRPC (which is
already our dependency), which fits the RPC "model" we need better
than our public REST API does
* it has moderate performance impact and rather good maintainability
features (very little impact on regular development effort)
* it is fully backwards compatible - the new DB isolation will be an
optional feature
* it has a solid plan for full test coverage in our CI
* it has backing and plans for more extensive testing in a "real"
environment, with Google Composer team support
* it allows for further extensions as part of AIP-1 (I am planning
to re-establish the sig-multitenancy effort for follow-up AIPs once
this one is well in progress).
J.
--
Eugene