+1 (non-binding) On Thu, Aug 11, 2022 at 12:08 AM Ash Berlin-Taylor <a...@apache.org> wrote:
> So my concerns (in addition to the ones Andrew already pointed out) with > gRPC: it uses threads! Threads + python always makes me nervous. Doubly so > when you then couple that with DB access via sqlalchemy - we're heading > down paths less well travelled if we do this. > > From https://github.com/grpc/grpc/blob/master/doc/fork_support.md > > > gRPC Python wraps gRPC core, which uses multithreading for performance, > and hence doesn't support fork(). Historically, we didn't support forking > in gRPC, but some users seemed to be doing fine until their code started to > break on version 1.6. This was likely caused by the addition of background > c-threads and a background Python thread > > And there's https://github.com/grpc/grpc/issues/16001 which may or may > not be a problem for us, I'm unclear: > > > The main constraint that must be satisified is that your process must > only invoke the fork() syscall *before* you create your gRPC server(s). > > > But anyway, the only change I'd like to see is to make the internal API > more pluggable. > > So instead of something like this: > > def process_file( > self, > file_path: str, > callback_requests: List[CallbackRequest], > pickle_dags: bool = False, > ) -> Tuple[int, int]: > if self.use_grpc: > return self.process_file_grpc( > file_path=file_path, callback_requests=callback_requests, > pickle_dags=pickle_dags > ) > return self.process_file_db( > file_path=file_path, callback_requests=callback_requests, > pickle_dags=pickle_dags > ) > > I'd very much like us to have it be something like this: > > def process_file( > self, > file_path: str, > callback_requests: List[CallbackRequest], > pickle_dags: bool = False, > ) -> Tuple[int, int]: > if settings.DATABASE_ACCESS_ISOLATION: > return InternalAPIClient.dagbag_process_file( > file_path=file_path, callback_requests=callback_requests, > pickle_dags=pickle_dags > ) > return self.process_file_db( > file_path=file_path, callback_requests=callback_requests, > pickle_dags=pickle_dags > ) > > i.e. all API access is "marshalled" (perhaps the wrong word) via a single > pluggable (and eventually configurable/replaceable) API client. > Additionally (though less important) `self.use_grpc` is changed to a > property on the `airflow.settings` module (as it is a global setting, not > an attribute/property of any single instance.) > > (In my mind the API client would include the from_protobuff/to_protobuff > methods that you added to TaskInstance on your POC PR. Or the from_protbuff > could live in the FileProcessorServiceServicer class in your example, but > that probably doesn't scale when multiple services would take a > TI/SimpleTI. I guess they could still also live on TaskInstance et al and > just not be used - but that doesn't feel as clean to me is all) > > Thoughts? It's not too big change to encapsulate things like this I hope? > > Sorry again that we didn't look at the recent work on this AIP sooner. > > -ash > > > > On Wed, Aug 10 2022 at 13:51:02 -06:00:00, Andrew Godwin > <andrew.god...@astronomer.io.INVALID> wrote: > > I also wish we'd had this discussion before! > > I've converted several million lines of code into API-driven services over > the past decade so I have my ways and I'm set in them, I guess :) > > Notice that I didn't say "use REST" - I don't think REST maps well to RPC > style codebases, and agree the whole method thing is confusing. I just mean > "ship JSON in a POST and receive JSON in the response". > > As you said before though, the line where you draw the abstraction is what > matters more than the transport layer, and "fat endpoints" (doing > transactions and multiple calls on the API server etc.) is, I agree, the > way this has to go, so it's not like this is something I think is totally > wrong, just that I've repeatedly tried gRPC on projects like this and been > disappointed with what it actually takes to ship changes and > prototype things quickly. Nothing beats the ability to just throw curl at a > problem - or maybe that's just me. > > Anyway, I'll leave you to it - I have my own ideas in this area I'll be > noodling on, but it's a bit of a different take and aimed more at execution > as a whole, so I'll come back and discuss them if they're successful. > > Andrew > > On Wed, Aug 10, 2022 at 1:36 PM Jarek Potiuk <ja...@potiuk.com> wrote: > >> 1st of all - I wish we had this discussion before :) >> >> > The autogenerated code also is available from OpenAPI specs, of course, >> and the request/response semantics in the same thread are precisely what >> make load balancing these calls a little harder, and horizontal scaling to >> multiple threads comes with every HTTP server, but I digress - clearly >> you've made a choice here to write an RPC layer rather than a lightweight >> API layer, and go with the more RPC-style features, and I get that. >> >> OpenAPI is REST not RPC and it does not map well to RPC style calls. I >> tried to explain in detail in the AIP (and in earlier discussions). And the >> choice is basically made for us because of our expectation to have >> minimal impact on the existing code (and ability to switch off the remote >> part). We really do not want to introduce new API calls. We want to make >> sure that existing "DB transactions" (i.e. coarse grained calls) are >> remotely executed. So we are not really talking about lightweight API >> almost by definition. Indeed, Open API also maps a definition described in >> a declarative way to python code. but it has this non-nice part that >> OpenAPI/REST, it is supposed to be run on some resources. We have no >> resources to run it on - every single method we call is doing something. >> Even from the REST/OpenAPI semantics I'd have a really hard time to decide >> whether it should be GET, POST or PUT or DELETE. In most cases this will be >> a combination of those 4 on several different resources. All the "nice >> parts" of Open API (Swagger UI etc.) become next to useless if you try to >> map such remote procedures we have, to REST calls. >> >> >> > I still think it's choosing something that's more complex to maintain >> over a simpler, more accessible option, but since I won't be the one >> writing it, that's not really for me to worry about. I am very curious to >> see how this evolves as the work towards multitenancy really kicks in and >> all the API schemas need to change to add namespacing! >> >> The gRPC (proto) is designed from ground-up with maintainability in mind. >> The ability to evolve the API, add new parameters etc. is built-in the >> protobuf definition. From the user perspective it's actually easier to use >> than OpenAPI when it comes to remote method calls, because you basically - >> call a method. >> >> And also coming back to monitoring - literally every monitoring platform >> supports gRPC to monitor. Grafana, NewRelic, CloudWatch, Datadog, you name >> it. In our case. >> >> Also load-balancing: >> >> Regardless of the choice we talk about HTTP request/response happening >> for 1 call. This is the boundary. Each of the calls we have will be a >> separate transaction, separate call, not connected to any other call. The >> server handling it will be stateless (state will be stored in a DB when >> each call completes). I deliberately put the "boundary" of each of >> the remotely callable methods, to be a complete DB transaction to achieve >> it. >> >> So It really does not change whether we use gRPC or REST/JSON. REST/JSON >> vs. gRPC is just the content of the message, but this is the very same HTTP >> call, with same authentication added on top, same headers - just how the >> message is encoded is different. The same tools for load balancing works in >> the same way regardless if we use gRPC or REST/JSON. This is really a >> higher layer than the one involved in load balancing. >> >> >> >> >> >> On Wed, Aug 10, 2022 at 9:10 PM Andrew Godwin >> <andrew.god...@astronomer.io.invalid> wrote: >> >>> Well, binary representation serialization performance is worse for >>> protobuf than for JSON in my experience (in Python specifically) - unless >>> you mean size-on-the-wire, which I can agree with but tend to not worry >>> about very much since it's rarely a bottleneck. >>> >>> The autogenerated code also is available from OpenAPI specs, of course, >>> and the request/response semantics in the same thread are precisely what >>> make load balancing these calls a little harder, and horizontal scaling to >>> multiple threads comes with every HTTP server, but I digress - clearly >>> you've made a choice here to write an RPC layer rather than a lightweight >>> API layer, and go with the more RPC-style features, and I get that. >>> >>> I still think it's choosing something that's more complex to maintain >>> over a simpler, more accessible option, but since I won't be the one >>> writing it, that's not really for me to worry about. I am very curious to >>> see how this evolves as the work towards multitenancy really kicks in and >>> all the API schemas need to change to add namespacing! >>> >>> Andrew >>> >>> On Wed, Aug 10, 2022 at 12:52 PM Jarek Potiuk <ja...@potiuk.com> wrote: >>> >>>> Sure I can explain - the main reasons are: >>>> >>>> - 1) binary representation performance - impact of this is >>>> rather limited because our API calls are doing rather heavy processing >>>> compared to the data being transmitted. But I believe it's not negligible. >>>> - 2) automated tools to automatically generate strongly typed Python >>>> code (that's the ProtoBuf part). The strongly typed Python code is what >>>> convinced me (see my POC). The tooling we got for that is excellent. Far >>>> more superior than dealing with json-encoded data even with schema. >>>> - 2) built-in "remote procedure" layer - where we have request/response >>>> semantics optimisations (for multiple calls over the same chanel) and >>>> exception handling done for us (This is basically what "Remote Procedure" >>>> interface provide us) >>>> - 3) built-in server that efficiently distributes the method called >>>> from multiple client into a multi-threaded/multi-threaded execution (all >>>> individual calls are stateless so multi-processing can be added on top >>>> regardless from the "transport" chosen). >>>> >>>> BTW. If you look at my POC code, there is nothing that strongly "binds" >>>> us to gRPC. The nice thing is that once it is implemented, it can be >>>> swapped out very easily. The only Proto/gRPC code that "leaks" to "generic" >>>> Airflow code is mapping of some (most complex) parameters to Proto . And >>>> this is only for most complex cases - literally only few of our types >>>> require custom serialisation - most of the mapping is handled automatically >>>> by generated protobuf code. And we can easily put the "complex" mapping in >>>> a separate package. Plus there is an "if" statement for each of the ~ 30 or >>>> so methods that we will have to turn into remotely-callable. We can even >>>> (as I proposed it as an option) add a little python magic and add a simple >>>> decorator to handle the "ifs". Then the decorator "if" can be swapped with >>>> some other "remote call" implementation. >>>> >>>> The actual bulk of the implementation is to make sure that all the >>>> places are covered (that's the testing harness). >>>> >>>> >>>> >>>> On Wed, Aug 10, 2022 at 8:25 PM Andrew Godwin >>>> <andrew.god...@astronomer.io.invalid> wrote: >>>> >>>>> Hi Jarek, >>>>> >>>>> Apologies - I was as not involved as I wanted to be in the AIP-44 >>>>> process, and obviously my vote is non-binding anyway - but having done a >>>>> lot of Python API development over the past decade or so I wanted to know >>>>> why the decision was made to go with gRPC over just plain HTTP+JSON (with >>>>> a >>>>> schema, of course). >>>>> >>>>> The AIP covers why XMLRPC and Thrift lost out to gRPC, which I agree >>>>> with - but does not go into the option of using a standard Python HTTP >>>>> server with JSON schema enforcement, such as FastAPI. In my experience, >>>>> the >>>>> tools for load balancing, debugging, testing and monitoring JSON/HTTP are >>>>> superior and more numerous than those for gRPC, and in addition the >>>>> asynchronous support for gRPC servers is lacking compared to their plain >>>>> HTTP counterparts, and the fact that you can interact and play with the >>>>> APIs in prototyping stages without having to handle obtaining correct >>>>> protobuf versions for the Airflow version you're using. >>>>> >>>>> I wouldn't go so far as to suggest a veto, but I do want to see the >>>>> AIP address why gRPC would win over this option. Apologies again for the >>>>> late realisation that gRPC got chosen and was being voted on - it's been a >>>>> very busy summer. >>>>> >>>>> Thanks, >>>>> Andrew >>>>> >>>>> On Wed, Aug 10, 2022 at 12:12 PM Jarek Potiuk <ja...@potiuk.com> >>>>> wrote: >>>>> >>>>>> Just let me express my rather strong dissatisfaction with the way >>>>>> this "last minute" is raised. >>>>>> >>>>>> It is very late to come up with such a statement - not that it comes >>>>>> at all, but when it comes when everyone had a chance to take a look and >>>>>> comment, including taking a look at the POC and result of checks. This >>>>>> has >>>>>> never been raised even 4 months ago where the only choices were Thrift >>>>>> and >>>>>> gRPc). >>>>>> >>>>>> I REALLY hope the arguments are very strong and backed by real >>>>>> examples and data why it is a bad choice rather than opinions. >>>>>> >>>>>> J,. >>>>>> >>>>>> On Wed, Aug 10, 2022 at 7:50 PM Ash Berlin-Taylor <a...@apache.org> >>>>>> wrote: >>>>>> >>>>>>> Sorry to weigh in at the last minute, but I'm wary of gRPC over just >>>>>>> JSON, so -1 to that specific choice. Everything else I'm happy with. >>>>>>> >>>>>>> I (or Andrew G) will follow up with more details shortly. >>>>>>> >>>>>>> -ash >>>>>>> >>>>>>> On Wed, Aug 10 2022 at 19:38:59 +02:00:00, Jarek Potiuk < >>>>>>> ja...@potiuk.com> wrote: >>>>>>> >>>>>>> Oh yeah :) >>>>>>> >>>>>>> On Wed, Aug 10, 2022 at 7:23 PM Ping Zhang <pin...@umich.edu> wrote: >>>>>>> >>>>>>>> ah, good call. >>>>>>>> >>>>>>>> I guess the email template can be updated: >>>>>>>> >>>>>>>> Only votes from PMC members are binding, but members of the >>>>>>>>> community are encouraged to check the AIP and vote with >>>>>>>>> "(non-binding)". >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> +1 (binding) >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Ping >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Aug 10, 2022 at 10:20 AM Jarek Potiuk <ja...@potiuk.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Thank you . And BTW. It's binding Ping :). For AIP's commiter's >>>>>>>>> votes are binding. See >>>>>>>>> https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#commit-policy >>>>>>>>> :D >>>>>>>>> >>>>>>>>> On Wed, Aug 10, 2022 at 7:16 PM Ping Zhang <pin...@umich.edu> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> +1 (non-binding) >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Ping >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Aug 4, 2022 at 1:42 AM Jarek Potiuk <ja...@potiuk.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hey everyone, >>>>>>>>>>> >>>>>>>>>>> I would like to cast a vote for "AIP-44 - Airflow Internal API". >>>>>>>>>>> >>>>>>>>>>> The AIP-44 is here: >>>>>>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API >>>>>>>>>>> >>>>>>>>>>> Discussion thread: >>>>>>>>>>> https://lists.apache.org/thread/nsmo339m618kjzsdkwq83z8omrt08zh3 >>>>>>>>>>> >>>>>>>>>>> The voting will last for 5 days (until 9th of August 2022 11:00 >>>>>>>>>>> CEST), and until at least 3 binding votes have been cast. >>>>>>>>>>> >>>>>>>>>>> Please vote accordingly: >>>>>>>>>>> >>>>>>>>>>> [ ] + 1 approve >>>>>>>>>>> [ ] + 0 no opinion >>>>>>>>>>> [ ] - 1 disapprove with the reason >>>>>>>>>>> >>>>>>>>>>> Only votes from PMC members are binding, but members of the >>>>>>>>>>> community are encouraged to check the AIP and vote with >>>>>>>>>>> "(non-binding)". >>>>>>>>>>> >>>>>>>>>>> ---- >>>>>>>>>>> >>>>>>>>>>> Just a summary of where we are: >>>>>>>>>>> >>>>>>>>>>> It's been long in the making, but I think it might be a great >>>>>>>>>>> step-forward to our long-term multi-tenancy goal. I believe the >>>>>>>>>>> proposal I >>>>>>>>>>> have is quite well thought out and discussed a lot in the past: >>>>>>>>>>> >>>>>>>>>>> * we have a working POC for implementation used for performance >>>>>>>>>>> testing: https://github.com/apache/airflow/pull/25094 >>>>>>>>>>> * it is based on on industry-standard open-source gRPC (which is >>>>>>>>>>> already our dependency) which fits better the RPC "model" we need >>>>>>>>>>> than our >>>>>>>>>>> public REST API. >>>>>>>>>>> * it has moderate performance impact and rather good >>>>>>>>>>> maintainability features (very little impact on regular development >>>>>>>>>>> effort) >>>>>>>>>>> * it is fully backwards compatible - the new DB isolation will >>>>>>>>>>> be an optional feature >>>>>>>>>>> * has a solid plan for full test coverage in our CI >>>>>>>>>>> * has a backing and plans for more extensive complete testing in >>>>>>>>>>> "real" environment with Google Composer team support >>>>>>>>>>> * allows for further extensions as part of AIP-1 (I am planning >>>>>>>>>>> to re-establish sig-multitenancy effort for follow up AIPs once >>>>>>>>>>> this one is >>>>>>>>>>> well in progress). >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> J. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- Eugene