Re: [VOTE] AIP-44 - Airflow Internal API

Eugen Kosteev Wed, 10 Aug 2022 22:49:54 -0700

+1 (non-binding)

On Thu, Aug 11, 2022 at 12:08 AM Ash Berlin-Taylor <[email protected]> wrote:


> So my concerns (in addition to the ones Andrew already pointed out) with
> gRPC: it uses threads! Threads + python always makes me nervous. Doubly so
> when you then couple that with DB access via sqlalchemy - we're heading
> down paths less well travelled if we do this.
>
> From https://github.com/grpc/grpc/blob/master/doc/fork_support.md
>
> > gRPC Python wraps gRPC core, which uses multithreading for performance,
> and hence doesn't support fork(). Historically, we didn't support forking
> in gRPC, but some users seemed to be doing fine until their code started to
> break on version 1.6. This was likely caused by the addition of background
> c-threads and a background Python thread
>
> And there's https://github.com/grpc/grpc/issues/16001 which may or may
> not be a problem for us, I'm unclear:
>
> > The main constraint that must be satisified is that your process must
> only invoke the fork() syscall *before* you create your gRPC server(s).
>
>
> But anyway, the only change I'd like to see is to make the internal API
> more pluggable.
>
> So instead of something like this:
>
>     def process_file(
>         self,
>         file_path: str,
>         callback_requests: List[CallbackRequest],
>         pickle_dags: bool = False,
>     ) -> Tuple[int, int]:
>         if self.use_grpc:
>             return self.process_file_grpc(
>                 file_path=file_path, callback_requests=callback_requests,
> pickle_dags=pickle_dags
>             )
>         return self.process_file_db(
>             file_path=file_path, callback_requests=callback_requests,
> pickle_dags=pickle_dags
>         )
>
> I'd very much like us to have it be something like this:
>
>     def process_file(
>         self,
>         file_path: str,
>         callback_requests: List[CallbackRequest],
>         pickle_dags: bool = False,
>     ) -> Tuple[int, int]:
>         if settings.DATABASE_ACCESS_ISOLATION:
>             return InternalAPIClient.dagbag_process_file(
>                 file_path=file_path, callback_requests=callback_requests,
> pickle_dags=pickle_dags
>             )
>         return self.process_file_db(
>             file_path=file_path, callback_requests=callback_requests,
> pickle_dags=pickle_dags
>         )
>
> i.e. all API access is "marshalled" (perhaps the wrong word) via a single
> pluggable (and eventually configurable/replaceable) API client.
> Additionally (though less important) `self.use_grpc` is changed to a
> property on the `airflow.settings` module (as it is a global setting, not
> an attribute/property of any single instance.)
>
> (In my mind the API client would include the from_protobuff/to_protobuff
> methods that you added to TaskInstance on your POC PR. Or the from_protbuff
> could live in the FileProcessorServiceServicer class in your example, but
> that probably doesn't scale when multiple services would take a
> TI/SimpleTI. I guess they could still also live on TaskInstance et al and
> just not be used - but that doesn't feel as clean to me is all)
>
> Thoughts? It's not too big change to encapsulate things like this I hope?
>
> Sorry again that we didn't look at the recent work on this AIP sooner.
>
> -ash
>
>
>
> On Wed, Aug 10 2022 at 13:51:02 -06:00:00, Andrew Godwin
> <[email protected]> wrote:
>
> I also wish we'd had this discussion before!
>
> I've converted several million lines of code into API-driven services over
> the past decade so I have my ways and I'm set in them, I guess :)
>
> Notice that I didn't say "use REST" - I don't think REST maps well to RPC
> style codebases, and agree the whole method thing is confusing. I just mean
> "ship JSON in a POST and receive JSON in the response".
>
> As you said before though, the line where you draw the abstraction is what
> matters more than the transport layer, and "fat endpoints" (doing
> transactions and multiple calls on the API server etc.) is, I agree, the
> way this has to go, so it's not like this is something I think is totally
> wrong, just that I've repeatedly tried gRPC on projects like this and been
> disappointed with what it actually takes to ship changes and
> prototype things quickly. Nothing beats the ability to just throw curl at a
> problem - or maybe that's just me.
>
> Anyway, I'll leave you to it - I have my own ideas in this area I'll be
> noodling on, but it's a bit of a different take and aimed more at execution
> as a whole, so I'll come back and discuss them if they're successful.
>
> Andrew
>
> On Wed, Aug 10, 2022 at 1:36 PM Jarek Potiuk <[email protected]> wrote:
>
>> 1st of all - I wish we had this discussion before :)
>>
>> > The autogenerated code also is available from OpenAPI specs, of course,
>> and the request/response semantics in the same thread are precisely what
>> make load balancing these calls a little harder, and horizontal scaling to
>> multiple threads comes with every HTTP server, but I digress - clearly
>> you've made a choice here to write an RPC layer rather than a lightweight
>> API layer, and go with the more RPC-style features, and I get that.
>>
>> OpenAPI is REST not RPC and it does not map well to RPC style calls. I
>> tried to explain in detail in the AIP (and in earlier discussions). And the
>> choice is basically made for us because of our expectation to have
>> minimal impact on the existing code (and ability to switch off the remote
>> part). We really do not want to introduce new API calls.  We want to make
>> sure that existing "DB transactions" (i.e. coarse grained calls) are
>> remotely executed. So we are not really talking about lightweight API
>> almost by definition. Indeed, Open API also maps a definition described in
>> a declarative way to python code.  but it has this non-nice part that
>> OpenAPI/REST, it is supposed to be run on some resources. We have no
>> resources to run it on - every single method we call is doing something.
>> Even from the REST/OpenAPI semantics I'd have a really hard time to decide
>> whether it should be GET, POST or PUT or DELETE. In most cases this will be
>> a combination of those 4 on several different resources. All the "nice
>> parts" of Open API (Swagger UI etc.) become next to useless if you try to
>> map such remote procedures we have, to REST calls.
>>
>>
>> > I still think it's choosing something that's more complex to maintain
>> over a simpler, more accessible option, but since I won't be the one
>> writing it, that's not really for me to worry about. I am very curious to
>> see how this evolves as the work towards multitenancy really kicks in and
>> all the API schemas need to change to add namespacing!
>>
>> The gRPC (proto) is designed from ground-up with maintainability in mind.
>> The ability to evolve the API, add new parameters etc. is built-in the
>> protobuf definition. From the user perspective it's actually easier to use
>> than OpenAPI when it comes to remote method calls, because you basically -
>> call a method.
>>
>> And also coming back to monitoring - literally every monitoring platform
>> supports gRPC to monitor. Grafana, NewRelic, CloudWatch, Datadog, you name
>> it. In our case.
>>
>> Also load-balancing:
>>
>> Regardless of the choice we talk about HTTP request/response happening
>> for 1 call. This is the boundary. Each of the calls we have will be a
>> separate transaction, separate call, not connected to any other call. The
>> server handling it will be stateless (state will be stored in a DB when
>> each call completes). I deliberately put the "boundary" of each of
>> the remotely callable methods, to be a complete DB transaction to achieve
>> it.
>>
>> So It really does not change whether we use gRPC or REST/JSON. REST/JSON
>> vs. gRPC is just the content of the message, but this is the very same HTTP
>> call, with same authentication added on top, same headers - just how the
>> message is encoded is different. The same tools for load balancing works in
>> the same way regardless if we use gRPC or REST/JSON. This is really a
>> higher layer than the one involved in load balancing.
>>
>>
>>
>>
>>
>> On Wed, Aug 10, 2022 at 9:10 PM Andrew Godwin
>> <[email protected]> wrote:
>>
>>> Well, binary representation serialization performance is worse for
>>> protobuf than for JSON in my experience (in Python specifically) - unless
>>> you mean size-on-the-wire, which I can agree with but tend to not worry
>>> about very much since it's rarely a bottleneck.
>>>
>>> The autogenerated code also is available from OpenAPI specs, of course,
>>> and the request/response semantics in the same thread are precisely what
>>> make load balancing these calls a little harder, and horizontal scaling to
>>> multiple threads comes with every HTTP server, but I digress - clearly
>>> you've made a choice here to write an RPC layer rather than a lightweight
>>> API layer, and go with the more RPC-style features, and I get that.
>>>
>>> I still think it's choosing something that's more complex to maintain
>>> over a simpler, more accessible option, but since I won't be the one
>>> writing it, that's not really for me to worry about. I am very curious to
>>> see how this evolves as the work towards multitenancy really kicks in and
>>> all the API schemas need to change to add namespacing!
>>>
>>> Andrew
>>>
>>> On Wed, Aug 10, 2022 at 12:52 PM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> Sure I can explain - the main reasons are:
>>>>
>>>> - 1) binary representation performance - impact of this is
>>>> rather limited because our API calls are doing rather heavy processing
>>>> compared to the data being transmitted. But I believe it's not negligible.
>>>> - 2) automated tools to automatically generate strongly typed Python
>>>> code (that's the ProtoBuf part). The strongly typed Python code is what
>>>> convinced me (see my POC). The tooling we got for that is excellent. Far
>>>> more superior than dealing with json-encoded data even with schema.
>>>> - 2) built-in "remote procedure" layer - where we have request/response
>>>> semantics optimisations (for multiple calls over the same chanel) and
>>>> exception handling done for us (This is basically what "Remote Procedure"
>>>> interface provide us)
>>>> - 3) built-in server that efficiently distributes the method called
>>>> from multiple client into a multi-threaded/multi-threaded execution (all
>>>> individual calls are stateless so multi-processing can be added on top
>>>> regardless from the "transport" chosen).
>>>>
>>>> BTW. If you look at my POC code, there is nothing that strongly "binds"
>>>> us to gRPC. The nice thing is that once it is implemented, it can be
>>>> swapped out very easily. The only Proto/gRPC code that "leaks" to "generic"
>>>> Airflow code is mapping of some (most complex) parameters to Proto . And
>>>> this is only for most complex cases - literally only few of our types
>>>> require custom serialisation - most of the mapping is handled automatically
>>>> by generated protobuf code. And we can easily put the "complex" mapping in
>>>> a separate package. Plus there is an "if" statement for each of the ~ 30 or
>>>> so methods that we will have to turn into remotely-callable. We can even
>>>> (as I proposed it as an option) add a little python magic and add a simple
>>>> decorator to handle the "ifs". Then the decorator "if" can be swapped with
>>>> some other "remote call" implementation.
>>>>
>>>> The actual bulk of the implementation is to make sure that all the
>>>> places are covered (that's the testing harness).
>>>>
>>>>
>>>>
>>>> On Wed, Aug 10, 2022 at 8:25 PM Andrew Godwin
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Jarek,
>>>>>
>>>>> Apologies - I was as not involved as I wanted to be in the AIP-44
>>>>> process, and obviously my vote is non-binding anyway - but having done a
>>>>> lot of Python API development over the past decade or so I wanted to know
>>>>> why the decision was made to go with gRPC over just plain HTTP+JSON (with 
>>>>> a
>>>>> schema, of course).
>>>>>
>>>>> The AIP covers why XMLRPC and Thrift lost out to gRPC, which I agree
>>>>> with - but does not go into the option of using a standard Python HTTP
>>>>> server with JSON schema enforcement, such as FastAPI. In my experience, 
>>>>> the
>>>>> tools for load balancing, debugging, testing and monitoring JSON/HTTP are
>>>>> superior and more numerous than those for gRPC, and in addition the
>>>>> asynchronous support for gRPC servers is lacking compared to their plain
>>>>> HTTP counterparts, and the fact that you can interact and play with the
>>>>> APIs in prototyping stages without having to handle obtaining correct
>>>>> protobuf versions for the Airflow version you're using.
>>>>>
>>>>> I wouldn't go so far as to suggest a veto, but I do want to see the
>>>>> AIP address why gRPC would win over this option. Apologies again for the
>>>>> late realisation that gRPC got chosen and was being voted on - it's been a
>>>>> very busy summer.
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>> On Wed, Aug 10, 2022 at 12:12 PM Jarek Potiuk <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Just let me express my rather strong dissatisfaction with the way
>>>>>> this "last minute" is raised.
>>>>>>
>>>>>> It is very late to come up with such a statement - not that it comes
>>>>>> at all, but when it comes when everyone had a chance to take a look and
>>>>>> comment, including taking a look at the POC and result of checks. This 
>>>>>> has
>>>>>> never been raised even 4 months ago where the only choices were Thrift 
>>>>>> and
>>>>>> gRPc).
>>>>>>
>>>>>> I REALLY hope the arguments are very strong and backed by real
>>>>>> examples and data why it is a bad choice rather than opinions.
>>>>>>
>>>>>> J,.
>>>>>>
>>>>>> On Wed, Aug 10, 2022 at 7:50 PM Ash Berlin-Taylor <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Sorry to weigh in at the last minute, but I'm wary of gRPC over just
>>>>>>> JSON, so -1 to that specific choice. Everything else I'm happy with.
>>>>>>>
>>>>>>> I (or Andrew G) will follow up with more details shortly.
>>>>>>>
>>>>>>> -ash
>>>>>>>
>>>>>>> On Wed, Aug 10 2022 at 19:38:59 +02:00:00, Jarek Potiuk <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>> Oh yeah :)
>>>>>>>
>>>>>>> On Wed, Aug 10, 2022 at 7:23 PM Ping Zhang <[email protected]> wrote:
>>>>>>>
>>>>>>>> ah, good call.
>>>>>>>>
>>>>>>>> I guess the email template can be updated:
>>>>>>>>
>>>>>>>> Only votes from PMC members are binding, but members of the
>>>>>>>>> community are encouraged to check the AIP and vote with 
>>>>>>>>> "(non-binding)".
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> +1 (binding)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Ping
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 10, 2022 at 10:20 AM Jarek Potiuk <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you . And BTW. It's binding Ping :). For AIP's commiter's
>>>>>>>>> votes are binding. See
>>>>>>>>> https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#commit-policy
>>>>>>>>> :D
>>>>>>>>>
>>>>>>>>> On Wed, Aug 10, 2022 at 7:16 PM Ping Zhang <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1 (non-binding)
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Ping
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 4, 2022 at 1:42 AM Jarek Potiuk <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>
>>>>>>>>>>> I would like to cast a vote for "AIP-44 - Airflow Internal API".
>>>>>>>>>>>
>>>>>>>>>>> The AIP-44 is here:
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API
>>>>>>>>>>>
>>>>>>>>>>> Discussion thread:
>>>>>>>>>>> https://lists.apache.org/thread/nsmo339m618kjzsdkwq83z8omrt08zh3
>>>>>>>>>>>
>>>>>>>>>>> The voting will last for 5 days (until 9th of August 2022 11:00
>>>>>>>>>>> CEST), and until at least 3 binding votes have been cast.
>>>>>>>>>>>
>>>>>>>>>>> Please vote accordingly:
>>>>>>>>>>>
>>>>>>>>>>> [ ] + 1 approve
>>>>>>>>>>> [ ] + 0 no opinion
>>>>>>>>>>> [ ] - 1 disapprove with the reason
>>>>>>>>>>>
>>>>>>>>>>> Only votes from PMC members are binding, but members of the
>>>>>>>>>>> community are encouraged to check the AIP and vote with
>>>>>>>>>>> "(non-binding)".
>>>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>>
>>>>>>>>>>> Just a summary of where we are:
>>>>>>>>>>>
>>>>>>>>>>> It's been long in the making, but I think it might be a great
>>>>>>>>>>> step-forward to our long-term multi-tenancy goal. I believe the 
>>>>>>>>>>> proposal I
>>>>>>>>>>> have is quite well thought out and discussed a lot in the past:
>>>>>>>>>>>
>>>>>>>>>>> * we have a working POC for implementation used for performance
>>>>>>>>>>> testing:  https://github.com/apache/airflow/pull/25094
>>>>>>>>>>> * it is based on on industry-standard open-source gRPC (which is
>>>>>>>>>>> already our dependency) which fits better the RPC "model" we need 
>>>>>>>>>>> than our
>>>>>>>>>>> public REST API.
>>>>>>>>>>> * it has moderate performance impact and rather good
>>>>>>>>>>> maintainability features (very little impact on regular development 
>>>>>>>>>>> effort)
>>>>>>>>>>> * it is fully backwards compatible - the new DB isolation will
>>>>>>>>>>> be an optional feature
>>>>>>>>>>> * has a solid plan for full test coverage in our CI
>>>>>>>>>>> * has a backing and plans for more extensive complete testing in
>>>>>>>>>>> "real" environment with Google Composer team support
>>>>>>>>>>> * allows for further extensions as part of AIP-1 (I am planning
>>>>>>>>>>> to re-establish sig-multitenancy effort for follow up AIPs once 
>>>>>>>>>>> this one is
>>>>>>>>>>> well in progress).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> J.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>

-- 
Eugene

Re: [VOTE] AIP-44 - Airflow Internal API

Reply via email to