So my concerns (in addition to the ones Andrew already pointed out)
with gRPC: it uses threads! Threads + python always makes me nervous.
Doubly so when you then couple that with DB access via sqlalchemy -
we're heading down paths less well travelled if we do this.
From <https://github.com/grpc/grpc/blob/master/doc/fork_support.md>
> gRPC Python wraps gRPC core, which uses multithreading for
performance, and hence doesn't support fork(). Historically, we didn't
support forking in gRPC, but some users seemed to be doing fine until
their code started to break on version 1.6. This was likely caused by
the addition of background c-threads and a background Python thread
And there's <https://github.com/grpc/grpc/issues/16001> which may or
may not be a problem for us, I'm unclear:
> The main constraint that must be satisfied is that your process
must only invoke the fork() syscall /before/ you create your gRPC
server(s).
But anyway, the only change I'd like to see is to make the internal API
more pluggable.
So instead of something like this:
def process_file(
    self,
    file_path: str,
    callback_requests: List[CallbackRequest],
    pickle_dags: bool = False,
) -> Tuple[int, int]:
    if self.use_grpc:
        return self.process_file_grpc(
            file_path=file_path,
            callback_requests=callback_requests,
            pickle_dags=pickle_dags,
        )
    return self.process_file_db(
        file_path=file_path,
        callback_requests=callback_requests,
        pickle_dags=pickle_dags,
    )
I'd very much like us to have it be something like this:
def process_file(
    self,
    file_path: str,
    callback_requests: List[CallbackRequest],
    pickle_dags: bool = False,
) -> Tuple[int, int]:
    if settings.DATABASE_ACCESS_ISOLATION:
        return InternalAPIClient.dagbag_process_file(
            file_path=file_path,
            callback_requests=callback_requests,
            pickle_dags=pickle_dags,
        )
    return self.process_file_db(
        file_path=file_path,
        callback_requests=callback_requests,
        pickle_dags=pickle_dags,
    )
i.e. all API access is "marshalled" (perhaps the wrong word) via a
single pluggable (and eventually configurable/replaceable) API client.
Additionally (though less important), `self.use_grpc` should be changed to a
property on the `airflow.settings` module (as it is a global setting, not an
attribute/property of any single instance).
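To make this concrete, here is a minimal sketch of what I mean - the option
name, module paths and client method below are purely illustrative, nothing
here exists yet:

# airflow/settings.py - sketch; "database_access_isolation" is a hypothetical option
from airflow.configuration import conf

DATABASE_ACCESS_ISOLATION = conf.getboolean(
    "core", "database_access_isolation", fallback=False
)


# airflow/api_internal/client.py - sketch
from typing import List, Tuple


class InternalAPIClient:
    """Single pluggable entry point for every "remote DB access" call.

    The transport (gRPC, JSON/HTTP, ...) is an implementation detail of this
    class and never leaks into the calling code.
    """

    @classmethod
    def dagbag_process_file(
        cls,
        file_path: str,
        callback_requests: List["CallbackRequest"],  # string to keep the sketch import-free
        pickle_dags: bool = False,
    ) -> Tuple[int, int]:
        raise NotImplementedError("transport-specific implementation plugs in here")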
(In my mind the API client would include the from_protobuff/to_protobuff
methods that you added to TaskInstance in your POC PR. Or the from_protobuff
could live in the FileProcessorServiceServicer class in your example, but
that probably doesn't scale when multiple services would take a TI/SimpleTI.
I guess they could still also live on TaskInstance et al. and just not be
used - but that doesn't feel as clean to me, is all.)
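Roughly this, with every name below being made up (the point is only where
the conversion code lives, not what it is called):

# airflow/api_internal/serialization.py - sketch; hypothetical helpers that
# would hold the to_protobuff/from_protobuff logic from the POC PR, keeping
# TaskInstance itself free of any protobuf imports.
def taskinstance_to_proto(ti):
    ...


def taskinstance_from_proto(message):
    ...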
Thoughts? It's not too big a change to encapsulate things like this, I hope?
Sorry again that we didn't look at the recent work on this AIP sooner.
-ash
On Wed, Aug 10 2022 at 13:51:02 -06:00:00, Andrew Godwin
<andrew.god...@astronomer.io.INVALID> wrote:
I also wish we'd had this discussion before!
I've converted several million lines of code into API-driven services
over the past decade so I have my ways and I'm set in them, I guess :)
Notice that I didn't say "use REST" - I don't think REST maps well to
RPC style codebases, and agree the whole method thing is confusing. I
just mean "ship JSON in a POST and receive JSON in the response".
As you said before though, the line where you draw the abstraction is
what matters more than the transport layer, and "fat endpoints"
(doing transactions and multiple calls on the API server etc.) is, I
agree, the way this has to go, so it's not like this is something I
think is totally wrong, just that I've repeatedly tried gRPC on
projects like this and been disappointed with what it actually takes
to ship changes and prototype things quickly. Nothing beats the
ability to just throw curl at a problem - or maybe that's just me.
Anyway, I'll leave you to it - I have my own ideas in this area I'll
be noodling on, but it's a bit of a different take and aimed more at
execution as a whole, so I'll come back and discuss them if they're
successful.
Andrew
On Wed, Aug 10, 2022 at 1:36 PM Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
1st of all - I wish we had this discussion before :)
> The autogenerated code also is available from OpenAPI specs, of
course, and the request/response semantics in the same thread are
precisely what make load balancing these calls a little harder, and
horizontal scaling to multiple threads comes with every HTTP server,
but I digress - clearly you've made a choice here to write an RPC
layer rather than a lightweight API layer, and go with the more
RPC-style features, and I get that.
OpenAPI is REST not RPC and it does not map well to RPC style calls.
I tried to explain in detail in the AIP (and in earlier
discussions). And the choice is basically made for us because of our
expectation to have minimal impact on the existing code (and ability
to switch off the remote part). We really do not want to introduce
new API calls. We want to make sure that existing "DB transactions"
(i.e. coarse grained calls) are remotely executed. So we are not
really talking about a lightweight API, almost by definition. Indeed,
OpenAPI also maps a declarative definition to Python code, but it has the
awkward property that OpenAPI/REST is supposed to operate on resources. We
have no resources to operate on - every single method we call is doing
something. Even within REST/OpenAPI semantics I'd have a really hard time
deciding whether a call should be GET, POST, PUT or DELETE; in most cases it
would be a combination of those four across several different resources. All
the "nice parts" of OpenAPI (Swagger UI etc.) become next to useless if you
try to map the kind of remote procedures we have onto REST calls.
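To make the "no resources" point concrete, here is the shape of a typical
coarse-grained call - a made-up example, not actual Airflow code - that
reads, updates and deletes rows in one transaction, so no single
verb/resource pair describes it:

# Hypothetical coarse-grained internal call: one complete DB transaction.
from airflow.models import DagRun
from sqlalchemy.orm import Session


def expire_stale_runs(session: Session, dag_id: str, limit: int) -> int:
    """Reads dag runs (GET-ish), updates some (PUT-ish) and deletes others
    (DELETE-ish), all in one transaction - which REST verb would this be?"""
    runs = session.query(DagRun).filter(DagRun.dag_id == dag_id).limit(limit).all()
    expired = 0
    for run in runs:
        if run.state == "running":
            run.state = "failed"      # update
            expired += 1
        else:
            session.delete(run)       # delete
    session.commit()
    return expired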
> I still think it's choosing something that's more complex to
maintain over a simpler, more accessible option, but since I won't
be the one writing it, that's not really for me to worry about. I am
very curious to see how this evolves as the work towards
multitenancy really kicks in and all the API schemas need to change
to add namespacing!
gRPC (proto) is designed from the ground up with maintainability in mind.
The ability to evolve the API, add new parameters etc. is built into the
protobuf definition. From the user perspective it's actually easier to use
than OpenAPI when it comes to remote method calls, because you basically -
call a method.
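For illustration, the call site with the generated code looks roughly like
this (the module, stub and method names are placeholders - they depend on
the final .proto, not on anything in the POC):

import grpc

# Would be generated from the .proto; placeholder names.
from airflow.api_internal import internal_api_pb2, internal_api_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")
stub = internal_api_pb2_grpc.InternalApiStub(channel)

# Strongly typed request message - a misspelled field fails immediately.
request = internal_api_pb2.ProcessFileRequest(
    file_path="/dags/example.py", pickle_dags=False
)
response = stub.ProcessFile(request)  # reads like a local method call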
And also, coming back to monitoring - literally every monitoring platform
supports monitoring gRPC: Grafana, New Relic, CloudWatch, Datadog, you name
it.
Also load-balancing:
Regardless of the choice, we are talking about one HTTP request/response per
call. This is the boundary. Each of the calls we have will be a separate
transaction, a separate call, not connected to any other call. The server
handling it will be stateless (state will be stored in the DB when each call
completes). I deliberately put the "boundary" of each of the remotely
callable methods at a complete DB transaction to achieve this.
So it really does not change whether we use gRPC or REST/JSON. REST/JSON vs.
gRPC only changes the content of the message - it is the very same HTTP
call, with the same authentication added on top and the same headers; only
how the message is encoded differs. The same tools for load balancing work
in the same way regardless of whether we use gRPC or REST/JSON. This is
really a higher layer than the one involved in load balancing.
On Wed, Aug 10, 2022 at 9:10 PM Andrew Godwin
<andrew.god...@astronomer.io.invalid> wrote:
Well, binary representation serialization performance is worse for
protobuf than for JSON in my experience (in Python specifically) -
unless you mean size-on-the-wire, which I can agree with but tend
to not worry about very much since it's rarely a bottleneck.
The autogenerated code also is available from OpenAPI specs, of
course, and the request/response semantics in the same thread are
precisely what make load balancing these calls a little harder, and
horizontal scaling to multiple threads comes with every HTTP
server, but I digress - clearly you've made a choice here to write
an RPC layer rather than a lightweight API layer, and go with the
more RPC-style features, and I get that.
I still think it's choosing something that's more complex to
maintain over a simpler, more accessible option, but since I won't
be the one writing it, that's not really for me to worry about. I
am very curious to see how this evolves as the work towards
multitenancy really kicks in and all the API schemas need to change
to add namespacing!
Andrew
On Wed, Aug 10, 2022 at 12:52 PM Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
Sure I can explain - the main reasons are:
- 1) binary representation performance - the impact of this is rather
limited because our API calls do rather heavy processing compared to the
data being transmitted, but I believe it's not negligible.
- 2) automated tools to generate strongly typed Python code (that's the
ProtoBuf part). The strongly typed Python code is what convinced me (see my
POC). The tooling we got for that is excellent - far superior to dealing
with JSON-encoded data, even with a schema.
- 3) built-in "remote procedure" layer - where we get request/response
semantics, optimisations (for multiple calls over the same channel) and
exception handling done for us (this is basically what a "Remote Procedure"
interface provides us).
- 4) built-in server that efficiently distributes the methods called from
multiple clients onto multi-threaded execution (all individual calls are
stateless, so multi-processing can be added on top regardless of the
"transport" chosen) - see the sketch after this list.
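A minimal sketch of point 4 - the server plumbing gRPC gives us (servicer
and module names are placeholders for whatever the .proto ends up
generating):

from concurrent import futures

import grpc

# Would be generated from the .proto; placeholder names.
from airflow.api_internal import internal_api_pb2_grpc


class InternalApiServicer(internal_api_pb2_grpc.InternalApiServicer):
    def ProcessFile(self, request, context):
        # Each call is stateless: run the full DB transaction, commit, return.
        ...


server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
internal_api_pb2_grpc.add_InternalApiServicer_to_server(InternalApiServicer(), server)
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()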
BTW. If you look at my POC code, there is nothing that strongly "binds" us
to gRPC. The nice thing is that once it is implemented, it can be swapped
out very easily. The only Proto/gRPC code that "leaks" into "generic"
Airflow code is the mapping of some (the most complex) parameters to Proto.
And this is only for the most complex cases - literally only a few of our
types require custom serialisation; most of the mapping is handled
automatically by the generated protobuf code. And we can easily put the
"complex" mapping in a separate package. Plus there is an "if" statement for
each of the ~30 or so methods that we will have to make remotely callable.
We can even (as I proposed as an option) add a little Python magic - a
simple decorator - to handle the "ifs". Then the decorator's "if" can be
swapped with some other "remote call" implementation.
The actual bulk of the implementation is to make sure that all the
places are covered (that's the testing harness).
On Wed, Aug 10, 2022 at 8:25 PM Andrew Godwin
<andrew.god...@astronomer.io.invalid> wrote:
Hi Jarek,
Apologies - I was as not involved as I wanted to be in the AIP-44
process, and obviously my vote is non-binding anyway - but having
done a lot of Python API development over the past decade or so I
wanted to know why the decision was made to go with gRPC over
just plain HTTP+JSON (with a schema, of course).
The AIP covers why XMLRPC and Thrift lost out to gRPC, which I
agree with - but does not go into the option of using a standard
Python HTTP server with JSON schema enforcement, such as FastAPI.
In my experience, the tools for load balancing, debugging, testing and
monitoring JSON/HTTP are superior and more numerous than those for gRPC; in
addition, the asynchronous support for gRPC servers is lacking compared to
their plain HTTP counterparts, and with JSON/HTTP you can interact and play
with the APIs in the prototyping stages without having to obtain the correct
protobuf versions for the Airflow version you're using.
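For comparison, the JSON/HTTP flavour I have in mind is roughly this
(endpoint path and payload fields are illustrative only):

# FastAPI sketch: schema enforcement comes from the pydantic model, and you
# can poke at the endpoint with curl or the bundled Swagger UI while prototyping.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ProcessFileRequest(BaseModel):
    file_path: str
    callback_requests: List[dict] = []
    pickle_dags: bool = False


@app.post("/internal/process_file")
def process_file(request: ProcessFileRequest) -> dict:
    # ...call into the existing DB-backed implementation here...
    return {"dags_found": 0, "import_errors": 0}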
I wouldn't go so far as to suggest a veto, but I do want to see
the AIP address why gRPC would win over this option. Apologies
again for the late realisation that gRPC got chosen and was being
voted on - it's been a very busy summer.
Thanks,
Andrew
On Wed, Aug 10, 2022 at 12:12 PM Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
Just let me express my rather strong dissatisfaction with the way this
"last minute" objection is raised.
It is very late to come up with such a statement - the problem is not that
it comes at all, but when it comes: after everyone has had a chance to take
a look and comment, including taking a look at the POC and the results of
the checks. This was never raised, even 4 months ago when the only choices
were Thrift and gRPC.
I REALLY hope the arguments are very strong and backed by real examples and
data showing why it is a bad choice, rather than opinions.
J.
On Wed, Aug 10, 2022 at 7:50 PM Ash Berlin-Taylor
<a...@apache.org <mailto:a...@apache.org>> wrote:
Sorry to weigh in at the last minute, but I'm wary of gRPC over
just JSON, so -1 to that specific choice. Everything else I'm
happy with.
I (or Andrew G) will follow up with more details shortly.
-ash
On Wed, Aug 10 2022 at 19:38:59 +02:00:00, Jarek Potiuk
<ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
Oh yeah :)
On Wed, Aug 10, 2022 at 7:23 PM Ping Zhang <pin...@umich.edu
<mailto:pin...@umich.edu>> wrote:
ah, good call.
I guess the email template can be updated:
Only votes from PMC members are binding, but members of the
community are encouraged to check the AIP and vote with
"(non-binding)".
+1 (binding)
Thanks,
Ping
On Wed, Aug 10, 2022 at 10:20 AM Jarek Potiuk
<ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
Thank you. And BTW, it's binding, Ping :) For AIPs, committers' votes are
binding. See
<https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#commit-policy>
:D
On Wed, Aug 10, 2022 at 7:16 PM Ping Zhang <pin...@umich.edu
<mailto:pin...@umich.edu>> wrote:
+1 (non-binding)
Thanks,
Ping
On Thu, Aug 4, 2022 at 1:42 AM Jarek Potiuk
<ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
Hey everyone,
I would like to cast a vote for "AIP-44 - Airflow Internal
API".
The AIP-44 is here:
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API>
Discussion thread:
<https://lists.apache.org/thread/nsmo339m618kjzsdkwq83z8omrt08zh3>
The voting will last for 5 days (until 9th of August 2022
11:00 CEST), and until at least 3 binding votes have been
cast.
Please vote accordingly:
[ ] + 1 approve
[ ] + 0 no opinion
[ ] - 1 disapprove with the reason
Only votes from PMC members are binding, but members of
the community are encouraged to check the AIP and vote
with "(non-binding)".
----
Just a summary of where we are:
It's been long in the making, but I think it might be a great step forward
toward our long-term multi-tenancy goal. I believe the proposal I have is
quite well thought out and has been discussed a lot in the past:
* we have a working POC for implementation used for
performance testing:
<https://github.com/apache/airflow/pull/25094>
* it is based on the industry-standard, open-source gRPC (which is already
our dependency), which fits the RPC "model" we need better than our public
REST API does.
* it has moderate performance impact and rather good
maintainability features (very little impact on regular
development effort)
* it is fully backwards compatible - the new DB isolation
will be an optional feature
* has a solid plan for full test coverage in our CI
* has backing and plans for more extensive, complete testing in a "real"
environment with Google Composer team support
* allows for further extensions as part of AIP-1 (I am planning to
re-establish the sig-multitenancy effort for follow-up AIPs once this one is
well in progress).
J.