+1 (binding) now with that change. Thank you very much.
I'm also okay with proceeding with only gRPC for now -- with this
architectural change it's much easier to replace it in the future if
we want/need to, and for others (such as service providers like
Google/Amazon/Astronomer) to experiment with custom API
implementations. I'm not able to point at any examples, as the ones I
know of personally aren't open source; they were all written for
companies' projects/products.
I am sorry that we missed it -- speaking personally, I've had _a lot_
going on at home and I've barely had time to "look around" at what
else is going on in Airflow over the last few months. I'm only now
getting the head space to actually think in detail about things other
than what is right in front of me, hence why it came now.
-ash
On Thu, Aug 11 2022 at 09:27:39 +02:00:00, Jarek Potiuk
<ja...@potiuk.com> wrote:
I made the updates in
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API>
to reflect the above. As I suspected, very few changes were needed to
make it "gRPC/JSON OpenAPI" agnostic. It does not change the "gist"
and "purpose" of the AIP, and we can easily turn it into an
implementation detail after the extended POC is complete.
On Thu, Aug 11, 2022 at 9:08 AM Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
Hey Ash, Andrew,
TL;DR: I slept on it to try to understand what just happened, and I
have a proposal.
* I am happy to extend my POC with the pure JSON/OpenAPI approach -
provided that Andrew/Ash point me to some good examples where an
RPC-style API has been implemented that way. If you could point me in
the right direction, I am a maverick kind of person and will just
find my way there.
* I propose we conclude the vote (with the added reservation that the
final choice of the transport will be based on the extended POC) as
planned.
* I am rather curious to try the things that both of you mentioned
you are concerned about (threading/multiprocessing/load balancing)
and see how they compare in both approaches - before making the
final decision and doing the first, founding PR.
More comments (personal feeling):
I really do take your comments seriously - now that they have
finally come. And I value your expert knowledge and experience. And
I personally expect the same in return. I always respond to
AIPs/discussions when I am given enough time and opportunity to
comment - in due time. This time it did not happen, and I am not
sure why. I personally feel this is a little disrespectful to my
work, but I do not hold any grudges. I am happy to put that
completely aside - now that I have your attention and can use your
experience. The goal is that we do not make any personal complaints
or escalations here; we are all here to make Airflow a better
product. Community is the most important thing, and respect for each
other's work is an important part of it. I hope this can improve in
future AIP discussions.
But going back to the merit of the discussion:
Just to let everyone know, I am not too "vested" in gRPC. I spent a
little time on implementing the POC with it, and after doing the POC
and reading people's testimonies I feel it is the right choice for
this AIP. But I made deliberate decisions so that the transport can
be swapped out if we find reasons we should.
I am reasonably happy with the way proposed by Ash (in fact I am
going to update the AIP with it in a moment). For me, how we
actually implement the "if" is an implementation detail that will be
worked out in the first PR. The way proposed is good for me, though
I would rather experiment a bit with decorators and see if we can
make it a bit nicer - but this is not a dealbreaker at all for me.
One concern I have is that we will have another (possibly needless)
abstraction layer and that we will again have to repeat all the
method signatures and parameters and keep them in sync (they will
now be repeated in 4 or 5 places). But again - this is something
that can likely be done in the first "founding" PR, where we will
iterate and work out the best balance between duplication and
flexibility/maintainability. And we can always update the AIP with
this implementation detail later - very much like it happened with a
number of AIPs before, including AIP-39, AIP-40 and AIP-42 - all of
which went through similar rounds of updates and clarifications as
implementation progressed. It's hard to work out all the details "on
paper" or even in a POC. Also, I would REALLY love to tap into the
experience of people like the both of you, but it seems that the
only way to get some good and serious feedback is to call a vote (or
make a PR with the intention of merging it).
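To make the decorator idea mentioned above concrete, here is a
minimal sketch of how a single decorator could route each method
either to the local DB implementation or to a remote Internal API
client, keyed off one global setting. All names here (`settings`,
`InternalAPIClient`, `internal_api_call`, `process_file`) are
illustrative stand-ins, not the actual AIP-44 implementation:

```python
import functools


class settings:
    # Stand-in for the real airflow.settings module; the flag name
    # follows Ash's suggestion but this whole class is illustrative.
    DATABASE_ACCESS_ISOLATION = False


class InternalAPIClient:
    # Hypothetical remote client; a real implementation would issue
    # the gRPC (or JSON/HTTP) call here.
    @staticmethod
    def process_file(file_path):
        return f"remote:{file_path}"


def internal_api_call(remote_method):
    """Route a DB-backed method to its remote equivalent when DB
    isolation mode is enabled, otherwise run it locally."""
    def decorator(local_method):
        @functools.wraps(local_method)
        def wrapper(*args, **kwargs):
            if settings.DATABASE_ACCESS_ISOLATION:
                return remote_method(*args, **kwargs)
            return local_method(*args, **kwargs)
        return wrapper
    return decorator


@internal_api_call(InternalAPIClient.process_file)
def process_file(file_path):
    # Local, direct-DB-access implementation.
    return f"local:{file_path}"
```

With something like this, the "if" lives in exactly one place, and
swapping the transport means changing only the client passed to the
decorator - which is the swappability property discussed above.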
But I can even go further than that. I think that, independently of
voting on whether the whole AIP-44 is a good or bad idea, there is
no doubt we need it, and the "general scope" and approach seem to
have already reached general consensus - so let's just continue and
complete the vote. I am happy to continue running the POC using the
OpenAPI spec and gRPC in parallel and see how they compare. I am
always eager to learn and try other things, and if you have valid
concerns I am happy to address them by trying things out. I would
personally like to compare both from the development "friction"
point of view and performance, as well as do some tests to address
the operational (process/thread/load balancing) concerns you both
have, see how they can be solved in each, and compare. I think there
is nothing like "show me the code" and an actual working POC.
And I am super happy to continue with the POC and extend it with a
pure JSON/OpenAPI-based proposal after voting completes, and make
the final decision during the first founding PR. And we can even
arrange extra votes or lazy consensus before the first PR lands -
after seeing all the "ins/outs".
The founding PR is still quite a bit away - I do not want to make
any commits before we branch off 2.4 - and I don't even want to take
too much of your time for it, other than discussion, raising
concerns and commenting on my findings. I am happy to do all the
ground work here.
That's all I ask for. Just treating the work I do seriously.
So, Ash, Andrew:
Can you please point me to some examples where an RPC-style API like
ours has been implemented with OpenAPI/JSON? I am curious to learn
from those and turn them into a POC.
And I propose - let's just continue the vote as planned. We already
have 3 binding votes, and more +1s than -1s, and the time has
already passed, but I am happy to let it run till the end of day
tomorrow to see if my proposal above lets us conclude the vote with
more consent from all the people involved.
J.
On Thu, Aug 11, 2022 at 7:49 AM Eugen Kosteev <eu...@kosteev.com
<mailto:eu...@kosteev.com>> wrote:
+1 (non-binding)
On Thu, Aug 11, 2022 at 12:08 AM Ash Berlin-Taylor <a...@apache.org
<mailto:a...@apache.org>> wrote:
So my concerns with gRPC (in addition to the ones Andrew already
pointed out): it uses threads! Threads + Python always make me
nervous. Doubly so when you then couple that with DB access via
SQLAlchemy - we're heading down paths less well travelled if we do
this.
From <https://github.com/grpc/grpc/blob/master/doc/fork_support.md>
> gRPC Python wraps gRPC core, which uses multithreading for
performance, and hence doesn't support fork(). Historically, we
didn't support forking in gRPC, but some users seemed to be doing
fine until their code started to break on version 1.6. This was
likely caused by the addition of background c-threads and a
background Python thread
And there's <https://github.com/grpc/grpc/issues/16001> which may
or may not be a problem for us, I'm unclear:
> The main constraint that must be satisfied is that your process
must only invoke the fork() syscall /before/ you create your gRPC
server(s).
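If that constraint does apply to us, one way to honour it is to fork
worker processes first and only create the server inside each child.
The sketch below shows just that ordering; `create_server()` is a
hypothetical placeholder for the real `grpc.server(...)` call (grpc
itself is deliberately not imported here):

```python
import multiprocessing

# Use an explicit "fork" context so the ordering being demonstrated
# (fork first, server second) is actually what happens.
ctx = multiprocessing.get_context("fork")


def create_server():
    # Placeholder for building and starting a gRPC server; in real
    # code this is where grpc.server(...) would be called.
    return "server-created-after-fork"


def worker(result_queue):
    # We are now in the child: the fork() has already happened, so
    # creating the server here satisfies the documented constraint.
    result_queue.put(create_server())


def main(num_workers=2):
    queue = ctx.Queue()
    workers = [
        ctx.Process(target=worker, args=(queue,))
        for _ in range(num_workers)
    ]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    return [queue.get() for _ in range(num_workers)]
```

Whether this pattern fits the DAG processor's actual process model is
exactly the kind of thing the extended POC would need to verify.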
But anyway, the only change I'd like to see is to make the
internal API more pluggable.
So instead of something like this:
def process_file(
    self,
    file_path: str,
    callback_requests: List[CallbackRequest],
    pickle_dags: bool = False,
) -> Tuple[int, int]:
    if self.use_grpc:
        return self.process_file_grpc(
            file_path=file_path,
            callback_requests=callback_requests,
            pickle_dags=pickle_dags,
        )
    return self.process_file_db(
        file_path=file_path,
        callback_requests=callback_requests,
        pickle_dags=pickle_dags,
    )
I'd very much like us to have it be something like this:
def process_file(
    self,
    file_path: str,
    callback_requests: List[CallbackRequest],
    pickle_dags: bool = False,
) -> Tuple[int, int]:
    if settings.DATABASE_ACCESS_ISOLATION:
        return InternalAPIClient.dagbag_process_file(
            file_path=file_path,
            callback_requests=callback_requests,
            pickle_dags=pickle_dags,
        )
    return self.process_file_db(
        file_path=file_path,
        callback_requests=callback_requests,
        pickle_dags=pickle_dags,
    )
i.e. all API access is "marshalled" (perhaps the wrong word) via a
single pluggable (and eventually configurable/replaceable) API
client. Additionally (though less important), `self.use_grpc` is
changed to a property on the `airflow.settings` module (as it is a
global setting, not an attribute/property of any single instance).
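As an aside, Python supports exactly this kind of module-level
"property" via PEP 562: a module-level `__getattr__` is consulted for
any attribute not found in the module's namespace, so the flag can be
computed on every read rather than stored on an instance. The sketch
below demonstrates the mechanism with a throwaway module; the
environment variable name is made up for the example:

```python
import os
import types


def _settings_getattr(name):
    if name == "DATABASE_ACCESS_ISOLATION":
        # Hypothetical: derive the flag from configuration each time
        # it is read, instead of caching it anywhere.
        return os.environ.get("DB_ISOLATION_DEMO", "False") == "True"
    raise AttributeError(name)


# Build a throwaway module to demonstrate; in Airflow the __getattr__
# hook would simply live at the top level of airflow/settings.py.
settings = types.ModuleType("settings_demo")
settings.__getattr__ = _settings_getattr
```

Accessing `settings.DATABASE_ACCESS_ISOLATION` then reflects the
current configuration at call time, which is the behaviour a global
setting (as opposed to an instance attribute) should have.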
(In my mind the API client would include the
from_protobuff/to_protobuff methods that you added to TaskInstance
in your POC PR. Or the from_protobuff could live in the
FileProcessorServiceServicer class in your example, but that
probably doesn't scale when multiple services would take a
TI/SimpleTI. I guess they could still also live on TaskInstance et
al. and just not be used - but that doesn't feel as clean to me is
all.)
Thoughts? It's not too big a change to encapsulate things like this,
I hope?
Sorry again that we didn't look at the recent work on this AIP
sooner.
-ash
On Wed, Aug 10 2022 at 13:51:02 -06:00:00, Andrew Godwin
<andrew.god...@astronomer.io.INVALID> wrote:
I also wish we'd had this discussion before!
I've converted several million lines of code into API-driven
services over the past decade so I have my ways and I'm set in
them, I guess :)
Notice that I didn't say "use REST" - I don't think REST maps
well to RPC style codebases, and agree the whole method thing is
confusing. I just mean "ship JSON in a POST and receive JSON in
the response".
As you said before though, the line where you draw the
abstraction is what matters more than the transport layer, and
"fat endpoints" (doing transactions and multiple calls on the API
server etc.) is, I agree, the way this has to go, so it's not
like this is something I think is totally wrong, just that I've
repeatedly tried gRPC on projects like this and been disappointed
with what it actually takes to ship changes and prototype things
quickly. Nothing beats the ability to just throw curl at a
problem - or maybe that's just me.
Anyway, I'll leave you to it - I have my own ideas in this area
I'll be noodling on, but it's a bit of a different take and aimed
more at execution as a whole, so I'll come back and discuss them
if they're successful.
Andrew
On Wed, Aug 10, 2022 at 1:36 PM Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
1st of all - I wish we had this discussion before :)
> The autogenerated code also is available from OpenAPI specs,
of course, and the request/response semantics in the same thread
are precisely what make load balancing these calls a little
harder, and horizontal scaling to multiple threads comes with
every HTTP server, but I digress - clearly you've made a choice
here to write an RPC layer rather than a lightweight API layer,
and go with the more RPC-style features, and I get that.
OpenAPI is REST, not RPC, and it does not map well to RPC-style
calls. I tried to explain this in detail in the AIP (and in earlier
discussions). And the choice is basically made for us because of our
expectation to have minimal impact on the existing code (and the
ability to switch off the remote part). We really do not want to
introduce new API calls. We want to make sure that existing "DB
transactions" (i.e. coarse-grained calls) are remotely executed. So
we are not really talking about a lightweight API, almost by
definition. Indeed, OpenAPI also maps a declaratively described
definition to Python code, but it has the awkward property that
OpenAPI/REST is supposed to operate on resources. We have no
resources to run it on - every single method we call is doing
something. Even within REST/OpenAPI semantics I'd have a really hard
time deciding whether a call should be GET, POST, PUT or DELETE. In
most cases it would be a combination of those four on several
different resources. All the "nice parts" of OpenAPI (Swagger UI
etc.) become next to useless if you try to map remote procedures
like ours to REST calls.
> I still think it's choosing something that's more complex to
maintain over a simpler, more accessible option, but since I
won't be the one writing it, that's not really for me to worry
about. I am very curious to see how this evolves as the work
towards multitenancy really kicks in and all the API schemas
need to change to add namespacing!
The gRPC (proto) is designed from the ground up with maintainability
in mind. The ability to evolve the API, add new parameters etc. is
built into the protobuf definition. From the user perspective it's
actually easier to use than OpenAPI when it comes to remote method
calls, because you basically - call a method.
And also, coming back to monitoring - literally every monitoring
platform supports monitoring gRPC: Grafana, NewRelic, CloudWatch,
Datadog, you name it.
Also, load balancing:
Regardless of the choice, we are talking about an HTTP
request/response happening for one call. This is the boundary. Each
of the calls we have will be a separate transaction, a separate
call, not connected to any other call. The server handling it will
be stateless (state will be stored in the DB when each call
completes). I deliberately put the "boundary" of each of the
remotely callable methods at a complete DB transaction to achieve
this.
So it really does not change whether we use gRPC or REST/JSON.
REST/JSON vs. gRPC is just the content of the message - it is the
very same HTTP call, with the same authentication added on top and
the same headers; only how the message is encoded differs. The same
tools for load balancing work in the same way regardless of whether
we use gRPC or REST/JSON. This is really a higher layer than the one
involved in load balancing.
On Wed, Aug 10, 2022 at 9:10 PM Andrew Godwin
<andrew.god...@astronomer.io.invalid> wrote:
Well, binary representation serialization performance is worse
for protobuf than for JSON in my experience (in Python
specifically) - unless you mean size-on-the-wire, which I can
agree with but tend to not worry about very much since it's
rarely a bottleneck.
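Claims in either direction here are easy to check with a
micro-benchmark. The sketch below shows only the JSON half using the
stdlib (a protobuf comparison would require generated message
classes, so it is omitted); the payload shape is purely illustrative,
loosely modelled on task-instance metadata:

```python
import json
import timeit

# Illustrative payload; not a real Airflow message.
payload = {
    "task_id": "example_task",
    "dag_id": "example_dag",
    "try_number": 3,
    "state": "running",
    "note": "x" * 256,
    "indexes": list(range(100)),
}

encoded = json.dumps(payload)
encode_time = timeit.timeit(lambda: json.dumps(payload), number=1000)
decode_time = timeit.timeit(lambda: json.loads(encoded), number=1000)
print(f"json encode: {encode_time:.4f}s, "
      f"decode: {decode_time:.4f}s (1000 rounds)")
```

Running the equivalent loop over protobuf-generated classes on the
same payload would settle the serialization-speed question for the
message shapes AIP-44 actually uses.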
The autogenerated code also is available from OpenAPI specs, of
course, and the request/response semantics in the same thread
are precisely what make load balancing these calls a little
harder, and horizontal scaling to multiple threads comes with
every HTTP server, but I digress - clearly you've made a choice
here to write an RPC layer rather than a lightweight API layer,
and go with the more RPC-style features, and I get that.
I still think it's choosing something that's more complex to
maintain over a simpler, more accessible option, but since I
won't be the one writing it, that's not really for me to worry
about. I am very curious to see how this evolves as the work
towards multitenancy really kicks in and all the API schemas
need to change to add namespacing!
Andrew
On Wed, Aug 10, 2022 at 12:52 PM Jarek Potiuk <ja...@potiuk.com
<mailto:ja...@potiuk.com>> wrote:
Sure, I can explain - the main reasons are:
- 1) binary representation performance - the impact of this is
rather limited because our API calls do rather heavy processing
compared to the amount of data being transmitted, but I believe it's
not negligible.
- 2) automated tools to generate strongly typed Python code (that's
the ProtoBuf part). The strongly typed Python code is what convinced
me (see my POC). The tooling we got for that is excellent - far
superior to dealing with JSON-encoded data, even with a schema.
- 3) a built-in "remote procedure" layer - where we get
request/response semantics optimisations (for multiple calls over
the same channel) and exception handling done for us (this is
basically what a "Remote Procedure" interface provides).
- 4) a built-in server that efficiently distributes the methods
called from multiple clients into multi-threaded execution (all
individual calls are stateless, so multi-processing can be added on
top regardless of the "transport" chosen).
BTW, if you look at my POC code, there is nothing that strongly
"binds" us to gRPC. The nice thing is that once it is implemented,
it can be swapped out very easily. The only Proto/gRPC code that
"leaks" into "generic" Airflow code is the mapping of some of the
most complex parameters to Proto - literally only a few of our types
require custom serialisation; most of the mapping is handled
automatically by the generated protobuf code. And we can easily put
the "complex" mapping in a separate package. Plus there is an "if"
statement for each of the ~30 or so methods that we will have to
turn into remotely-callable ones. We can even (as I proposed as an
option) add a little Python magic - a simple decorator to handle the
"ifs". Then the decorator's "if" can be swapped with some other
"remote call" implementation.
The actual bulk of the implementation is making sure that all the
places are covered (that's the testing harness).
On Wed, Aug 10, 2022 at 8:25 PM Andrew Godwin
<andrew.god...@astronomer.io.invalid> wrote:
Hi Jarek,
Apologies - I was not as involved as I wanted to be in the AIP-44
process, and obviously my vote is non-binding anyway - but having
done a lot of Python API development over the past decade or so, I
wanted to know why the decision was made to go with gRPC over just
plain HTTP+JSON (with a schema, of course).
The AIP covers why XMLRPC and Thrift lost out to gRPC, which I agree
with - but it does not go into the option of using a standard Python
HTTP server with JSON schema enforcement, such as FastAPI. In my
experience, the tools for load balancing, debugging, testing and
monitoring JSON/HTTP are superior and more numerous than those for
gRPC; in addition, the asynchronous support for gRPC servers is
lacking compared to their plain HTTP counterparts, and with plain
HTTP you can interact and play with the APIs in prototyping stages
without having to obtain the correct protobuf versions for the
Airflow version you're using.
I wouldn't go so far as to suggest a veto, but I do want to
see the AIP address why gRPC would win over this option.
Apologies again for the late realisation that gRPC got chosen
and was being voted on - it's been a very busy summer.
Thanks,
Andrew
On Wed, Aug 10, 2022 at 12:12 PM Jarek Potiuk
<ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
Just let me express my rather strong dissatisfaction with the way
this "last minute" concern is raised.
It is very late to come up with such a statement - the problem is
not that it comes at all, but that it comes after everyone has had a
chance to take a look and comment, including looking at the POC and
the results of checks. This was never raised, even 4 months ago when
the only choices were Thrift and gRPC.
I REALLY hope the arguments are very strong and backed by real
examples and data showing why it is a bad choice, rather than
opinions.
J.
On Wed, Aug 10, 2022 at 7:50 PM Ash Berlin-Taylor
<a...@apache.org <mailto:a...@apache.org>> wrote:
Sorry to weigh in at the last minute, but I'm wary of gRPC
over just JSON, so -1 to that specific choice. Everything
else I'm happy with.
I (or Andrew G) will follow up with more details shortly.
-ash
On Wed, Aug 10 2022 at 19:38:59 +02:00:00, Jarek Potiuk
<ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
Oh yeah :)
On Wed, Aug 10, 2022 at 7:23 PM Ping Zhang
<pin...@umich.edu <mailto:pin...@umich.edu>> wrote:
ah, good call.
I guess the email template can be updated:
Only votes from PMC members are binding, but members of
the community are encouraged to check the AIP and vote
with "(non-binding)".
+1 (binding)
Thanks,
Ping
On Wed, Aug 10, 2022 at 10:20 AM Jarek Potiuk
<ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
Thank you. And BTW, it's binding, Ping :) For AIPs, committers'
votes are binding. See
<https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#commit-policy>
:D
On Wed, Aug 10, 2022 at 7:16 PM Ping Zhang
<pin...@umich.edu <mailto:pin...@umich.edu>> wrote:
+1 (non-binding)
Thanks,
Ping
On Thu, Aug 4, 2022 at 1:42 AM Jarek Potiuk
<ja...@potiuk.com <mailto:ja...@potiuk.com>> wrote:
Hey everyone,
I would like to cast a vote for "AIP-44 - Airflow
Internal API".
The AIP-44 is here:
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API>
Discussion thread:
<https://lists.apache.org/thread/nsmo339m618kjzsdkwq83z8omrt08zh3>
The voting will last for 5 days (until 9th of August
2022 11:00 CEST), and until at least 3 binding votes
have been cast.
Please vote accordingly:
[ ] + 1 approve
[ ] + 0 no opinion
[ ] - 1 disapprove with the reason
Only votes from PMC members are binding, but members
of the community are encouraged to check the AIP and
vote with "(non-binding)".
----
Just a summary of where we are:
It's been long in the making, but I think it might be a great step
forward towards our long-term multi-tenancy goal. I believe the
proposal I have is quite well thought out and was discussed a lot in
the past:
* we have a working POC implementation used for performance testing:
<https://github.com/apache/airflow/pull/25094>
* it is based on the industry-standard open-source gRPC (which is
already our dependency), which fits the RPC "model" we need better
than our public REST API does
* it has moderate performance impact and rather good maintainability
features (very little impact on regular development effort)
* it is fully backwards compatible - the new DB isolation will be an
optional feature
* it has a solid plan for full test coverage in our CI
* it has backing and plans for more extensive testing in a "real"
environment, with Google Composer team support
* it allows for further extensions as part of AIP-1 (I am planning
to re-establish the sig-multitenancy effort for follow-up AIPs once
this one is well in progress).
J.
--
Eugene