Some comments from my side:

Bolke:

> Direct database access for users is a no-go. Direct queries that in any way
> manipulate the database or any state in Airflow (trigger dags, connections
> etc.) from the CLI should be prevented.


Absolutely agree in principle. But we have to discuss and accept the profound
consequences of such a change. For me this also means that those commands
cannot be run on the machine where Airflow is installed, and CLI users should
have no access to those machines - otherwise they can run an arbitrary Python
method/class/script using airflow and access the DB directly (it also means
that direct DB connections should not be allowed there). This pretty much
calls for either using the scheduler as an API server or introducing a new
"API-server" entity.

This is quite a departure from the current Airflow deployment model and I
think it should be treated as a "very breaking change" (architecturally /
deployment-wise). It could still look like a local client to the user - so
it's not a breaking change from the user's point of view.

This also means that we may not need maybe 95% of the Airflow code, and that
we should either not use the "airflow" package (airflow-cli maybe?), or do
something with the "__init__.py" of the "airflow" package. Currently we have
a very bad import model if we want to use the "airflow" package for
everything. By doing `import airflow.<whatever>` we are pulling in almost all
Airflow code, because it is either directly or indirectly imported in
airflow/__init__.py - including the DB models. If we stick to having this
__init__.py, it means that we will continue to have many seconds of delay
when starting up the CLI. I think multi-second delays are very bad.
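
Just to illustrate the scale of the problem (a rough sketch - exact timings
differ per installation, but this is what I mean by paying the import cost):

    import sys
    import time

    start = time.time()
    import airflow  # transitively pulls in configuration, settings and the DB models
    print(f"'import airflow' took {time.time() - start:.1f}s")

    # the ORM models got imported too, even though we never asked for them
    print("airflow.models" in sys.modules)

A CLI that only needs to make an HTTP call should not pay that cost at all.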

For me it really calls for another package ("airflow-cli") that should be
installable separately. Surely for local installation/development etc. both
could be installed side by side and look and behave as the current airflow
does - using localhost for communication (and we could even keep the same
backwards-compatible "airflow" binary). We could gradually migrate all the
current CLI methods to the new APIs.
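
Roughly what I have in mind for such a thin client (just a sketch - the
endpoint path, the AIRFLOW_API_URL/AIRFLOW_API_TOKEN variables and the
payload are hypothetical, not an existing Airflow API):

    import os
    import requests

    API_URL = os.environ.get("AIRFLOW_API_URL", "http://localhost:8080/api/v1")

    def trigger_dag(dag_id, conf=None):
        """Thin CLI command: no airflow import, no DB access - just HTTP."""
        response = requests.post(
            f"{API_URL}/dags/{dag_id}/dag_runs",
            json={"conf": conf or {}},
            headers={"Authorization": f"Bearer {os.environ['AIRFLOW_API_TOKEN']}"},
        )
        response.raise_for_status()
        return response.json()

Such a command imports nothing from the "airflow" package, starts instantly,
and can only do what the server-side API permits.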

> I fundamentally disagree with your position that remote working with
> Airflow should be done through the Web UI.


I agree. As I see it, all the modern tools for the cloud world have all three
modes of management: a Web UI (mostly used for reading and discovery), a CLI
(mostly used for management and sometimes for querying), and REST-ish APIs
(for whatever custom integration is needed). Look at docker, gcloud and many
others. From our survey it's clear that this is similar in our case: the UI
is used for monitoring, accessing logs etc. more frequently than for
management, and the CLI - while used less frequently in general - has strong
usage as well.

On your points:
>
> 1-3: Solved by using FAB
>

Indeed, those three points might be solved by using the FAB REST API. I am not
100% sure FAB is the best choice (although I think it is a clear contender).
What I really like about it is that we seem to have a way to use the same
mechanism for authorisation/authentication as in the UI. Using a standard
approach here seems like a good idea, especially since all the "standard"
authentication mechanisms (DB, OpenID, LDAP, OAuth, custom) are available
out-of-the-box: https://flask-appbuilder.readthedocs.io/en/latest/security.html# .
What I also like is that the FAB REST API has everything you expect from a
modern API (
https://flask-appbuilder.readthedocs.io/en/latest/rest_api.html#openapi-spec
). It implements the OpenAPI standard, has built-in Swagger integration, and
is capable not only of generating the documentation and providing a Swagger
test server for API testing, but also of automatically generating clients in
different languages (I'd love, however, to see a POC showing all those
capabilities). IMHO any proposal for an API implementation will have to have
all those properties, and FAB is hard to beat on that, especially since we
already use it.
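
For reference, a protected FAB API endpoint looks roughly like this (a
minimal sketch based on the FAB docs linked above - the resource and method
names are made up, this is not code we have in Airflow today):

    from flask_appbuilder.api import BaseApi, expose
    from flask_appbuilder.security.decorators import protect

    class DagRunsApi(BaseApi):
        resource_name = "dag_runs"  # made-up resource, for illustration only

        @expose("/<string:dag_id>", methods=["POST"])
        @protect()  # same FAB authentication/permission machinery as the UI
        def trigger(self, dag_id):
            # ... create the dag run here ...
            return self.response(200, message=f"triggered {dag_id}")

    # registered on the existing appbuilder instance:
    # appbuilder.add_api(DagRunsApi)

The OpenAPI spec and the Swagger UI for such endpoints come from FAB itself -
we would only define the resources and wire them into the permission model.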


> 4. Your main points are solved by using FAB. On B) I don’t think it is a
> good idea to have an extra layer between json_client and local_client. Both
> are pretty thin wrappers around `airflow.api.common.*`.
>

I think we should get rid of the local client ASAP and introduce only the
REST API and an API server. It's inevitable IMHO, and the sooner we introduce
it, the better. This way we could immediately collapse the layers there into
a single one.


> 5. Non existent ;-)
>
Off-by-one indeed :)


> 6. I do agree that there is some code duplication going on, but the
> intention was to keep it in place until the REST API is mature and
> `local_client` is no longer required. So mature the API, then get rid of
> local_client.
>

I think we should do it ASAP and get rid of the local client (but provide an
equivalent CLI binary that does the same as the current CLI). Once we remove
the local implementation we should switch all the CLI methods to use the API,
and eventually decouple the CLI from the main Airflow code.

> 7. Having the CLI manipulate `airflow.cfg` is fine as long as it checks if
> it has the right permissions to do so, the OS will enforce that anyway.
>
Agree.
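
To make the "checks if it has the right permissions" part concrete, something
as simple as this would do (a sketch - the config path lookup is simplified):

    import os

    config_path = os.path.expanduser("~/airflow/airflow.cfg")  # simplified lookup

    # fail fast with a clear message instead of an OS-level traceback
    if not os.access(config_path, os.W_OK):
        raise SystemExit(f"{config_path} is not writable by the current user")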


> 8. They are not in the same namespace. `airflow.api.client.*` vs
> `airflow.www.api`. If you mean by ‘package’ ‘Airflow’, then I suggest maybe
> packaging the API/Client separately or breaking up the Airflow package into
> “airflow-common”, “airflow-client”, “airflow-server”.

Same thought exactly ^^^


> 9. That’s not unexpected: the interface is documented, whilst not perfectly.
> The CLI *should* use the public API, as in eat your own dog food. The (json)
> client api resides in its own namespace which does *not* import Airflow at
> all. So interfacing clients should not use the local_client.
>

Unfortunately this is not true - due to the way our __init__.py in the
airflow package is defined, pretty much everything in airflow has an implicit
dependency on pretty much all the code - see above. Even if there are no
"explicit" imports, the startup time alone is something we cannot do anything
about.

> In my opinion you are mixing several issues. The CLI has its issues, the
> API has its issues and maybe our packaging has its issues. We should
> address them separately.
>

I think Kamil is right here in mixing the issues. They are already mixed,
unfortunately. We need to decide now what target we want to achieve (a
separate API server? Separate client? Separate binaries? Separate Python
packages? Separate PyPI packages?). Even if we do not implement it all and go
through a transition period, we need to know where we are heading.
Unfortunately we have this nasty airflow.__init__.py problem that I think
severely hampers any client/server decoupling effort we want to make. We need
to decide first what we want to do and how to decouple (in the future), and
define our roadmap for how to get there and where to break backwards
compatibility.



Kamil:


> I don't like the current remote mode in CLI for several reasons:
> 1) The default API security configuration makes all API data public
> 2) We do not provide sufficient security mechanisms to secure the API.
> 3) The current API does not allow you to limit permissions so that the
> user can only use the selected feature. If the user has access to the
> API, he can do anything.
>

I think we should not be put off by the current API implementation - it was
experimental after all. I think we have three options:
- implement a completely new architecture/deployment approach
- promote the current API to be "official" by adding the above
- implement some kind of mixture using the same technology - FAB, but with
  more of its features (OpenAPI) - and change the architecture towards more
  of a client-server separation (my preference).


> 4) An abstraction leak between json_client and local_client.
> a) The API Clients have no defined exceptions.
> b) The API Clients have no defined return types
>

I think this can all be handled by adding it. And FAB's OpenAPI + Swagger
specs are one of the more common/standard ways of defining those, giving us
documentation, generated clients in multiple languages, etc.
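
As an illustration of what "defined exceptions and return types" could look
like on the client side (hand-written here with made-up names - in practice a
generator would derive these from the OpenAPI spec):

    from dataclasses import dataclass

    class ApiError(Exception):
        """Single, documented exception type raised by the client."""
        def __init__(self, status_code, detail):
            super().__init__(detail)
            self.status_code = status_code
            self.detail = detail

    @dataclass
    class DagRun:
        """Typed return value instead of a bare dict."""
        dag_id: str
        run_id: str
        state: str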


> 6) This forces duplicate code. This is dangerous because one method
> may not be sufficiently tested and thus behave unexpectedly.
>

I agree we should not have duplicated code. We should remove local client.

> 7) Not everything should be possible via the API. e.g. Tomek Urbaszek
> is now working to add new commands that allow editing of the
> airflow.cfg file.
> PR: https://github.com/apache/airflow/pull/7130


I think this depends on the general architecture we come up with. If we
decide on a clear client/server split, the API for editing configuration will
be the ONLY way to change it, and it will only be available remotely.
I think - similarly to Bolke - that in "production" deployments even the
admins should not have access to Airflow's DB, nor to Airflow's
configuration. It should all be protected via the API's access mechanisms.


> 8) We do not have sufficient separation of the API client and API
> server. The API client and API server are in the same python package.
> This seriously threatens the interoperability of this API. It's now
> very easy to build an API that will return data in an unexpected
> format.
>

Agree - what we have now is bad and can easily be mixed up and abused.
I think we should agree on the level of separation (Python package /
PyPI package / client-server deployment) we want to have and base all our
further decisions on that.

> 9) This means that people who want to use the API will only use the
> specified client API. The same API client that is used by the CLI,
> which is very unexpected because it has severe limitations.
>

Indeed. But I think we should remove those limitations.


> I suggest that we do not develop remote mode in the current CLI, and even
> delete it in the near future.


I think quite the opposite - we should turn all our interactions into remote
ones. It might be that you want to achieve the same eventually. Maybe we are
talking about the same thing but looking at it from different perspectives,
with different sequences of steps to reach the target. We need to talk a bit
more about it to really understand whether we are in fact talking about the
same long-term goal - and then agree on the technology that will get us there
and on the way to get there.


> a) Build an API that meets the basic requirements:


Yes. But I think it can be achieved by improving and extending the features
of the current API. I think it is a matter of technology choice and of what
steps we are going to take.


> b) Build an API client. It is best if the client is generated
> automatically.
>
Yep. That's what FAB + OpenAPI + Swagger is very likely to deliver. We do
not need to do it
ourselves - this is a mostly solved problem.


> c) We should build a CLI that provides access to the API in a secure
> manner.
>
Agree. And again - this is what FAB + OpenAPI seems to deliver pretty much
out-of-the-box.
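
Concretely, the "secure CLI" flow with FAB could look roughly like this (a
sketch - it assumes the FAB security login endpoint is enabled on the server
and the default JWT scheme from the FAB docs; the /dags endpoint is just for
illustration):

    import requests

    BASE = "http://localhost:8080/api/v1"

    # 1. log in with the same credentials/backends FAB uses for the UI
    token = requests.post(
        f"{BASE}/security/login",
        json={"username": "admin", "password": "secret", "provider": "db"},
    ).json()["access_token"]

    # 2. every subsequent CLI call just sends the bearer token
    dags = requests.get(
        f"{BASE}/dags",  # hypothetical endpoint, for illustration only
        headers={"Authorization": f"Bearer {token}"},
    ).json()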


> Of course, we can try to build temporary solutions if the users need
> it, but this should be very limited and should have a warning about
> the level of stability and security. So if the user needs it, they can add
> commands that support remote mode, but the user should be aware that
> this is not a stable and secure solution.
>

I think it's already there - the "experimental" label is a clear indication
that this will likely change in the future.


> I have plans for Q1 to build a stable API that does not have the above
> issues.


I think we need that fairly fast - for Airflow 2.0. And I am pretty convinced
that if we use some existing framework we can implement it rather fast -
provided that we "stand on the shoulders of giants" and use what's already
there. APIs in Python programs are an already-solved problem. We do not need
to build our own solution for that - just decide on the architecture/split,
plug in a framework and implement the endpoints.


J.

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129
