Re: [DISCUSS] AIP-1 and Airflow multi-tenancy

Daniel Imberman Wed, 14 Apr 2021 11:15:03 -0700

Thank you Ian,
I’ve invited everyone on this thread to the meeting with that Zoom link. Anyone 
else who wants to join can add the calendar event here:
https://calendar.google.com/event?action=TEMPLATE&tmeid=Mm4zN2Q3MnFwNnBqbW9hMmNocXMyNzJpdHYgZGFuaWVsQGFzdHJvbm9tZXIuaW8&tmsrc=daniel%40astronomer.io


On Wed, Apr 14, 2021 at 11:05 AM, Ian Buss <ianjb...@gmail.com> wrote:
If this works for everyone, here's a zoom link for Thursday 8AM PST: 
https://cloudera.zoom.us/j/99928254235?pwd=VTFlQk4vQjQ5Z2JzUDM3ZWZKKy9MQT09 
Happy to move or use an alternate method as needed.
On Wed, Apr 14, 2021 at 6:58 PM Daniel Imberman <daniel.imber...@gmail.com> wrote:
Thursday works for me!

On Wed, Apr 14, 2021 at 10:05 AM, Ian Buss <ianjb...@gmail.com> wrote:
Hi all,

I actually can’t do Wednesday next week as I’m moving house :) Any chance we 
could do Thursday or Friday at the same time?

Cheers

Ian
On 14 Apr 2021, 17:49 +0100, Kaxil Naik <kaxiln...@gmail.com> wrote:
Just a few comments here:
Currently, and at least for the foreseeable future, Airflow workers will need 
access to the DAG files, so workers cannot run using only the serialized DAGs.
Also, serialized DAGs do not even contain all the information needed to run them. 
Currently the serialization happens in the parsing process in the scheduler, 
which could in future be split out into a separate "parsing" component, but that 
won't solve the "isolation" problem you are trying to solve. The only way it 
can currently be solved is pickling, and we have firmly decided against using 
pickling for DAGs.
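As a rough illustration of the point about serialized DAGs (a sketch using only the standard library, not Airflow's actual serializer; `serialize_task` and the field names are made up for this example), a JSON-style representation can record the name of a task's callable but not its code:

```python
import json


def extract(**context):
    """User-defined task logic that lives in the DAG file."""
    return "rows"


# Hypothetical in-memory task definition; the field names are illustrative only.
task = {"task_id": "extract", "retries": 2, "python_callable": extract}


def serialize_task(task_def):
    """Keep JSON-friendly fields and replace any callable with its dotted name."""
    out = {}
    for key, value in task_def.items():
        if callable(value):
            out[key] = f"{value.__module__}.{value.__qualname__}"
        else:
            out[key] = value
    return json.dumps(out, sort_keys=True)


serialized = serialize_task(task)
restored = json.loads(serialized)

# After the round trip, "python_callable" is only a string. The function body
# is gone, so whoever executes the task still has to import the DAG file.
print(type(restored["python_callable"]).__name__)
```

Under these assumptions, the only way to ship the function body itself would be something like pickling, which is the option ruled out above.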
The ideas in statements (2) and (3) would help solve the isolation problem in 
(1) and can be done with some work now.
Happy to talk about it in more detail here or on a call; the time Daniel 
suggested works for me.
Regards, Kaxil
On Wed, Apr 14, 2021 at 5:35 PM Daniel Imberman <daniel.imber...@gmail.com> wrote:
How about Wednesday, April 21 at 8:00AM PST?

On Wed, Apr 14, 2021 at 9:33 AM, Xinbin Huang <bin.huan...@gmail.com> wrote:
I am available any day.
On Wed, Apr 14, 2021, 9:32 AM Daniel Imberman <daniel.imber...@gmail.com> wrote:
Hi everyone!
Would people be available around 8AM/9AM PST at some point next week? I’m in PST 
and Ian is UTC+1, so it would be great to find a time that accommodates everyone.
Daniel
On Wed, Apr 14, 2021 at 6:26 AM, Ryan Hatter <ryannhat...@gmail.com> wrote:
I’d also like to be added please :)

On Apr 13, 2021, at 21:27, Xinbin Huang <bin.huan...@gmail.com> wrote:

Hi Daniel & Ian,
I am also interested in the idea of a serialized representation that can be 
executed by workers directly. Can you also add me to the call?
Thanks,
Bin
On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <ianjb...@gmail.com> wrote:
Daniel,
Thanks for your warm welcome and quick response and the advice on providers! 
Will certainly check out the examples you sent.
1. An "airflow register" command definitely sounds promising, would love to 
collaborate on an AIP there so let's set something up.
2. We use KubernetesExecutor exclusively as well. We've noticed significant 
additional load on the metadata DB as we scale up task pods so I've also 
thought about an API-based approach. Such an API could also open up the 
possibility of per-task security tokens which are injected by the scheduler, 
which should improve the security of such a system. Food for thought at least.
I will start putting some of these thoughts down on paper in a sharable format.
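The per-task token idea in point 2 could look roughly like the following. This is only a sketch under stated assumptions: the token layout, the shared `SECRET`, and the `issue_token` / `verify_token` helpers are hypothetical and are not part of Airflow.

```python
import base64
import hashlib
import hmac
import json
import time

# Assumption: a key known to the scheduler and the API server, never to tasks.
SECRET = b"scheduler-only-secret"


def issue_token(dag_id, task_id, run_id, ttl_seconds=3600, now=None):
    """Scheduler side: sign a claim that names exactly one task instance."""
    now = time.time() if now is None else now
    claim = {
        "dag_id": dag_id,
        "task_id": task_id,
        "run_id": run_id,
        "exp": now + ttl_seconds,
    }
    payload = base64.urlsafe_b64encode(json.dumps(claim, sort_keys=True).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_token(token, now=None):
    """API side: return the claim if signature and expiry are valid, else None."""
    now = time.time() if now is None else now
    try:
        payload, signature = token.rsplit(".", 1)
    except ValueError:
        return None  # no "." separator, so not a token we issued
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None  # payload was altered or signed with another key
    claim = json.loads(base64.urlsafe_b64decode(payload.encode()))
    if claim["exp"] < now:
        return None  # token has expired
    return claim
```

The scheduler would inject the token into the task pod at launch, and the API would accept status updates only for the task instance named in the verified claim.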
Ian
On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman <daniel.imber...@gmail.com> wrote:
Hi Ian,

Firstly, welcome to the Airflow community :). I'm glad to hear you've had a 
positive experience so far. It's great to hear that you want to contribute 
back, and I think that multi-tenancy/DAG isolation is a pretty fantastic 
project for the community as a whole (a lot of these are things we want 
but are limited by the hours in a day).

1. I've personally been kicking around some ideas lately about an "airflow 
register" command that would write the DAG into the metadata DB in a way that 
could be "gettable" by the workers via the API. This work is very early. I'd 
love to get some help on it. Perhaps we can set up a zoom chat to discuss 
drafting an AIP?

2. Limiting worker access to the DB is not only good security practice; it also 
opens up the door to a lot of valuable features. This feature would be 
especially close to my heart as it would make the KubernetesExecutor 
significantly more efficient. It should be possible to set up a system where 
the workers only ever speak to an API server and never need to touch the DB.

3. This is not something I personally have insight into, but I think it sounds 
like a good idea.
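To make point 2 concrete, here is a minimal in-process sketch of workers that only ever talk to an API layer. Every name in it (`MetadataStore`, `TaskApi`, `Worker`, the state names) is hypothetical and chosen for illustration; a real design would sit behind HTTP with proper authentication.

```python
class MetadataStore:
    """Stand-in for the metadata DB; only the API server holds a reference."""

    def __init__(self):
        self.task_states = {}


ALLOWED_STATES = {"running", "success", "failed"}


class TaskApi:
    """Hypothetical API server: the single component allowed to touch the store."""

    def __init__(self, store):
        self._store = store

    def set_state(self, caller, target, state):
        # A worker may only update the task instance it was launched for.
        if caller != target:
            raise PermissionError("worker may only update its own task instance")
        if state not in ALLOWED_STATES:
            raise ValueError(f"unknown state: {state}")
        self._store.task_states[target] = state


class Worker:
    """Runs one task and reports status through the API, never the store."""

    def __init__(self, api, task_key):
        self._api = api
        self._task_key = task_key

    def run(self, fn):
        self._api.set_state(self._task_key, self._task_key, "running")
        try:
            fn()
        except Exception:
            self._api.set_state(self._task_key, self._task_key, "failed")
            return False
        self._api.set_state(self._task_key, self._task_key, "success")
        return True


store = MetadataStore()
api = TaskApi(store)
key = ("etl", "extract", "run1")
Worker(api, key).run(lambda: None)
print(store.task_states[key])
```

In this sketch the caller identity is passed in directly; in practice it would come from whatever credential the worker presents to the API.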

Finally, addressing your question about a Cloudera provider. If anything, it 
would probably give the provider _more_ legitimacy if you hosted it under the 
Cloudera GitHub org (we very purposely created the provider packages with this 
workflow in mind). There are multiple places where we can work to surface this 
provider so it is easy to find and use.

Astronomer has a pretty good sample provider here: 
https://github.com/astronomer/airflow-provider-sample
One example of it running in the wild is the Great Expectations provider here: 
https://github.com/great-expectations/airflow-provider-great-expectations
I'd also be glad to get you in contact with people who have built providers in 
the past to help you with that process.

Looking forward to seeing some of these things come to fruition!

Daniel
On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <ianjb...@gmail.com> wrote:
Hi all,
First a quick introduction: I'm an engineer with Cloudera working on our Data 
Engineering product (CDE). Airflow is working great for us so far. We've been 
looking into how we can enhance the multi-tenancy story of Apache Airflow as we 
currently deploy it. We have the following areas which we'd like (with 
community consensus) to work on and contribute back to Apache Airflow to 
enhance the isolation between tenants in a single Airflow deployment.
1. Isolating code execution and parsing of DAG files. At the moment, DAG files 
are parsed in a few locations in Airflow, including the scheduler and in tasks. 
There is already the concept of DAG serialization (and we're using that for the 
web component) but we'd be interested to see if we can sandbox the execution of 
arbitrary user code to a locked down process/container without full access to 
the metadata DB and connection secrets etc. The idea would be to parse and 
serialize the DAG in this isolated container and pass back a serialized 
representation for persistence in the DB. Has anyone explored this idea?
2. Limiting task access to the metadata DB. It would be great if we could 
remove the requirement for tasks to have full access to the metadata DB and to 
report task status in a different (but still scalable) way. Naturally, we'd 
also need to tackle access to, or injection of, connection, variable and XCom 
data for each task.
3. Finer-grained access controls on connection secrets. Right now, although 
there are nice at-rest encryption options with Fernet or Vault, IIUC any DAG 
can access any connection (and thus any secret). Since the "run as" user is 
largely defined within the DAG and its tasks, this is challenging for a 
multi-tenant environment (see caveat below).
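For point 1 above, the parse-in-isolation idea might be sketched as follows. This is an assumption-laden illustration, not Airflow's parser: the `DAG_SPEC` convention, `parse_isolated`, and `FAKE_DB_URL` are invented here, and a child process with a trimmed environment is far weaker than the locked-down container the text has in mind.

```python
import json
import os
import subprocess
import sys
import tempfile

# Code run inside the child: execute the user file, then emit only plain data.
# The DAG_SPEC variable name is a made-up convention for this sketch.
CHILD = """
import json, runpy, sys
namespace = runpy.run_path(sys.argv[1])
json.dump(namespace["DAG_SPEC"], sys.stdout)
"""


def parse_isolated(path, timeout=30):
    """Run user code in a separate interpreter that inherits no secrets."""
    # Pass through only the variables an interpreter may need to start.
    child_env = {k: os.environ[k] for k in ("PATH", "SYSTEMROOT") if k in os.environ}
    result = subprocess.run(
        [sys.executable, "-I", "-c", CHILD, path],
        env=child_env,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    # Only the serialized representation crosses the process boundary.
    # Note: a user file that prints to stdout would corrupt this channel;
    # a real design would use a dedicated pipe or file.
    return json.loads(result.stdout)


# Usage: write a throwaway "DAG file" and parse it from the parent process.
os.environ["FAKE_DB_URL"] = "secret"  # pretend secret that must not leak
with tempfile.TemporaryDirectory() as workdir:
    dag_path = os.path.join(workdir, "my_dag.py")
    with open(dag_path, "w") as handle:
        handle.write(
            "import os\n"
            "DAG_SPEC = {'dag_id': 'demo', 'tasks': ['a', 'b'],\n"
            "            'sees_secret': 'FAKE_DB_URL' in os.environ}\n"
        )
    spec = parse_isolated(dag_path)
print(spec)
```

The parent would then persist `spec` to the DB itself, so the process that executes arbitrary user code never needs DB credentials.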
Caveat: It's definitely noted that to some extent we should assume that an 
Airflow deployment is a "trusted" environment, that best practices such as 
git+PR workflows are the gold standard, and that any malicious code and 
dependencies should be identified through this process. It is also noted that 
there is a clear admin role for connection management etc.
We have some ideas informally sketched out as to how to address the above but 
would be keen to hear the community opinion on this and to see if anyone is 
keen to collaborate on designs and implementation, or to hear if anything is 
already in the works. In particular I noticed that the very first improvement 
proposal (AIP-1) addresses much of the above :). However, it seems fairly 
dormant at the moment.
One other question: we have a provider (operators and hooks) for interacting 
with Cloudera components that we'd like to contribute to the project. The 
provider FAQs indicate that new provider contributions are still welcome in the 
project in 2.x. Is that accurate?
Thanks in advance!
Ian
