Re: [DISCUSS] AIP-1 and Airflow multi-tenancy

Ryan Hatter Wed, 14 Apr 2021 06:26:22 -0700

I’d also like to be added please :)


> On Apr 13, 2021, at 21:27, Xinbin Huang <bin.huan...@gmail.com> wrote:
> 
> 
> Hi Daniel & Ian, 
> 
> I am also interested in the idea of a serialized representation that can 
> be executed by workers directly. Can you also add me to the call?
> 
> Thanks
> Bin
> 
>> On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <ianjb...@gmail.com> wrote:
>> Daniel,
>> 
>> Thanks for the warm welcome, the quick response, and the advice on providers! 
>> Will certainly check out the examples you sent.
>> 
>> 1. An "airflow register" command definitely sounds promising. I would love to 
>> collaborate on an AIP there, so let's set something up.
>> 2. We use KubernetesExecutor exclusively as well. We've noticed significant 
>> additional load on the metadata DB as we scale up task pods, so I've also 
>> thought about an API-based approach. Such an API could also open up the 
>> possibility of per-task security tokens injected by the scheduler, which 
>> should improve the security of such a system. Food for thought at least. I 
>> will start putting some of these thoughts down on paper in a shareable 
>> format.
>> 
>> Ian
>> 
>>> On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman <daniel.imber...@gmail.com> 
>>> wrote:
>>> Hi Ian,
>>> 
>>> Firstly, welcome to the Airflow community :). I'm glad to hear you've had a 
>>> positive experience so far. It's great to hear that you want to contribute 
>>> back, and I think that multi-tenancy/DAG isolation is a pretty fantastic 
>>> project for the community as a whole (a lot of these are things we 
>>> want but are limited by the hours in a day).
>>> 
>>> 1. I've personally been kicking around some ideas lately about an "airflow 
>>> register" command that would write the DAG into the metadata DB in a way 
>>> that could be "gettable" by the workers via the API. This work is very 
>>> early. I'd love to get some help on it. Perhaps we can set up a Zoom chat 
>>> to discuss drafting an AIP?
>>> 
>>> 2. Limiting worker access to the DB is not only good security practice; it 
>>> also opens up the door to a lot of valuable features. This feature would be 
>>> especially close to my heart as it would make the KubernetesExecutor 
>>> significantly more efficient. It should be possible to set up a system 
>>> where the workers only ever speak to an API server and never need to touch 
>>> the DB.
>>> 
>>> 3. This is not something I personally have insight into, but I think it 
>>> sounds like a good idea. 
>>> 
>>> Finally, to address your question about a Cloudera provider: if anything, 
>>> it would probably give the provider _more_ legitimacy if you hosted it 
>>> under the Cloudera GitHub org (we very purposely created the provider 
>>> packages with this workflow in mind). There are multiple places where we 
>>> can work to surface this provider so it is easy to find and use.
>>> 
>>> Astronomer has a pretty good sample provider here. One example of it 
>>> running in the wild is the Great Expectations provider here. I'd also be 
>>> glad to get you in contact with people who have built providers in the past 
>>> to help you with that process.
>>> 
>>> Looking forward to seeing some of these things come to fruition!
>>> 
>>> Daniel
>>> 
>>> On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <ianjb...@gmail.com> wrote:
>>> Hi all,
>>> 
>>> First a quick introduction: I'm an engineer with Cloudera working on our 
>>> Data Engineering product (CDE). Airflow is working great for us so far. 
>>> We've been looking into how we can enhance the multi-tenancy story of 
>>> Apache Airflow as we currently deploy it. We have the following areas which 
>>> we'd like (with community consensus) to work on and contribute back to 
>>> Apache Airflow to enhance the isolation between tenants in a single Airflow 
>>> deployment.
>>> 
>>> 1. Isolating code execution and parsing of DAG files. At the moment, DAG 
>>> files are parsed in a few locations in Airflow, including the scheduler and 
>>> in tasks. There is already the concept of DAG serialization (and we're 
>>> using that for the web component) but we'd be interested to see if we can 
>>> sandbox the execution of arbitrary user code to a locked down 
>>> process/container without full access to the metadata DB and connection 
>>> secrets etc. The idea would be to parse and serialize the DAG in this 
>>> isolated container and pass back a serialized representation for 
>>> persistence in the DB. Has anyone explored this idea?
>>> 
>>> 2. Limiting task access to the metadata DB. It would be great if we could 
>>> remove the requirement for tasks to have full access to the metadata DB and 
>>> to report task status in a different (but still scalable) way. Naturally, 
>>> we'd also need to tackle access to, or injection of, connection, variable 
>>> and XCom data for each task.
>>> 
>>> 3. Finer-grained access controls on connection secrets. Right now, although 
>>> there are nice at-rest encryption options with Fernet or Vault, IIUC any 
>>> DAG can access any connection (and thus any secret). Since the "run as" 
>>> user is largely defined within the DAG and its tasks, this is challenging 
>>> for a multi-tenant environment (see caveat below).
>>> 
>>> Caveat: It's definitely noted that to some extent we should assume that an 
>>> Airflow deployment is a "trusted" environment and that best practices such 
>>> as git+PR workflows are the gold standard and that any malicious code and 
>>> dependencies should be identified through this process. It is also noted 
>>> that there is a clear admin role for connection management etc.
>>> 
>>> We have some ideas informally sketched out as to how to address the above 
>>> but would be keen to hear the community opinion on this and to see if 
>>> anyone is keen to collaborate on designs and implementation, or to hear if 
>>> anything is already in the works. In particular I noticed that the very 
>>> first improvement proposal (AIP-1) addresses much of the above :). However, 
>>> it seems fairly dormant at the moment.
>>> 
>>> One other question: we have a provider (operators and hooks) for 
>>> interacting with Cloudera components that we'd like to contribute to the 
>>> project. The provider FAQs indicate that new provider contributions are 
>>> still welcome in the project in 2.x. Is that accurate?
>>> 
>>> Thanks in advance!
>>> 
>>> Ian
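
For readers following the thread, point 1 of Ian's original message (parse the DAG file in a locked-down process and pass back only a serialized representation) can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not Airflow's actual DAG serialization: `parse_isolated`, `DAG_SPEC`, `ENV_ALLOWLIST` and `DB_PASSWORD` are hypothetical names invented for the example, and a plain dict stands in for a real DAG object. Scrubbing the environment of a child process is also not a real sandbox; the container-level isolation discussed in the thread would still be needed.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical child program: execute the DAG file and print its DAG_SPEC
# (an assumed plain-dict convention, not a real Airflow object) as JSON.
CHILD_PROGRAM = """
import json, runpy, sys
namespace = runpy.run_path(sys.argv[1])
print(json.dumps(namespace["DAG_SPEC"]))
"""

# Only these variables are forwarded to the child (assumed allowlist).
# Everything else, e.g. DB credentials, is withheld from the user code.
ENV_ALLOWLIST = ("PATH", "SYSTEMROOT")

# Stand-in for an untrusted user DAG file. It records whether it can see
# the parent's (hypothetical) DB_PASSWORD environment variable.
DAG_SOURCE = '''
import os
DAG_SPEC = {
    "dag_id": "example",
    "tasks": ["extract", "load"],
    "sees_db_password": "DB_PASSWORD" in os.environ,
}
'''


def parse_isolated(dag_file, timeout=30):
    """Run dag_file in a separate interpreter and return its serialized spec."""
    child_env = {k: os.environ[k] for k in ENV_ALLOWLIST if k in os.environ}
    result = subprocess.run(
        # -I: isolated mode (ignores PYTHON* variables and the user site dir)
        [sys.executable, "-I", "-c", CHILD_PROGRAM, str(dag_file)],
        env=child_env,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    # The last stdout line is the JSON payload; earlier lines may be
    # output printed by the user code itself.
    return json.loads(result.stdout.strip().splitlines()[-1])


def demo():
    """Parse the stand-in DAG file while the parent holds a secret."""
    os.environ["DB_PASSWORD"] = "not-for-dag-code"
    try:
        with tempfile.TemporaryDirectory() as tmp:
            dag_file = Path(tmp) / "example_dag.py"
            dag_file.write_text(DAG_SOURCE)
            return parse_isolated(dag_file)
    finally:
        os.environ.pop("DB_PASSWORD", None)


if __name__ == "__main__":
    print(demo())
```

The parent only ever handles the JSON payload, so in this sketch the component that persists the result to the DB never imports or executes the user's code itself.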
