Thanks Jarek! Will join the channel. This is still something close to my heart 
and an area where I think we could make great progress. I agree that we should 
get together and start a draft AIP to gather the various strands of this work
together.
On 7 Nov 2021, 13:19 +0000, Jarek Potiuk <ja...@potiuk.com>, wrote:
> For now I created a channel for the SIG:
> https://apache-airflow.slack.com/archives/C02M551UDA4 - anyone should
> feel free to join. In the coming weeks, once everyone involved so far
> has expressed their interest, we should set a plan for getting the
> AIP(s) drafted and discussed, and then start implementing.
>
> On Sat, Nov 6, 2021 at 11:22 PM Xinbin Huang <bin.huan...@gmail.com> wrote:
> >
> > Hi Jarek,
> >
> > The plan sounds great! And +1 to a special interest group. Please add me to 
> > the group if you do create one.
> >
> > Here is the doc ( Airflow Multi-tenancy discussion ) we used to discuss 
> > back in April. It's not a note per se, but I think it can shed some light
> > on what we talked about. Other folks may have an actual note or even a 
> > draft proposal on this topic.
> >
> > I'm excited for us to move forward with this.
> >
> > Bin
> >
> > On Fri, Nov 5, 2021 at 10:38 AM Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > Hello Ian, Everyone,
> > >
> > > I wonder if there are any notes from the meeting in April? Has there
> > > been any more work on that one from Cloudera to formalize and plan
> > > work on it?
> > >
> > > I was not able to participate, but I think it's about time to
> > > seriously start work on that, and I am super happy to take more of a
> > > lead on this project and involve all the interested parties. The ideas
> > > described in the email and discussed afterwards are, I think, super
> > > reasonable and definitely necessary to get to multi-tenancy, and I
> > > believe that there are already ideas that can be turned into reality
> > > rather soon. I had a talk today also with the Google Composer team and
> > > they are also fully on board with dedicating a lot of effort on this
> > > one (and their ideas are I think super-aligned with Cloudera's), so I
> > > think we have a critical mass and engineering power to make it happen
> > > :)
> > >
> > > I plan to put quite a lot of focus on that one over the coming months
> > > and I am happy to lead or co-lead the AIP and take a big part in
> > > implementation.
> > >
> > > Possibly we should create a special interest group around that and
> > > start drafting the AIP proposals in a smaller group of people who are
> > > interested and start planning the work. I already have some ideas
> > > where we could start gradually implementing it (of course after we
> > > prepare the AIP and get it through the community's approval process).
> > >
> > > How does it sound?
> > >
> > > J.
> > >
> > > On Wed, Apr 21, 2021 at 8:56 AM Ian Buss <ianjb...@gmail.com> wrote:
> > > >
> > > > Yes, no invite required. See you tomorrow!
> > > > On 21 Apr 2021, 07:46 +0100, Sumit Maheshwari <msu...@apache.org>, 
> > > > wrote:
> > > >
> > > > I'll join as well (I believe the zoom link will work without an invite)
> > > >
> > > > On Wed, Apr 21, 2021 at 10:48 AM Dimitris Stafylarakis 
> > > > <xan...@gmail.com> wrote:
> > > > >
> > > > > hi all,
> > > > >
> > > > > great to read about this, I'd like to join in! Can I just join using 
> > > > > the zoom link tomorrow or do I need an invitation? (If I do need one, 
> > > > > please invite me :))
> > > > >
> > > > > cheers
> > > > >
> > > > >
> > > > > On Wed, Apr 14, 2021 at 8:15 PM Daniel Imberman 
> > > > > <daniel.imber...@gmail.com> wrote:
> > > > > >
> > > > > > Thank you Ian,
> > > > > >
> > > > > > I’ve invited everyone on this thread to the meeting with that zoom 
> > > > > > link. Anyone else who wants to join can add the calendar event here 
> > > > > > calendar.google.com/event?action=TEMPLATE&tmeid=Mm4zN2Q3MnFwNnBqbW9hMmNocXMyNzJpdHYgZGFuaWVsQGFzdHJvbm9tZXIuaW8&tmsrc=dan...@astronomer.io
> > > > > >
> > > > > > On Wed, Apr 14, 2021 at 11:05 AM, Ian Buss <ianjb...@gmail.com> 
> > > > > > wrote:
> > > > > >
> > > > > > If this works for everyone, here's a zoom link for Thursday 8AM 
> > > > > > PST: 
> > > > > > https://cloudera.zoom.us/j/99928254235?pwd=VTFlQk4vQjQ5Z2JzUDM3ZWZKKy9MQT09
> > > > > >
> > > > > > Happy to move or use an alternate method as needed.
> > > > > >
> > > > > > On Wed, Apr 14, 2021 at 6:58 PM Daniel Imberman 
> > > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > >
> > > > > > > Thursday works for me!
> > > > > > >
> > > > > > > On Wed, Apr 14, 2021 at 10:05 AM, Ian Buss <ianjb...@gmail.com> 
> > > > > > > wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I actually can’t do Wednesday next week as I’m moving house :) 
> > > > > > > Any chance we could do Thursday or Friday at the same time?
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > > Ian
> > > > > > > On 14 Apr 2021, 17:49 +0100, Kaxil Naik <kaxiln...@gmail.com>, 
> > > > > > > wrote:
> > > > > > >
> > > > > > > Just a few comments here:
> > > > > > >
> > > > > > > Currently, and at least for the foreseeable future, Airflow
> > > > > > > workers will need access to the DAG files, so workers cannot
> > > > > > > run using the serialized DAGs.
> > > > > > >
> > > > > > > Also, serialized DAGs do not even have all the info needed to
> > > > > > > run them. Currently the serialization happens in the parsing
> > > > > > > process in the scheduler, which could in future be split out as
> > > > > > > a separate "parsing" component, but that won't solve the
> > > > > > > "isolation" problem you are trying to solve. The only current way
> > > > > > > it can be solved is pickling, and we have strictly decided
> > > > > > > against using pickling for DAGs.
> > > > > > >
> > > > > > > The idea in Statement (2) & (3) would help solve the isolation 
> > > > > > > problem in (1) and can be done with some work now.
> > > > > > >
> > > > > > > Happy to talk about it in more detail here or on call, the time 
> > > > > > > Daniel suggested works for me.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Kaxil
> > > > > > >
> > > > > > > On Wed, Apr 14, 2021 at 5:35 PM Daniel Imberman 
> > > > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > How about Wednesday, April 21 at 8:00AM PST?
> > > > > > > >
> > > > > > > > On Wed, Apr 14, 2021 at 9:33 AM, Xinbin Huang 
> > > > > > > > <bin.huan...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > I am available any day.
> > > > > > > >
> > > > > > > > On Wed, Apr 14, 2021, 9:32 AM Daniel Imberman 
> > > > > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi everyone!
> > > > > > > > >
> > > > > > > > > Would people be available around 8AM/9AM PST at some point
> > > > > > > > > next week? I’m in PST and Ian is UTC+1, so it would be great
> > > > > > > > > to find a time that accommodates everyone.
> > > > > > > > >
> > > > > > > > > Daniel
> > > > > > > > > On Wed, Apr 14, 2021 at 6:26 AM, Ryan Hatter 
> > > > > > > > > <ryannhat...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > I’d also like to be added please :)
> > > > > > > > >
> > > > > > > > > On Apr 13, 2021, at 21:27, Xinbin Huang 
> > > > > > > > > <bin.huan...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi Daniel & Ian,
> > > > > > > > >
> > > > > > > > > I am also interested in the idea of a serialization 
> > > > > > > > > representation that can be executed by workers directly. Can 
> > > > > > > > > you also add me to the call?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Bin
> > > > > > > > >
> > > > > > > > > On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <ianjb...@gmail.com> 
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Daniel,
> > > > > > > > > >
> > > > > > > > > > Thanks for your warm welcome and quick response and the 
> > > > > > > > > > advice on providers! Will certainly check out the examples 
> > > > > > > > > > you sent.
> > > > > > > > > >
> > > > > > > > > > 1. An "airflow register" command definitely sounds 
> > > > > > > > > > promising, would love to collaborate on an AIP there so 
> > > > > > > > > > let's set something up.
> > > > > > > > > > 2. We use KubernetesExecutor exclusively as well. We've 
> > > > > > > > > > noticed significant additional load on the metadata DB as 
> > > > > > > > > > we scale up task pods so I've also thought about an 
> > > > > > > > > > API-based approach. Such an API could also open up the 
> > > > > > > > > > possibility of per-task security tokens which are injected 
> > > > > > > > > > by the scheduler, which should improve the security of such 
> > > > > > > > > > a system. Food for thought at least. I will start putting 
> > > > > > > > > > some of these thoughts down on paper in a sharable format.
> > > > > > > > > >
> > > > > > > > > > Ian
> > > > > > > > > >
> > > > > > > > > > On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman 
> > > > > > > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Ian,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Firstly, welcome to the Airflow community :). I'm glad to 
> > > > > > > > > > > hear you've had a positive experience so far. It's great 
> > > > > > > > > > > to hear that you want to contribute back, and I think 
> > > > > > > > > > > that multi-tenancy/DAG isolation is a pretty fantastic 
> > > > > > > > > > > project for the community as a whole (a lot of these are
> > > > > > > > > > > things we want but are limited by the hours in a day).
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 1. I've personally been kicking around some ideas lately 
> > > > > > > > > > > about an "airflow register" command that would write the 
> > > > > > > > > > > DAG into the metadata DB in a way that could be 
> > > > > > > > > > > "gettable" by the workers via the API. This work is very 
> > > > > > > > > > > early. I'd love to get some help on it. Perhaps we can 
> > > > > > > > > > > set up a zoom chat to discuss drafting an AIP?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 2. Limiting worker access to the DB is not only good 
> > > > > > > > > > > security practice; it also opens up the door to a lot of 
> > > > > > > > > > > valuable features. This feature would be especially close 
> > > > > > > > > > > to my heart as it would make the KubernetesExecutor 
> > > > > > > > > > > significantly more efficient. It should be possible to 
> > > > > > > > > > > set up a system where the workers only ever speak to an 
> > > > > > > > > > > API server and never need to touch the DB.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 3. This is not something I personally have insight into, 
> > > > > > > > > > > but I think it sounds like a good idea.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Finally, addressing your question about a Cloudera 
> > > > > > > > > > > provider. If anything, it would probably give the 
> > > > > > > > > > > provider _more_ legitimacy if you hosted it under the 
> > > > > > > > > > > Cloudera GitHub org (we very purposely created the 
> > > > > > > > > > > provider packages with this workflow in mind). There are 
> > > > > > > > > > > multiple places where we can work to surface this 
> > > > > > > > > > > provider so it is easy to find and use.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Astronomer has a pretty good sample provider here. One 
> > > > > > > > > > > example of it running in the wild is the Great 
> > > > > > > > > > > Expectations provider here. I'd also be glad to get you 
> > > > > > > > > > > in contact with people who have built providers in the 
> > > > > > > > > > > past to help you with that process.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Looking forward to seeing some of these things come to 
> > > > > > > > > > > fruition!
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Daniel
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss 
> > > > > > > > > > > <ianjb...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > First a quick introduction: I'm an engineer with Cloudera 
> > > > > > > > > > > working on our Data Engineering product (CDE). Airflow is 
> > > > > > > > > > > working great for us so far. We've been looking into how 
> > > > > > > > > > > we can enhance the multi-tenancy story of Apache Airflow 
> > > > > > > > > > > as we currently deploy it. We have the following areas 
> > > > > > > > > > > which we'd like (with community consensus) to work on and 
> > > > > > > > > > > contribute back to Apache Airflow to enhance the 
> > > > > > > > > > > isolation between tenants in a single Airflow deployment.
> > > > > > > > > > >
> > > > > > > > > > > 1. Isolating code execution and parsing of DAG files. At 
> > > > > > > > > > > the moment, DAG files are parsed in a few locations in 
> > > > > > > > > > > Airflow, including the scheduler and in tasks. There is 
> > > > > > > > > > > already the concept of DAG serialization (and we're using 
> > > > > > > > > > > that for the web component) but we'd be interested to see 
> > > > > > > > > > > if we can sandbox the execution of arbitrary user code to 
> > > > > > > > > > > a locked down process/container without full access to 
> > > > > > > > > > > the metadata DB and connection secrets etc. The idea 
> > > > > > > > > > > would be to parse and serialize the DAG in this isolated 
> > > > > > > > > > > container and pass back a serialized representation for 
> > > > > > > > > > > persistence in the DB. Has anyone explored this idea?
> > > > > > > > > > >
> > > > > > > > > > > 2. Limiting task access to the metadata DB. It would be 
> > > > > > > > > > > great if we could remove the requirement for tasks to 
> > > > > > > > > > > have full access to the metadata DB and to report task 
> > > > > > > > > > > status in a different (but still scalable) way. We'd need 
> > > > > > > > > > > to tackle access or injection of connection, variable and 
> > > > > > > > > > > xcom data as well for each task naturally.
> > > > > > > > > > >
> > > > > > > > > > > 3. Finer-grained access controls on connection secrets. 
> > > > > > > > > > > Right now, although there are nice at-rest encryption 
> > > > > > > > > > > options with Fernet or Vault, IIUC any DAG can access any 
> > > > > > > > > > > connection (and thus any secret). Since the "run as" user 
> > > > > > > > > > > is largely defined within the DAG and its tasks, this is 
> > > > > > > > > > > challenging for a multi-tenant environment (see caveat
> > > > > > > > > > > below).
> > > > > > > > > > >
> > > > > > > > > > > Caveat: It's definitely noted that to some extent we 
> > > > > > > > > > > should assume that an Airflow deployment is a "trusted" 
> > > > > > > > > > > environment and that best practices such as git+PR 
> > > > > > > > > > > workflows are the gold standard and that any malicious 
> > > > > > > > > > > code and dependencies should be identified through this 
> > > > > > > > > > > process. Also that there is a clear admin role for 
> > > > > > > > > > > connection management etc.
> > > > > > > > > > >
> > > > > > > > > > > We have some ideas informally sketched out as to how to 
> > > > > > > > > > > address the above but would be keen to hear the community 
> > > > > > > > > > > opinion on this and to see if anyone is keen to 
> > > > > > > > > > > collaborate on designs and implementation, or to hear if 
> > > > > > > > > > > anything is already in the works. In particular I noticed 
> > > > > > > > > > > that the very first improvement proposal (AIP-1) 
> > > > > > > > > > > addresses much of the above :). However, it seems fairly 
> > > > > > > > > > > dormant at the moment.
> > > > > > > > > > >
> > > > > > > > > > > One other question: we have a provider (operators and 
> > > > > > > > > > > hooks) for interacting with Cloudera components that we'd 
> > > > > > > > > > > like to contribute to the project. The provider FAQs 
> > > > > > > > > > > indicate that new provider contributions are still 
> > > > > > > > > > > welcome in the project in 2.x, is that accurate?
> > > > > > > > > > >
> > > > > > > > > > > Thanks in advance!
> > > > > > > > > > >
> > > > > > > > > > > Ian
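
To make points 2 and 3 of the original message a little more concrete, here
is a minimal sketch, not Airflow code, of how a per-task token injected by
the scheduler could let an API server hand out only the connections a task
is allowed to read, so the task never touches the metadata DB. Everything
named here (SIGNING_KEY, CONNECTIONS, issue_task_token, get_connection, the
token layout) is a hypothetical illustration and not an existing Airflow
API; a real design would also need expiry, key rotation and transport
security.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical signing key known only to the scheduler and the API server.
SIGNING_KEY = b"demo-key-not-for-production"

# Hypothetical secret store standing in for connections kept in the metadata DB.
CONNECTIONS = {
    "tenant_a_warehouse": "postgres://tenant_a:secret-a@db/a",
    "tenant_b_warehouse": "postgres://tenant_b:secret-b@db/b",
}


def issue_task_token(dag_id, task_id, allowed_conn_ids):
    """Scheduler side: sign the task identity plus the connections it may read."""
    payload = json.dumps(
        {"dag_id": dag_id, "task_id": task_id, "conns": sorted(allowed_conn_ids)},
        sort_keys=True,
    ).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def get_connection(token, conn_id):
    """API server side: verify the token, then enforce the per-task allow-list."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        raise PermissionError("malformed token") from None

    # Constant-time comparison so the check does not leak signature bytes.
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise PermissionError("invalid token signature")

    claims = json.loads(payload)
    if conn_id not in claims["conns"]:
        raise PermissionError(
            f"{claims['dag_id']}.{claims['task_id']} may not read {conn_id}"
        )
    return CONNECTIONS[conn_id]


if __name__ == "__main__":
    token = issue_task_token("tenant_a_etl", "load", ["tenant_a_warehouse"])
    print(get_connection(token, "tenant_a_warehouse"))
    try:
        get_connection(token, "tenant_b_warehouse")
    except PermissionError as exc:
        print("denied:", exc)
```

The design choice worth noting is that the allow-list travels inside the
signed token, so in this sketch the API server needs no per-task state; the
trade-off is that revoking access before a token is replaced would need
something extra, such as short-lived tokens.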
