For now I created a channel for the SIG:
https://apache-airflow.slack.com/archives/C02M551UDA4 - anyone should
feel free to join. In the next few weeks, once all the people involved
so far have expressed their interest, we should set a plan for getting
the AIP(s) drafted and discussed, and then start implementing them.

On Sat, Nov 6, 2021 at 11:22 PM Xinbin Huang <bin.huan...@gmail.com> wrote:
>
> Hi Jarek,
>
> The plan sounds great! And +1 to a special interest group. Please add me to 
> the group if you do create one.
>
> Here is the doc (Airflow Multi-tenancy discussion) we used for the discussion 
> back in April. It's not a set of notes per se, but I think it can shed some 
> light on what we talked about. Other folks may have actual notes or even a 
> draft proposal on this topic.
>
> I'm excited for us to move forward with this.
>
> Bin
>
> On Fri, Nov 5, 2021 at 10:38 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> Hello Ian, Everyone,
>>
>> I wonder if there are any notes from the meeting in April? Has there
>> been any more work on that one from Cloudera to formalize and plan
>> work on it?
>>
>> I was not able to participate, but I think it's about time to
>> seriously start work on that, and I am super happy to take more of a
>> lead on this project and involve all the interested parties. The ideas
>> described in the email and discussed afterwards are, I think, super
>> reasonable and definitely necessary to get to multi-tenancy, and I
>> believe that there are already ideas that can be turned into reality
>> rather soon. I also had a talk today with the Google Composer team, and
>> they are fully on board with dedicating a lot of effort to this
>> one (and their ideas are, I think, super-aligned with Cloudera's), so I
>> think we have the critical mass and engineering power to make it happen
>> :)
>>
>> I plan to put quite a lot of focus on that one over the coming months
>> and I am happy to lead or co-lead the AIP and take a big part in
>> implementation.
>>
>> Possibly we should create a special interest group around that and
>> start drafting the AIP proposals in a smaller group of people who are
>> interested and start planning the work. I already have some ideas
>> where we could start gradually implementing it (of course after we
>> prepare the AIP and get it through the community's approval process).
>>
>> How does it sound?
>>
>> J.
>>
>> On Wed, Apr 21, 2021 at 8:56 AM Ian Buss <ianjb...@gmail.com> wrote:
>> >
>> > Yes, no invite required. See you tomorrow!
>> > On 21 Apr 2021, 07:46 +0100, Sumit Maheshwari <msu...@apache.org>, wrote:
>> >
>> > I'll join as well (I believe the zoom link will work without an invite)
>> >
>> > On Wed, Apr 21, 2021 at 10:48 AM Dimitris Stafylarakis <xan...@gmail.com> 
>> > wrote:
>> >>
>> >> hi all,
>> >>
>> >> great to read about this, I'd like to join in! Can I just join using the 
>> >> zoom link tomorrow or do I need an invitation? (If I do need one, please 
>> >> invite me :))
>> >>
>> >> cheers
>> >>
>> >>
>> >> On Wed, Apr 14, 2021 at 8:15 PM Daniel Imberman 
>> >> <daniel.imber...@gmail.com> wrote:
>> >>>
>> >>> Thank you Ian,
>> >>>
>> >>> I’ve invited everyone on this thread to the meeting with that zoom link. 
>> >>> Anyone else who wants to join can add the calendar event here 
>> >>> calendar.google.com/event?action=TEMPLATE&tmeid=Mm4zN2Q3MnFwNnBqbW9hMmNocXMyNzJpdHYgZGFuaWVsQGFzdHJvbm9tZXIuaW8&tmsrc=dan...@astronomer.io
>> >>>
>> >>> On Wed, Apr 14, 2021 at 11:05 AM, Ian Buss <ianjb...@gmail.com> wrote:
>> >>>
>> >>> If this works for everyone, here's a zoom link for Thursday 8AM PST: 
>> >>> https://cloudera.zoom.us/j/99928254235?pwd=VTFlQk4vQjQ5Z2JzUDM3ZWZKKy9MQT09
>> >>>
>> >>> Happy to move or use an alternate method as needed.
>> >>>
>> >>> On Wed, Apr 14, 2021 at 6:58 PM Daniel Imberman 
>> >>> <daniel.imber...@gmail.com> wrote:
>> >>>>
>> >>>> Thursday works for me!
>> >>>>
>> >>>> On Wed, Apr 14, 2021 at 10:05 AM, Ian Buss <ianjb...@gmail.com> wrote:
>> >>>>
>> >>>> Hi all,
>> >>>>
>> >>>> I actually can’t do Wednesday next week as I’m moving house :) Any 
>> >>>> chance we could do Thursday or Friday at the same time?
>> >>>>
>> >>>> Cheers
>> >>>>
>> >>>> Ian
>> >>>> On 14 Apr 2021, 17:49 +0100, Kaxil Naik <kaxiln...@gmail.com>, wrote:
>> >>>>
>> >>>> Just a few comments here:
>> >>>>
>> >>>> Currently -- and at least for the foreseeable future -- Airflow workers 
>> >>>> will need access to the DAG files, so workers cannot run using the 
>> >>>> serialized DAGs.
>> >>>>
>> >>>> Also, serialized DAGs do not even have all the info needed to run 
>> >>>> them. Currently the serialization happens in the parsing process in the 
>> >>>> scheduler, which could in future be split out as a separate "parsing" 
>> >>>> component, but that won't solve the "isolation" problem you are trying 
>> >>>> to solve. The only current way it can be solved is pickling -- and we 
>> >>>> have strictly decided against using pickling for DAGs.
>> >>>>
>> >>>> The ideas in statements (2) & (3) would help solve the isolation problem 
>> >>>> in (1) and can be done with some work now.
>> >>>>
>> >>>> Happy to talk about it in more detail here or on a call; the time Daniel 
>> >>>> suggested works for me.
>> >>>>
>> >>>> Regards,
>> >>>> Kaxil
>> >>>>
>> >>>> On Wed, Apr 14, 2021 at 5:35 PM Daniel Imberman 
>> >>>> <daniel.imber...@gmail.com> wrote:
>> >>>>>
>> >>>>> How about Wednesday, April 21 at 8:00AM PST?
>> >>>>>
>> >>>>> On Wed, Apr 14, 2021 at 9:33 AM, Xinbin Huang <bin.huan...@gmail.com> 
>> >>>>> wrote:
>> >>>>>
>> >>>>> I am available any day.
>> >>>>>
>> >>>>> On Wed, Apr 14, 2021, 9:32 AM Daniel Imberman 
>> >>>>> <daniel.imber...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Hi everyone!
>> >>>>>>
>> >>>>>> Would people be available around 8AM/9AM PST at some point next week? 
>> >>>>>> I’m in PST and Ian is UTC+1, so it would be great to find a time that 
>> >>>>>> accommodates everyone.
>> >>>>>>
>> >>>>>> Daniel
>> >>>>>> On Wed, Apr 14, 2021 at 6:26 AM, Ryan Hatter <ryannhat...@gmail.com> 
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>> I’d also like to be added please :)
>> >>>>>>
>> >>>>>> On Apr 13, 2021, at 21:27, Xinbin Huang <bin.huan...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> 
>> >>>>>> Hi Daniel & Ian,
>> >>>>>>
>> >>>>>> I am also interested in the idea of a serialized representation 
>> >>>>>> that can be executed by workers directly. Can you also add me to the 
>> >>>>>> call?
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>> Bin
>> >>>>>>
>> >>>>>> On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <ianjb...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Daniel,
>> >>>>>>>
>> >>>>>>> Thanks for your warm welcome and quick response and the advice on 
>> >>>>>>> providers! Will certainly check out the examples you sent.
>> >>>>>>>
>> >>>>>>> 1. An "airflow register" command definitely sounds promising, would 
>> >>>>>>> love to collaborate on an AIP there so let's set something up.
>> >>>>>>> 2. We use KubernetesExecutor exclusively as well. We've noticed 
>> >>>>>>> significant additional load on the metadata DB as we scale up task 
>> >>>>>>> pods so I've also thought about an API-based approach. Such an API 
>> >>>>>>> could also open up the possibility of per-task security tokens which 
>> >>>>>>> are injected by the scheduler, which should improve the security of 
>> >>>>>>> such a system. Food for thought at least. I will start putting some 
>> >>>>>>> of these thoughts down on paper in a sharable format.
>> >>>>>>>
>> >>>>>>> Ian
>> >>>>>>>
>> >>>>>>> On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman 
>> >>>>>>> <daniel.imber...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Ian,
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Firstly, welcome to the Airflow community :). I'm glad to hear 
>> >>>>>>>> you've had a positive experience so far. It's great to hear that 
>> >>>>>>>> you want to contribute back, and I think that multi-tenancy/DAG 
>> >>>>>>>> isolation is a pretty fantastic project for the community as a 
>> >>>>>>>> whole (a lot of these are things we want but are limited by 
>> >>>>>>>> hours in the day).
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> 1. I've personally been kicking around some ideas lately about an 
>> >>>>>>>> "airflow register" command that would write the DAG into the 
>> >>>>>>>> metadata DB in a way that could be "gettable" by the workers via 
>> >>>>>>>> the API. This work is very early. I'd love to get some help on it. 
>> >>>>>>>> Perhaps we can set up a zoom chat to discuss drafting an AIP?
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> 2. Limiting worker access to the DB is not only good security 
>> >>>>>>>> practice; it also opens up the door to a lot of valuable features. 
>> >>>>>>>> This feature would be especially close to my heart as it would make 
>> >>>>>>>> the KubernetesExecutor significantly more efficient. It should be 
>> >>>>>>>> possible to set up a system where the workers only ever speak to an 
>> >>>>>>>> API server and never need to touch the DB.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> 3. This is not something I personally have insight into, but I 
>> >>>>>>>> think it sounds like a good idea.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Finally, addressing your question about a Cloudera provider. If 
>> >>>>>>>> anything, it would probably give the provider _more_ legitimacy if 
>> >>>>>>>> you hosted it under the Cloudera GitHub org (we very purposely 
>> >>>>>>>> created the provider packages with this workflow in mind). There 
>> >>>>>>>> are multiple places where we can work to surface this provider so 
>> >>>>>>>> it is easy to find and use.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Astronomer has a pretty good sample provider here. One example of 
>> >>>>>>>> it running in the wild is the Great Expectations provider here. I'd 
>> >>>>>>>> also be glad to get you in contact with people who have built 
>> >>>>>>>> providers in the past to help you with that process.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Looking forward to seeing some of these things come to fruition!
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Daniel
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <ianjb...@gmail.com> 
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi all,
>> >>>>>>>>
>> >>>>>>>> First a quick introduction: I'm an engineer with Cloudera working 
>> >>>>>>>> on our Data Engineering product (CDE). Airflow is working great for 
>> >>>>>>>> us so far. We've been looking into how we can enhance the 
>> >>>>>>>> multi-tenancy story of Apache Airflow as we currently deploy it. We 
>> >>>>>>>> have the following areas which we'd like (with community consensus) 
>> >>>>>>>> to work on and contribute back to Apache Airflow to enhance the 
>> >>>>>>>> isolation between tenants in a single Airflow deployment.
>> >>>>>>>>
>> >>>>>>>> 1. Isolating code execution and parsing of DAG files. At the 
>> >>>>>>>> moment, DAG files are parsed in a few locations in Airflow, 
>> >>>>>>>> including the scheduler and in tasks. There is already the concept 
>> >>>>>>>> of DAG serialization (and we're using that for the web component) 
>> >>>>>>>> but we'd be interested to see if we can sandbox the execution of 
>> >>>>>>>> arbitrary user code to a locked down process/container without full 
>> >>>>>>>> access to the metadata DB and connection secrets etc. The idea 
>> >>>>>>>> would be to parse and serialize the DAG in this isolated container 
>> >>>>>>>> and pass back a serialized representation for persistence in the 
>> >>>>>>>> DB. Has anyone explored this idea?
>> >>>>>>>>
>> >>>>>>>> 2. Limiting task access to the metadata DB. It would be great if we 
>> >>>>>>>> could remove the requirement for tasks to have full access to the 
>> >>>>>>>> metadata DB and to report task status in a different (but still 
>> >>>>>>>> scalable) way. Naturally, we'd also need to tackle access to, or 
>> >>>>>>>> injection of, connection, variable and XCom data for each task.
>> >>>>>>>>
>> >>>>>>>> 3. Finer-grained access controls on connection secrets. Right now, 
>> >>>>>>>> although there are nice at-rest encryption options with Fernet or 
>> >>>>>>>> Vault, IIUC any DAG can access any connection (and thus any 
>> >>>>>>>> secret). Since the "run as" user is largely defined within the DAG 
>> >>>>>>>> and its tasks, this is challenging for a multi-tenant environment 
>> >>>>>>>> (see caveat below).
>> >>>>>>>>
>> >>>>>>>> Caveat: It's definitely noted that to some extent we should assume 
>> >>>>>>>> that an Airflow deployment is a "trusted" environment and that best 
>> >>>>>>>> practices such as git+PR workflows are the gold standard and that 
>> >>>>>>>> any malicious code and dependencies should be identified through 
>> >>>>>>>> this process. Also that there is a clear admin role for connection 
>> >>>>>>>> management etc.
>> >>>>>>>>
>> >>>>>>>> We have some ideas informally sketched out as to how to address the 
>> >>>>>>>> above but would be keen to hear the community opinion on this and 
>> >>>>>>>> to see if anyone is keen to collaborate on designs and 
>> >>>>>>>> implementation, or to hear if anything is already in the works. In 
>> >>>>>>>> particular I noticed that the very first improvement proposal 
>> >>>>>>>> (AIP-1) addresses much of the above :). However, it seems fairly 
>> >>>>>>>> dormant at the moment.
>> >>>>>>>>
>> >>>>>>>> One other question: we have a provider (operators and hooks) for 
>> >>>>>>>> interacting with Cloudera components that we'd like to contribute 
>> >>>>>>>> to the project. The provider FAQs indicate that new provider 
>> >>>>>>>> contributions are still welcome in the project in 2.x, is that 
>> >>>>>>>> accurate?
>> >>>>>>>>
>> >>>>>>>> Thanks in advance!
>> >>>>>>>>
>> >>>>>>>> Ian
