Cool. I will propose next steps tomorrow most likely.

On Sat, Nov 13, 2021 at 12:37 PM Ian Buss <[email protected]> wrote:
>
> Thanks Jarek! Will join the channel. This is still something close to my 
> heart and an area where I think we could make great progress. I agree that we 
> should get together and start a draft AIP around this to gather the various 
> strands of this together.
> On 7 Nov 2021, 13:19 +0000, Jarek Potiuk <[email protected]>, wrote:
>
> For now I created a channel for the SIG:
> https://apache-airflow.slack.com/archives/C02M551UDA4 - anyone, feel free to
> join. In the coming weeks, once all the people involved so far have
> expressed their interest, we should set a plan for getting the
> AIP(s) drafted/discussed and start implementing them.
>
> On Sat, Nov 6, 2021 at 11:22 PM Xinbin Huang <[email protected]> wrote:
>
>
> Hi Jarek,
>
> The plan sounds great! And +1 to a special interest group. Please add me to 
> the group if you do create one.
>
> Here is the doc (Airflow Multi-tenancy discussion) we used for the discussion 
> back in April. It's not a note per se, but I think it can shed some light on what 
> we talked about. Other folks may have an actual note or even a draft proposal 
> on this topic.
>
> I'm excited for us to move forward with this.
>
> Bin
>
> On Fri, Nov 5, 2021 at 10:38 AM Jarek Potiuk <[email protected]> wrote:
>
>
> Hello Ian, Everyone,
>
> I wonder if there are any notes from the meeting in April? Has there
> been any more work on that one from Cloudera to formalize and plan
> work on it?
>
> I was not able to participate, but I think it's about time to
> seriously start work on that and I am super happy to take more lead on
> this project and involve all the interested parties. The ideas
> described in the email and discussed after are I think super
> reasonable and definitely necessary to get to the multi-tenancy and I
> believe that there are already ideas that can be turned into reality
> rather soon. I had a talk today also with the Google Composer team and
> they are also fully on board with dedicating a lot of effort on this
> one (and their ideas are I think super-aligned with Cloudera's), so I
> think we have a critical mass and engineering power to make it happen
> :)
>
> I plan to put quite a lot of focus on that one over the coming months
> and I am happy to lead or co-lead the AIP and take a big part in
> implementation.
>
> Possibly we should create a special interest group around that and
> start drafting the AIP proposals in a smaller group of people who are
> interested and start planning the work. I already have some ideas
> where we could start gradually implementing it (of course after we
> prepare the AIP and get it through the community's approval process).
>
> How does it sound?
>
> J.
>
> On Wed, Apr 21, 2021 at 8:56 AM Ian Buss <[email protected]> wrote:
>
>
> Yes, no invite required. See you tomorrow!
> On 21 Apr 2021, 07:46 +0100, Sumit Maheshwari <[email protected]>, wrote:
>
> I'll join as well (I believe the zoom link will work without an invite)
>
> On Wed, Apr 21, 2021 at 10:48 AM Dimitris Stafylarakis <[email protected]> 
> wrote:
>
>
> hi all,
>
> great to read about this, I'd like to join in! Can I just join using the zoom 
> link tomorrow or do I need an invitation? (If I do need one, please invite me 
> :))
>
> cheers
>
>
> On Wed, Apr 14, 2021 at 8:15 PM Daniel Imberman <[email protected]> 
> wrote:
>
>
> Thank you Ian,
>
> I’ve invited everyone on this thread to the meeting with that zoom link. 
> Anyone else who wants to join can add the calendar event here 
> calendar.google.com/event?action=TEMPLATE&tmeid=Mm4zN2Q3MnFwNnBqbW9hMmNocXMyNzJpdHYgZGFuaWVsQGFzdHJvbm9tZXIuaW8&[email protected]
>
> On Wed, Apr 14, 2021 at 11:05 AM, Ian Buss <[email protected]> wrote:
>
> If this works for everyone, here's a zoom link for Thursday 8AM PST: 
> https://cloudera.zoom.us/j/99928254235?pwd=VTFlQk4vQjQ5Z2JzUDM3ZWZKKy9MQT09
>
> Happy to move or use an alternate method as needed.
>
> On Wed, Apr 14, 2021 at 6:58 PM Daniel Imberman <[email protected]> 
> wrote:
>
>
> Thursday works for me!
>
> On Wed, Apr 14, 2021 at 10:05 AM, Ian Buss <[email protected]> wrote:
>
> Hi all,
>
> I actually can’t do Wednesday next week as I’m moving house :) Any chance we 
> could do Thursday or Friday at the same time?
>
> Cheers
>
> Ian
> On 14 Apr 2021, 17:49 +0100, Kaxil Naik <[email protected]>, wrote:
>
> Just a few comments here:
>
> Currently, and at least for the foreseeable future, Airflow workers will need 
> access to the DAG files, so workers cannot run using the serialized DAGs.
>
> Also, serialized DAGs do not even have all the info needed to run them. 
> Currently the serialization happens in the parsing process in the scheduler, 
> which could in future be split out as a separate "parsing" component, but 
> that won't solve the "isolation" problem you are trying to solve. The only 
> current way it can be solved is pickling -- and we have strictly decided 
> against using pickling for DAGs.
>
> The idea in Statement (2) & (3) would help solve the isolation problem in (1) 
> and can be done with some work now.
>
> Happy to talk about it in more detail here or on call, the time Daniel 
> suggested works for me.
>
> Regards,
> Kaxil
>
> On Wed, Apr 14, 2021 at 5:35 PM Daniel Imberman <[email protected]> 
> wrote:
>
>
> How about Wednesday, April 21 at 8:00AM PST?
>
> On Wed, Apr 14, 2021 at 9:33 AM, Xinbin Huang <[email protected]> wrote:
>
> I am available any day.
>
> On Wed, Apr 14, 2021, 9:32 AM Daniel Imberman <[email protected]> 
> wrote:
>
>
> Hi everyone!
>
> Would people be available around 8AM/9AM PST at some point next week? I’m in PST 
> and Ian is UTC+1 so it would be great to find a time that accommodates 
> everyone.
>
> Daniel
> On Wed, Apr 14, 2021 at 6:26 AM, Ryan Hatter <[email protected]> wrote:
>
> I’d also like to be added please :)
>
> On Apr 13, 2021, at 21:27, Xinbin Huang <[email protected]> wrote:
>
> 
> Hi Daniel & Ian,
>
> I am also interested in the idea of a serialization representation that can 
> be executed by workers directly. Can you also add me to the call?
>
> Thanks
> Bin
>
> On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <[email protected]> wrote:
>
>
> Daniel,
>
> Thanks for your warm welcome and quick response and the advice on providers! 
> Will certainly check out the examples you sent.
>
> 1. An "airflow register" command definitely sounds promising, would love to 
> collaborate on an AIP there so let's set something up.
> 2. We use KubernetesExecutor exclusively as well. We've noticed significant 
> additional load on the metadata DB as we scale up task pods so I've also 
> thought about an API-based approach. Such an API could also open up the 
> possibility of per-task security tokens which are injected by the scheduler, 
> which should improve the security of such a system. Food for thought at 
> least. I will start putting some of these thoughts down on paper in a 
> sharable format.
>
> Ian
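
The per-task security tokens floated in the message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how Airflow works: the function names, the shared secret, and the choice of HMAC-SHA256 are all hypothetical. The idea is that the scheduler signs the task's identity, injects the token into the task pod, and an API server checks the token before accepting a status update.

```python
import hashlib
import hmac


def issue_task_token(secret: bytes, dag_id: str, task_id: str, run_id: str) -> str:
    """Hypothetical scheduler side: sign the identity of one task instance."""
    # A NUL separator keeps ("a", "bc") and ("ab", "c") from colliding.
    message = "\x00".join((dag_id, task_id, run_id)).encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_task_token(
    secret: bytes, dag_id: str, task_id: str, run_id: str, token: str
) -> bool:
    """Hypothetical API side: accept a request only for the signed task instance."""
    expected = issue_task_token(secret, dag_id, task_id, run_id)
    # Constant-time comparison, done on bytes so any input string is accepted.
    return hmac.compare_digest(expected.encode(), token.encode())
```

A token issued for one task instance does not verify for another, so a compromised task could at most report status for itself. Expiry and secret rotation are left out of this sketch.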
>
> On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman <[email protected]> 
> wrote:
>
>
> Hi Ian,
>
>
> Firstly, welcome to the Airflow community :). I'm glad to hear you've had a 
> positive experience so far. It's great to hear that you want to contribute 
> back, and I think that multi-tenancy/DAG isolation is a pretty fantastic 
> project for the community as a whole (a lot of things are things we want 
> but are limited by hours in a day).
>
>
> 1. I've personally been kicking around some ideas lately about an "airflow 
> register" command that would write the DAG into the metadata DB in a way that 
> could be "gettable" by the workers via the API. This work is very early. I'd 
> love to get some help on it. Perhaps we can set up a zoom chat to discuss 
> drafting an AIP?
>
>
> 2. Limiting worker access to the DB is not only good security practice; it 
> also opens up the door to a lot of valuable features. This feature would be 
> especially close to my heart as it would make the KubernetesExecutor 
> significantly more efficient. It should be possible to set up a system where 
> the workers only ever speak to an API server and never need to touch the DB.
>
>
> 3. This is not something I personally have insight into, but I think it 
> sounds like a good idea.
>
>
> Finally, addressing your question about a Cloudera provider. If anything, it 
> would probably give the provider _more_ legitimacy if you hosted it under the 
> Cloudera GitHub org (we very purposely created the provider packages with 
> this workflow in mind). There are multiple places where we can work to 
> surface this provider so it is easy to find and use.
>
>
> Astronomer has a pretty good sample provider here. One example of it running 
> in the wild is the Great Expectations provider here. I'd also be glad to get 
> you in contact with people who have built providers in the past to help you 
> with that process.
>
>
> Looking forward to seeing some of these things come to fruition!
>
>
> Daniel
>
>
> On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <[email protected]> wrote:
>
> Hi all,
>
> First a quick introduction: I'm an engineer with Cloudera working on our Data 
> Engineering product (CDE). Airflow is working great for us so far. We've been 
> looking into how we can enhance the multi-tenancy story of Apache Airflow as 
> we currently deploy it. We have the following areas which we'd like (with 
> community consensus) to work on and contribute back to Apache Airflow to 
> enhance the isolation between tenants in a single Airflow deployment.
>
> 1. Isolating code execution and parsing of DAG files. At the moment, DAG 
> files are parsed in a few locations in Airflow, including the scheduler and 
> in tasks. There is already the concept of DAG serialization (and we're using 
> that for the web component) but we'd be interested to see if we can sandbox 
> the execution of arbitrary user code to a locked down process/container 
> without full access to the metadata DB and connection secrets etc. The idea 
> would be to parse and serialize the DAG in this isolated container and pass 
> back a serialized representation for persistence in the DB. Has anyone 
> explored this idea?
>
> 2. Limiting task access to the metadata DB. It would be great if we could 
> remove the requirement for tasks to have full access to the metadata DB and 
> to report task status in a different (but still scalable) way. Naturally, we'd 
> also need to tackle access to, or injection of, connection, variable and XCom 
> data for each task.
>
> 3. Finer-grained access controls on connection secrets. Right now, although 
> there are nice at-rest encryption options with Fernet or Vault, IIUC any DAG 
> can access any connection (and thus any secret). Since the "run as" user is 
> largely defined within the DAG and its tasks, this is challenging for a 
> multi-tenant environment (see caveat below).
>
> Caveat: It's definitely noted that to some extent we should assume that an 
> Airflow deployment is a "trusted" environment and that best practices such as 
> git+PR workflows are the gold standard and that any malicious code and 
> dependencies should be identified through this process. Also that there is a 
> clear admin role for connection management etc.
>
> We have some ideas informally sketched out as to how to address the above but 
> would be keen to hear the community opinion on this and to see if anyone is 
> keen to collaborate on designs and implementation, or to hear if anything is 
> already in the works. In particular I noticed that the very first improvement 
> proposal (AIP-1) addresses much of the above :). However, it seems fairly 
> dormant at the moment.
>
> One other question: we have a provider (operators and hooks) for interacting 
> with Cloudera components that we'd like to contribute to the project. The 
> provider FAQs indicate that new provider contributions are still welcome in 
> the project in 2.x, is that accurate?
>
> Thanks in advance!
>
> Ian
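
Idea (1) in the original message, parsing a DAG file in a locked-down process and passing back only a serialized representation, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Airflow's parser: the `DAG_SPEC` convention, the JSON shape, and `parse_in_child` are hypothetical. The child process is started without any environment variable whose name begins with `AIRFLOW`, which is assumed here to cover the variables carrying the DB connection string and connection secrets.

```python
import json
import os
import subprocess
import sys
import textwrap

# Hypothetical child script: the only place where untrusted user code runs.
CHILD = textwrap.dedent("""
    import json, sys
    namespace = {}
    exec(sys.stdin.read(), namespace)
    dag = namespace["DAG_SPEC"]  # hypothetical convention for this sketch
    json.dump({"dag_id": dag["dag_id"], "tasks": sorted(dag["tasks"])}, sys.stdout)
""")


def parse_in_child(dag_source: str, timeout: float = 30.0) -> dict:
    """Run user DAG code in a separate interpreter and return plain JSON data."""
    # Assumption: secrets reach the parent through AIRFLOW*-prefixed variables.
    env = {k: v for k, v in os.environ.items() if not k.startswith("AIRFLOW")}
    result = subprocess.run(
        [sys.executable, "-I", "-c", CHILD],  # -I: isolated mode
        input=dag_source,
        capture_output=True,
        text=True,
        timeout=timeout,
        env=env,
        check=True,
    )
    # Only data crosses the boundary; no user objects are unpickled here.
    return json.loads(result.stdout)
```

A separate process with a filtered environment is not a real sandbox: the child can still reach the filesystem and the network, so the container-level lockdown described in the message would still be needed on top of this.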
