Hi Jarek,

I like this two-step approach. I think this is good and non-breaking, for the following reasons:
1) Currently the regular re-parsing eats up (in total) a lot of unneeded time, so any kind of (easy) caching when nothing has changed would be good. This constant re-parsing not only consumes a lot of (cumulative) CPU but also puts very high pressure on keeping DAG parsing fast, because we cannot afford longer parse times killing the scheduler cycle. DagFileProcessor separates this from the scheduling loop.

2) Our use case: we want to share the DAG code across instances but have certain features turned on/off via parameters. For example, in the test instance we don't want to publish results but still want to run the same DAG for testing. Or we want the option to tune some batch parameters without needing to re-deploy code. The only channels I know of for this are Airflow Variables (which I think are great for controlling general parameters at runtime without re-deployment, and which can be auto-provisioned) or environment variables (which need to go into the deployment and are static, i.e. cannot be changed at runtime).

Otherwise I really like the discussion (below) about advanced caching strategies, such that there is an option for the code itself to tell when a re-parse actually needs to happen. This would give more control over when a re-parse happens, so that external factors like excessive imports or lookups do not put the overall parsing and scheduling in danger. We also wouldn't need to put as many rules into the heads of people starting with Airflow (whom we currently always need to tell what is wrong when they contribute their first DAGs with top-level code), which potentially dissatisfies a new user.

Best regards

Jens Scheffler
Deterministik open Loop (XC-DX/ETV5)
Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen | GERMANY | www.bosch.com
Tel.
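To make the "easy caching when nothing has changed" idea concrete, here is a minimal, hedged sketch (pure Python; names and the TTL mechanism are illustrative assumptions, not the actual Airflow patch) of the kind of per-process TTL cache that a DAG file processor could put in front of Variable/secret lookups, so frequent re-parses stop hitting the backend every time:

```python
import time

class TTLCache:
    """Tiny per-process TTL cache (illustrative sketch, not Airflow code)."""

    def __init__(self, ttl_seconds=900):
        self._ttl = ttl_seconds
        self._store = {}  # key -> (expiry_timestamp, value)

    def get(self, key, loader):
        """Return the cached value; call `loader` only on a miss or expiry."""
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader(key)
        self._store[key] = (now + self._ttl, value)
        return value

# Demo: `fake_backend` stands in for a slow network call to a secrets backend.
calls = []

def fake_backend(key):
    calls.append(key)  # record each "network round-trip"
    return f"value-for-{key}"

cache = TTLCache(ttl_seconds=900)
cache.get("batch_size", fake_backend)  # miss: hits the backend once
cache.get("batch_size", fake_backend)  # hit: served from memory
```

With a parsing loop running every 30 seconds, a 15-minute TTL like this would cut identical backend lookups by roughly a factor of 30, at the cost of values being up to one TTL stale.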
+49 711 811-91508 | Mobile +49 160 90417410 | Threema / Threema Work: KKTVR3F4 | [email protected]
Registered office: Stuttgart, Register court: Amtsgericht Stuttgart, HRB 14000;
Chairman of the Supervisory Board: Prof. Dr. Stefan Asenkerschbaumer;
Management Board: Dr. Stefan Hartung, Dr. Christian Fischer, Filiz Albrecht, Dr. Markus Forschner, Dr. Markus Heyn, Dr. Tanja Rückert

-----Original Message-----
From: Jarek Potiuk <[email protected]>
Sent: Thursday, 30 March 2023 21:16
To: [email protected]
Subject: Re: [DISCUSS] a cache for Airflow Variables

Hello here. I would like to propose something.

1) I really like the idea of conditionally disabling parsing, and I would really love for this idea to be hashed out and discussed more (especially by those who came up with it, if they have a bit more time than me; I seem to be involved in more things recently :D). I would really like this to happen, but I am afraid it might take more time to discuss some of the consequences; it would also necessarily require our users to adopt and maintain some kind of exclusion mechanism (and we know adoption will take a lot of time).

2) On the other hand, we have a very small, very localized, and already tested change that can help a number of users: a change that only involves running a shared in-memory cache in the DagFileProcessor, and that does not require any changes from users (maybe except a configuration parameter change to enable it). It does not really change operational complexity, nor does it bear any risks (especially if it is disabled by default).

Or this is at least how I see the options we have. What I would hate most is that neither of the two happens, because nobody will want to spend their time discussing, approving, and implementing 1), whereas 2) will be blocked because 1) is a better (though only ideated) solution.

My proposal would be to get this in for the DAG file processor only, disabled by default, with some extra documentation explaining the consequences.
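An opt-in, disabled-by-default switch like the one Jarek describes would presumably surface as a configuration option. A purely hypothetical `airflow.cfg` fragment (the section and key names here are illustrative assumptions, not a released Airflow option):

```ini
[secrets]
# Hypothetical opt-in switch for the in-memory Variable cache in the
# DAG file processor; disabled by default, per the proposal.
use_cache = False
```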
Does it bear any risks I am not aware of if we do so? Is it too much to ask?

J.

On Mon, Mar 27, 2023 at 8:26 PM Vandon, Raphael <[email protected]> wrote:
>
> My initial goal when working on this cache was mostly to shorten DAG
> parsing times, simply because that's what I was looking at, so I'd be
> happy with restricting this cache to DAG parsing.
> I'm still relatively new to the Airflow codebase, so I don't know all
> the implications this change has, and I'm grateful for the comments here.
>
> The benefits can be quite noticeable, depending a lot on the context.
> If the DAG file is simple, then a network call is going to be slow in
> comparison. And if the parsing interval is short compared to the DAG
> execution schedule, then the number of calls to get secrets is going to
> be dominated by the DAG parsing rather than the executions.
>
> The scenario where this brings the most benefit is many simple DAG
> files, all querying the same key from the Variables, parsed regularly,
> and run less often.
>
> @Hussein says that "the user can implement [their] own secret backend",
> but it's not an easy task. They'd have to implement it as a wrapper
> around the custom backend they want to use, since there can only be one
> custom secrets backend. And implementing an in-memory cache that works
> across processes just as a custom backend is straight-up impossible.
>
> About secure caching: since I'm only caching in memory, I didn't do
> anything in that regard, but we already have something in place to
> encrypt secrets when they are saved in the metastore, using
> cryptography.fernet.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
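Raphael's "wrapper around the custom backend" workaround can be sketched as follows. This is a hedged, pure-Python illustration (the class names and the simplified `get_variable` interface are assumptions, not the real `BaseSecretsBackend` API); it also shows why the cache stays per-process: the dict lives in the memory of whichever process created it.

```python
class SlowBackend:
    """Stands in for a real secrets backend doing network round-trips."""

    def __init__(self):
        self.network_calls = 0

    def get_variable(self, key):
        self.network_calls += 1  # pretend this is a network round-trip
        return f"remote:{key}"

class CachingBackendWrapper:
    """In-process cache in front of another backend.

    Since Airflow allows only one custom secrets backend, a user wanting
    both caching and (say) a cloud backend must wrap the latter like this.
    Limitation per Raphael's point: this dict lives in ONE process, so the
    many parsing subprocesses would each hold their own independent copy;
    a cross-process cache cannot be built this way.
    """

    def __init__(self, inner):
        self._inner = inner
        self._cache = {}

    def get_variable(self, key):
        if key not in self._cache:
            self._cache[key] = self._inner.get_variable(key)
        return self._cache[key]

backend = SlowBackend()
cached = CachingBackendWrapper(backend)
cached.get_variable("shared_key")  # miss: one call to the inner backend
cached.get_variable("shared_key")  # hit: no additional backend call
```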
