Hi Jarek,

I like this two-step approach. I think this is good and non-breaking, for the following reasons:
1) Currently the regular re-parsing eats up (in total) a lot of unneeded time, so any kind of (easy) caching when nothing has changed would be good. This constant re-parsing not only consumes a lot of (cumulative) CPU but also puts very high pressure on keeping DAG parsing fast, because we cannot afford longer parse times killing the scheduler cycle. DagFileProcessor separates this from the scheduling loop.

2) Our use case: we want to share the DAG code across instances but have certain features turned on/off via parameters. For example, in the test instance we don't want to publish results but still want to run the same DAG for testing. Or we want the option to tune some batch parameters without needing to re-deploy code. The only channels I know of for this are Airflow Variables (which I think are great for controlling general parameters at runtime without re-deployment, and which can be auto-provisioned) or environment variables (which need to go into the deployment and are static, i.e. cannot be changed at runtime).

Otherwise I really like the discussion (below) about advanced caching strategies, such that there is an option for the code itself to tell when a re-parse actually needs to happen. This would give more control over when a re-parse happens, so that external factors like excessive imports or lookups do not put the overall parsing and scheduling in danger. We also wouldn't need to put as many rules into the heads of people starting with Airflow (whom we currently always need to tell what is wrong when they contribute their first DAGs with top-level code), which potentially dissatisfies a new user.

Best regards

Jens Scheffler
Deterministik open Loop (XC-DX/ETV5)
Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen | GERMANY | www.bosch.com
Tel.
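To make the "easy caching when nothing has changed" idea concrete, here is a minimal, hedged sketch (pure Python; names and the TTL mechanism are illustrative assumptions, not the actual Airflow patch) of the kind of per-process TTL cache that a DAG file processor could put in front of Variable/secret lookups, so frequent re-parses stop hitting the backend every time:

```python
import time

class TTLCache:
    """Tiny per-process TTL cache (illustrative sketch, not Airflow code)."""

    def __init__(self, ttl_seconds=900):
        self._ttl = ttl_seconds
        self._store = {}  # key -> (expiry_timestamp, value)

    def get(self, key, loader):
        """Return the cached value; call `loader` only on a miss or expiry."""
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader(key)
        self._store[key] = (now + self._ttl, value)
        return value

# Demo: `fake_backend` stands in for a slow network call to a secrets backend.
calls = []

def fake_backend(key):
    calls.append(key)  # record each "network round-trip"
    return f"value-for-{key}"

cache = TTLCache(ttl_seconds=900)
cache.get("batch_size", fake_backend)  # miss: hits the backend once
cache.get("batch_size", fake_backend)  # hit: served from memory
```

With a parsing loop running every 30 seconds, a 15-minute TTL like this would cut identical backend lookups by roughly a factor of 30, at the cost of values being up to one TTL stale.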
+49 711 811-91508 | Mobile +49 160 90417410 | Threema / Threema Work: KKTVR3F4 | [email protected]
Registered office: Stuttgart, Register court: Amtsgericht Stuttgart, HRB 14000;
Chairman of the Supervisory Board: Prof. Dr. Stefan Asenkerschbaumer;
Management Board: Dr. Stefan Hartung, Dr. Christian Fischer, Filiz Albrecht, Dr. Markus Forschner, Dr. Markus Heyn, Dr. Tanja Rückert

-----Original Message-----
From: Jarek Potiuk <[email protected]>
Sent: Thursday, 30 March 2023 21:16
To: [email protected]
Subject: Re: [DISCUSS] a cache for Airflow Variables

Hello here. I would like to propose something.

1) I really like the idea of conditionally disabling parsing, and I would really love for this idea to be hashed out and discussed more (especially by those who came up with it, if they have a bit more time than me; I seem to be involved in more things recently :D). I would really like this to happen, but I am afraid it might take more time to discuss some of the consequences; it would also necessarily require our users to adopt and maintain some kind of exclusion mechanism (and we know adoption will take a lot of time).

2) On the other hand, we have a very small, very localized, and already tested change that can help a number of users: a change that only involves running a shared in-memory cache in the DagFileProcessor, and that does not require any changes from users (maybe except a configuration parameter change to enable it). It does not really change operational complexity, nor does it bear any risks (especially if it is disabled by default).

Or this is at least how I see the options we have. What I would hate most is that neither of the two happens, because nobody will want to spend their time discussing, approving, and implementing 1), whereas 2) will be blocked because 1) is a better (though only ideated) solution.

My proposal would be to get this in for the DAG file processor only, disabled by default, with some extra documentation explaining the consequences.
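An opt-in, disabled-by-default switch like the one Jarek describes would presumably surface as a configuration option. A purely hypothetical `airflow.cfg` fragment (the section and key names here are illustrative assumptions, not a released Airflow option):

```ini
[secrets]
# Hypothetical opt-in switch for the in-memory Variable cache in the
# DAG file processor; disabled by default, per the proposal.
use_cache = False
```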
Does it bear any risks I am not aware of if we do so? Is it too much to ask?

J.

On Mon, Mar 27, 2023 at 8:26 PM Vandon, Raphael <[email protected]> wrote:
>
> My initial goal when working on this cache was mostly to shorten DAG
> parsing times, simply because that's what I was looking at, so I'd be
> happy with restricting this cache to DAG parsing.
> I'm still relatively new to the Airflow codebase, so I don't know all
> the implications this change has, and I'm grateful for the comments here.
>
> The benefits can be quite noticeable, depending a lot on the context.
> If the DAG file is simple, then a network call is going to be slow in
> comparison. And if the parsing interval is short compared to the DAG
> execution schedule, then the number of calls to get secrets is going to
> be dominated by the DAG parsing rather than the executions.
>
> The scenario where this brings the most benefit is many simple DAG
> files, all querying the same key from the Variables, parsed regularly,
> and run less often.
>
> @Hussein says that "the user can implement [their] own secret backend",
> but it's not an easy task. They'd have to implement it as a wrapper
> around the custom backend they want to use, since there can only be one
> custom secrets backend. And implementing an in-memory cache that works
> across processes just as a custom backend is straight-up impossible.
>
> About secure caching: since I'm only caching in memory, I didn't do
> anything in that regard, but we already have something in place to
> encrypt secrets when they are saved in the metastore, using
> cryptography.fernet.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
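Raphael's "wrapper around the custom backend" workaround can be sketched as follows. This is a hedged, pure-Python illustration (the class names and the simplified `get_variable` interface are assumptions, not the real `BaseSecretsBackend` API); it also shows why the cache stays per-process: the dict lives in the memory of whichever process created it.

```python
class SlowBackend:
    """Stands in for a real secrets backend doing network round-trips."""

    def __init__(self):
        self.network_calls = 0

    def get_variable(self, key):
        self.network_calls += 1  # pretend this is a network round-trip
        return f"remote:{key}"

class CachingBackendWrapper:
    """In-process cache in front of another backend.

    Since Airflow allows only one custom secrets backend, a user wanting
    both caching and (say) a cloud backend must wrap the latter like this.
    Limitation per Raphael's point: this dict lives in ONE process, so the
    many parsing subprocesses would each hold their own independent copy;
    a cross-process cache cannot be built this way.
    """

    def __init__(self, inner):
        self._inner = inner
        self._cache = {}

    def get_variable(self, key):
        if key not in self._cache:
            self._cache[key] = self._inner.get_variable(key)
        return self._cache[key]

backend = SlowBackend()
cached = CachingBackendWrapper(backend)
cached.get_variable("shared_key")  # miss: one call to the inner backend
cached.get_variable("shared_key")  # hit: no additional backend call
```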
