Hi Elad,

Good point. While this does not solve that challenge directly, fsspec does implement a GitFS. This means that if we extend universal-pathlib, which ObjectStoragePath relies upon, this becomes available right away. GitFS also has an understanding of versioning and branches, so that comes in handy for AIP-63.

So making dag parsing and processing independent of the local fs gives us more flexibility towards the future.

Bolke

On Sun, 26 May 2024 at 12:57, Elad Kalif <[email protected]> wrote:

> Thank you Bolke!
> Interesting read.
>
> I have a question about what is the pain we try to solve here. Most use
> cases I encountered were about the need to sync dags from a branch in
> GitHub (or equivalent) to the Airflow DAG folder.
> Correct me if I am wrong but this AIP does not handle this. A sync
> component will still be required to sync from git to S3/GCS/Other storage
> and this AIP solves only the part that Airflow machines will be able to
> fetch the files from storage. Is that correct?
>
> On Sun, May 26, 2024 at 10:55 AM Bolke de Bruin <[email protected]> wrote:
>
> > Hi All,
> >
> > I would like to discuss a new AIP aimed at enhancing the DAG loading
> > mechanism to support reading DAGs from ephemeral storage solutions. This
> > proposal is intended to supersede AIP-5 Remote DAG Fetcher and provide a
> > more flexible and scalable approach and to prepare for AIP-63.
> >
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-71+Generalizing+DAG+Loader+and+Processor+for+Ephemeral+Storage
> >
> > *Abstract*
> > This proposal aims to generalize the DAG loader and processor to use
> > pathlib.Path for file operations instead of assuming direct OS filesystem
> > access. It includes implementing a custom module loader that supports
> > loading from ObjectStoragePath locations and other Path-like abstractions,
> > with caching capabilities provided by fsspec. Furthermore, while this AIP
> > does not directly implement DAG versioning, it creates a foundational layer
> > that can be extended to support DAG versioning as outlined in AIP-63.
> >
> > A work in progress PR can be found here:
> > https://github.com/apache/airflow/pull/39647
> >
> > *Key points for discussion*
> >
> > Previous proposals, like AIP-5, suggested using a Fetcher mechanism. Kind
> > of like an in-process git-sync. This proposal is about making that
> > redundant by fully supporting object storage locations by leveraging
> > ObjectStoragePath and fsspec caching mechanisms.
> >
> > Earlier feedback on AIP-5 was that we thought that having an additional
> > Fetcher process was out of scope of the project. With the transient
> > integration of pathlib.Path and ObjectStoragePath I think this argument
> > does not hold anymore and the demand is there. In addition the added
> > flexibility allows AIP-63 to be implemented easier (what that looks like
> > remains to be seen).
> >
> > Airflow scans DAGs often. This very likely requires a caching mechanism for
> > both the DAGs and their modules. Fsspec does implement caching, and it is
> > planned to leverage this.
> >
> > Non DAG, Non module assets as part of the DAG folder are out of scope. So
> > say for example for some reason you include a GIF. This will not
> > automatically be available without changes to your code.
> >
> > I kindly request your thoughts :-).
> >
> > Bolke
> >
> > --
> >
> > --
> > Bolke de Bruin
> > [email protected]
> >
>

--

--
Bolke de Bruin
[email protected]
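
To make the abstract's "custom module loader" idea concrete, here is a minimal sketch of loading a module through the pathlib.Path interface rather than through direct OS filesystem calls. It uses only the Python standard library. The names `PathSourceLoader` and `load_module_from_path` are hypothetical, chosen for illustration, and are not taken from the AIP or the linked PR. The assumption is that any object offering `read_bytes()` (a local `Path` here; an ObjectStoragePath should behave the same way) can be passed in. Caching is deliberately left out.

```python
import importlib.abc
import importlib.util
import sys


class PathSourceLoader(importlib.abc.SourceLoader):
    """Hypothetical loader: fetches module source via a Path-like object."""

    def __init__(self, path):
        # Any object exposing read_bytes() and a meaningful str() form.
        self._path = path

    def get_filename(self, fullname):
        # Used for tracebacks and for module.__file__.
        return str(self._path)

    def get_data(self, path):
        # Read through the Path API instead of open(); the `path` argument
        # is the string from get_filename() and is not needed here.
        return self._path.read_bytes()


def load_module_from_path(name, path):
    """Import `path` as a module called `name` and return it."""
    loader = PathSourceLoader(path)
    spec = importlib.util.spec_from_loader(name, loader, origin=str(path))
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # register before executing, as import does
    try:
        loader.exec_module(module)
    except BaseException:
        # Do not leave a half-initialised module behind.
        sys.modules.pop(name, None)
        raise
    return module
```

Calling `load_module_from_path("example_dag", some_path)` returns the executed module, so a DAG processor could then inspect its attributes for DAG objects. Because the loader never overrides `path_stats`, no bytecode cache is written, which keeps the sketch independent of a writable local filesystem.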
