I do something similar to what you describe, where I can replace what's displayed in the "Code" view based on some other content. I was able to cobble it together by using airflow_local_settings.py to patch the function that's invoked when the "Code" view is accessed. It's a total hack, but it works for my use case (I'm on airflow 2.1.3). YMMV.

I have a python file in the dags folder that parses YAML files from a directory and publishes DAGs to the global scope based on the content of each YAML file (based on your previous messages, you probably have something similar). When creating each DAG, I add a KV pair in default_args for the file path to the underlying YAML.
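
The loader itself is nothing fancy. A minimal sketch of the pattern -- the schema, the paths, and the "yaml_source" key name here are illustrative, not my exact code:

    # dags/yaml_loader.py -- sketch only.
    import os
    from datetime import datetime

    import yaml
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    CONFIG_DIR = os.path.join(os.path.dirname(__file__), "configs")

    for name in os.listdir(CONFIG_DIR):
        if not name.endswith((".yaml", ".yml")):
            continue
        path = os.path.join(CONFIG_DIR, name)
        with open(path) as f:
            cfg = yaml.safe_load(f)
        dag = DAG(
            dag_id=cfg["dag_id"],
            schedule_interval=cfg.get("schedule"),
            start_date=datetime(2021, 1, 1),
            # The webserver-side patch below reads this path back out.
            default_args={"yaml_source": path},
        )
        with dag:
            for task in cfg["tasks"]:
                BashOperator(task_id=task["id"], bash_command=task["command"])
        # Assigning into module globals is what makes the scheduler pick the DAG up.
        globals()[cfg["dag_id"]] = dag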

To access the YAML file in the Code view on the webserver, I set up airflow_local_settings.py to monkey patch the airflow.www.views.Airflow.code function with my own version. The patched function sets up a DagBag, which allows me to access the full DAG object; I can then read the YAML path from default_args and return the file's content in the output. In my case, the scheduler and webserver run on the same VM, so I can pass just the file path between them. If you can't just pass around the file path, I suppose you could try tossing the entire YAML content in as a default_arg as well. This approach doesn't change fileloc and doesn't output the YAML file content to the dag_code table.
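
For reference, the patch looks roughly like this -- a sketch assuming 2.1.x-era internals, not my exact code (my real version reuses the view's own template rendering instead of the bare <pre> wrapper):

    # airflow_local_settings.py -- sketch only.
    import functools
    import os

    from flask import request
    from markupsafe import Markup

    from airflow.models.dagbag import DagBag
    from airflow.www.views import Airflow

    _original_code = Airflow.code

    # functools.wraps copies the attributes Flask-AppBuilder attached to the
    # original view (routes, permissions), so the patched function is still
    # registered correctly.
    @functools.wraps(_original_code)
    def _patched_code(self, *args, **kwargs):
        dag = DagBag(read_dags_from_db=True).get_dag(request.args.get("dag_id"))
        yaml_path = (dag.default_args or {}).get("yaml_source") if dag else None
        if yaml_path and os.path.exists(yaml_path):
            with open(yaml_path) as f:
                # Markup.format escapes the file content for us.
                return Markup("<pre>{}</pre>").format(f.read())
        # Fall back to the stock behaviour for plain python DAGs.
        return _original_code(self, *args, **kwargs)

    Airflow.code = _patched_code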

One obvious downside to this approach is that whenever I upgrade airflow I have to double-check in staging that my hack still works. But the function I patch is super simple and my change is like 4 extra lines of code, so I don't mind it.

Chris

On Sat, Aug 21, 2021, at 3:03 PM, Siddharth VP wrote:
> Ok. Before I give up on the idea, how so? To me it looks like a new feature
> entirely, so I'm not sure what you mean by "changing fileloc like that".
>
> On Sat, 21 Aug 2021 at 22:28, Ash Berlin-Taylor <[email protected]> wrote:
>> FYI: Changing fileloc like that will have unintended consequences for
>> execution in future versions (i.e. 2.2), so we can't do that.
>>
>> -ash
>>
>> On Sat, Aug 21 2021 at 09:50:57 +0530, Siddharth VP <[email protected]>
>> wrote:
>>> Yes, per my commit linked above, it would be the yaml file that is shown
>>> in the webserver code view (because the fileloc field points to that).
>>>
>>> I've been doing something similar to what Damian said, with the
>>> difference that I have to generate the YAMLs programmatically based on
>>> parameters received at an API endpoint - and implement all the CRUD
>>> operations on these yaml config files. During POC testing, I saw that when
>>> there were 200+ unpaused DAGs generated this way, the dag processor
>>> timeout was being hit. On increasing that timeout to a higher value, I got
>>> a message on the UI that the last scheduler heartbeat was 1 minute ago
>>> (likely because the scheduler was busy with DAG processing for the whole
>>> minute).
>>>
>>> That's what brings me to this proposal. By embedding the task of parsing
>>> the yaml/json within Airflow, dynamic dags are supported in a much more
>>> "native" way, such that all timeouts and intervals apply individually to
>>> the config files.
>>>
>>> This won't replace dag-factory <https://github.com/ajbosco/dag-factory>
>>> and other ecosystem tools (because we still need to have the "builder"
>>> python code); rather, it improves the scalability of such tools and avoids
>>> compromising scheduler availability (though I may have had this issue only
>>> because I was using 1.10.12).
>>>
>>> Would love to hear feedback on whether the patch is PR-worthy -- because I
>>> think it is quite simple (it doesn't require any schema changes, for
>>> instance) but still addresses a lot of dynamic workflow needs.
>>>
>>> On Sat, 21 Aug 2021 at 03:29, Jarek Potiuk <[email protected]> wrote:
>>>> Agree with Ash here. It's OK to present a different view of the "source"
>>>> of the DAG once we've parsed the Python code. This can be done, and it
>>>> could be as easy as
>>>>
>>>> a) adding a field to dag to point to a "definition file" if the DAGs are
>>>> produced by parsing files from the source folder
>>>> b) an API call/parameter to submit/fetch the dag (providing we implement
>>>> some form of DAG fetcher/DAG submission)
>>>>
>>>> On Fri, Aug 20, 2021 at 10:49 PM Ash Berlin-Taylor <[email protected]> wrote:
>>>>> Changing the code view to show the YAML is now "relatively" easy to
>>>>> achieve, at least from the webserver point of view, since as of 2.0 it
>>>>> doesn't read the files from disk, but from the DB.
>>>>>
>>>>> There are a lot of details, but changing the way these DagCode rows are
>>>>> written could be achievable whilst still keeping the "there must be a
>>>>> python file to generate the dag" rule.
>>>>>
>>>>> -ash
>>>>>
>>>>> On Fri, Aug 20 2021 at 20:41:55 +0000, "Shaw, Damian P."
>>>>> <[email protected]> wrote:
>>>>>> I'd personally find this very useful. There's usually extra information
>>>>>> I have about the DAG, and the current "doc_md" is usually not nearly
>>>>>> sufficient, as it's poorly placed: if I start adding a lot of info it
>>>>>> gets in the way of the regular UI. Also, last I tested, the markdown
>>>>>> formatting didn't work and neither did the other formatter options.
>>>>>>
>>>>>> But I'm not sure how much demand other people have for this.
>>>>>>
>>>>>> Thanks,
>>>>>> Damian
>>>>>>
>>>>>> *From:* Collin McNulty <[email protected]>
>>>>>> *Sent:* Friday, August 20, 2021 16:36
>>>>>> *To:* [email protected]
>>>>>> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs
>>>>>> and dynamic DAGs using JSON/YAML dataformats
>>>>>>
>>>>>> On the topic of pointing the code view to yaml, would we alternatively
>>>>>> consider adding a view to the UI that would allow arbitrary text
>>>>>> content? This could be accomplished by adding an optional parameter to
>>>>>> the dag object that allowed you to pass text (or a filepath) that would
>>>>>> then go through a renderer (e.g. markdown). It could be a readme, or
>>>>>> yaml content, or anything the author wanted.
>>>>>>
>>>>>> Collin
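>>>>>>
>>>>>> Usage might look something like this -- "about" is a placeholder name
>>>>>> for the hypothetical parameter, not an existing DAG argument:
>>>>>>
>>>>>>     from datetime import datetime
>>>>>>     from airflow import DAG
>>>>>>
>>>>>>     dag = DAG(
>>>>>>         dag_id="etl_daily",
>>>>>>         start_date=datetime(2021, 1, 1),
>>>>>>         # Hypothetical: markdown/yaml text or a filepath, rendered in
>>>>>>         # its own UI view.
>>>>>>         about="/opt/airflow/dags/configs/etl_daily.yaml",
>>>>>>     )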
>>>>>>
>>>>>> On Fri, Aug 20, 2021 at 3:27 PM Shaw, Damian P.
>>>>>> <[email protected]> wrote:
>>>>>>> FYI this is what I did on one of my past projects for Airflow.
>>>>>>>
>>>>>>> The users wanted to write their DAGs as YAML files, so my "DAG file"
>>>>>>> was a Python script that read the YAML files and converted them to
>>>>>>> DAGs. It was very easy to do and worked because of the flexibility of
>>>>>>> Airflow.
>>>>>>>
>>>>>>> The one thing that would have been nice, though, is if I could have
>>>>>>> easily changed the "code view" in Airflow to point to the relevant
>>>>>>> YAML file instead of the less useful "DAG file".
>>>>>>>
>>>>>>> Damian
>>>>>>>
>>>>>>> *From:* Jarek Potiuk <[email protected]>
>>>>>>> *Sent:* Friday, August 20, 2021 16:21
>>>>>>> *To:* [email protected]
>>>>>>> *Cc:* [email protected]
>>>>>>> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs
>>>>>>> and dynamic DAGs using JSON/YAML dataformats
>>>>>>>
>>>>>>> Airflow DAGs are Python code. This is a very basic assumption - which
>>>>>>> is not likely to change. Ever.
>>>>>>>
>>>>>>> And we are working on making it even more powerful. Writing DAGs in
>>>>>>> yaml/json makes them less powerful and less flexible. This is fine if
>>>>>>> you want to build on top of airflow and build a more declarative way
>>>>>>> of defining dags and use airflow to run it under the hood.
>>>>>>>
>>>>>>> If you think there is a group of users who can benefit from that -
>>>>>>> cool. You can publish the code to convert those to Airflow DAGs and
>>>>>>> submit it to our Ecosystem page. There are plenty of tools like "CWL -
>>>>>>> Common Workflow Language" and others:
>>>>>>> https://airflow.apache.org/ecosystem/#tools-integrating-with-airflow
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>> On Fri, Aug 20, 2021 at 2:48 PM Siddharth VP <[email protected]>
>>>>>>> wrote:
>>>>>>>> Have we considered allowing dags in json/yaml formats before? I came
>>>>>>>> up with a rather straightforward way to address parametrized and
>>>>>>>> dynamic DAGs in Airflow, which I think makes dynamic dags work at
>>>>>>>> scale.
>>>>>>>>
>>>>>>>> *Background / Current limitations:*
>>>>>>>> 1. Dynamic DAG generation using single-file methods
>>>>>>>> <https://www.astronomer.io/guides/dynamically-generating-dags#single-file-methods>
>>>>>>>> can cause scalability issues
>>>>>>>> <https://www.astronomer.io/guides/dynamically-generating-dags#scalability>
>>>>>>>> when there are too many active DAGs per file. The
>>>>>>>> dag_file_processor_timeout is applied to the loader file, so *all*
>>>>>>>> dynamically generated dags need to be processed in that time. Sure,
>>>>>>>> the timeout could be increased, but that may be undesirable (what if
>>>>>>>> there are other static DAGs in the system on which we really want to
>>>>>>>> enforce a small timeout?)
>>>>>>>> 2. Parametrizing DAGs in Airflow is difficult. There is no good way
>>>>>>>> to have multiple workflows that differ only by choices of some
>>>>>>>> constants. Using TriggerDagRunOperator to trigger a generic DAG with
>>>>>>>> conf doesn't give a native-ish experience, as it creates DagRuns of
>>>>>>>> the *triggered* dag rather than *this* dag - which also means a
>>>>>>>> single scheduler log file.
>>>>>>>>
>>>>>>>> *Suggested approach:*
>>>>>>>> 1. User writes configuration files in JSON/YAML format. The schema
>>>>>>>> can be arbitrary except for one condition: it must have a *builder*
>>>>>>>> parameter with the path to a python file.
>>>>>>>> 2. User writes the "builder" - a python file containing a make_dag
>>>>>>>> method that receives the parsed json/yaml and returns a DAG object
>>>>>>>> (see the sketch below). (Just a sample strategy; we could instead say
>>>>>>>> the file should contain a class that extends an abstract DagBuilder
>>>>>>>> class.)
>>>>>>>> 3. Airflow reads JSON/YAML files as well from the dags directory. It
>>>>>>>> parses the file, imports the builder python file, passes the parsed
>>>>>>>> json/yaml to it, and collects the generated DAG into the DagBag.
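>>>>>>>>
>>>>>>>> For illustration, a config/builder pair might look like this (the
>>>>>>>> schema beyond the *builder* key is arbitrary; all names here are
>>>>>>>> made up):
>>>>>>>>
>>>>>>>>     # configs/etl_daily.yaml would contain something like:
>>>>>>>>     #   builder: /opt/airflow/dags/builders/etl_builder.py
>>>>>>>>     #   dag_id: etl_daily
>>>>>>>>     #   schedule: "0 2 * * *"
>>>>>>>>     #   tables: [orders, customers]
>>>>>>>>
>>>>>>>>     # builders/etl_builder.py
>>>>>>>>     from datetime import datetime
>>>>>>>>
>>>>>>>>     from airflow import DAG
>>>>>>>>     from airflow.operators.bash import BashOperator
>>>>>>>>
>>>>>>>>     def make_dag(config):
>>>>>>>>         """Receive the parsed json/yaml, return a DAG object."""
>>>>>>>>         dag = DAG(
>>>>>>>>             dag_id=config["dag_id"],
>>>>>>>>             schedule_interval=config["schedule"],
>>>>>>>>             start_date=datetime(2021, 1, 1),
>>>>>>>>         )
>>>>>>>>         with dag:
>>>>>>>>             for table in config["tables"]:
>>>>>>>>                 BashOperator(task_id=f"load_{table}",
>>>>>>>>                              bash_command=f"echo loading {table}")
>>>>>>>>         return dag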
>>>>>>>>
>>>>>>>> *Sample implementation:*
>>>>>>>> See
>>>>>>>> https://github.com/siddharthvp/airflow/commit/47bad51fc4999737e9a300b134c04bbdbd04c88a;
>>>>>>>> the only major code change is in dagbag.py.
>>>>>>>>
>>>>>>>> *Result:*
>>>>>>>> Dag file processor logs show the yaml/json file (instead of the
>>>>>>>> builder python file). Each dynamically generated dag gets its own
>>>>>>>> scheduler log file.
>>>>>>>> The configs dag_dir_list_interval, min_file_process_interval, and
>>>>>>>> file_parsing_sort_mode all directly apply to dag config files.
>>>>>>>> If the json/yaml fails to parse, it's registered as an import error.
>>>>>>>>
>>>>>>>> Would like to know your thoughts on this. Thanks!
>>>>>>>> Siddharth VP
>>>>>>>
>>>>>>> --
>>>>>>> +48 660 796 129
>>>>
>>>> --
>>>> +48 660 796 129
