I do something similar to what you describe, where I can replace what's 
displayed in the "Code" view based on some other content. I was able to cobble 
it together by using airflow_local_settings.py to patch the function that's 
invoked when the "Code" view is accessed. It's a total hack, but it works for 
my use-case (I'm on airflow 2.1.3), so YMMV.

I have a python file in the dags folder that parses YAML files from a directory 
and publishes DAGs to the global scope based on the content of each YAML file 
(based on your previous messages, you probably have something similar). When 
creating each DAG, I add a KV pair in default_args for the file path to the 
underlying YAML. To access the YAML file in the Code view on the webserver, I 
set up airflow_local_settings.py to monkey patch the 
airflow.www.views.Airflow.code function with my own version. The patched 
function sets up a DagBag that allows me to access the full DAG object, where I 
can read the YAML file path from default_args and return the file's content in 
the output. In my case, the scheduler and webserver run on the same VM, so I can 
pass just the file path between them. If you can't just pass around the file 
path, I suppose you could try tossing the entire YAML content as a default_arg 
as well.
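
In case it's useful, here's a minimal sketch of the patch (against 2.1.3 only; 
the view internals are private API, and "yaml_filepath" is just the key my 
builder writes into default_args - use whatever key your generator sets):

    # airflow_local_settings.py
    import functools

    from flask import request
    from markupsafe import escape

    import airflow.www.views as views
    from airflow.models.dagbag import DagBag

    _original_code = views.Airflow.code  # keep the stock view as a fallback

    # functools.wraps carries over the routing attributes that
    # flask-appbuilder's @expose set on the original view function.
    @functools.wraps(_original_code)
    def _code_showing_yaml(self, *args, **kwargs):
        dag_id = request.args.get("dag_id")
        dag = DagBag(read_dags_from_db=True).get_dag(dag_id)
        yaml_path = (dag.default_args or {}).get("yaml_filepath") if dag else None
        if not yaml_path:
            # No YAML behind this DAG: fall through to the normal code view.
            return _original_code(self, *args, **kwargs)
        with open(yaml_path) as f:
            yaml_source = f.read()
        # Simplified here: my real patch copies the body of the stock view and
        # swaps in the YAML string, which keeps the template and highlighting.
        return f"<pre>{escape(yaml_source)}</pre>"

    views.Airflow.code = _code_showing_yaml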

This approach doesn't change fileloc and doesn't output the YAML file content 
to the dag_code table.

One obvious downside to this approach is that whenever I upgrade airflow I have 
to double check in staging that my hack still works. But the function I patch 
is super simple and my change is like 4 extra lines of code, so I don't mind it.

Chris

On Sat, Aug 21, 2021, at 3:03 PM, Siddharth VP wrote:
> Ok. Before I give up on the idea, how so? To me it looks like a new feature 
> entirely, so not sure what you mean by "changing fileloc like that".
> 
> On Sat, 21 Aug 2021 at 22:28, Ash Berlin-Taylor <[email protected]> wrote:
>> FYI: Changing fileloc like that will have unintended consequences for 
>> execution in future versions (i.e. 2.2), so we can't do that.
>> 
>> -ash
>> 
>> On Sat, Aug 21 2021 at 09:50:57 +0530, Siddharth VP <[email protected]> 
>> wrote:
>>> Yes, per my commit linked above, it would be the yaml file that is shown in 
>>> the webserver code view (because the fileloc field points to that). 
>>> 
>>> I've been doing something similar to what Damian said, but with the 
>>> difference that I have to generate the YAMLs programmatically based on 
>>> parameters received at an API endpoint - and implement all the CRUD 
>>> operations on these yaml config files. During POC testing, I saw that when 
>>> there were 200+ unpaused DAGs generated this way, the dag processor timeout 
>>> was being hit. On increasing that timeout to a higher value, I got a 
>>> message on the UI that the last scheduler heartbeat was 1 minute ago (which 
>>> is likely because the scheduler was busy with DAG processing for the whole 
>>> minute). 
>>> 
>>> That's what brings me to this proposal. By embedding the task of parsing 
>>> the yaml/json within Airflow, dynamic dags are supported in a much more 
>>> "native" way, such that all timeouts and intervals apply individually to 
>>> the config files. 
>>> 
>>> This won't replace dag-factory <https://github.com/ajbosco/dag-factory> and 
>>> other ecosystem tools (because we still need to have the "builder" python 
>>> code); rather, it improves scalability when using such tools and avoids 
>>> compromising scheduler availability (though I may have had this issue only 
>>> because I was using 1.10.12). 
>>> 
>>> Would love to hear feedback on whether the patch is PR-worthy -- because I 
>>> think it is quite simple (doesn't require any schema changes for instance) 
>>> but still addresses a lot of dynamic workflow needs.
>>> 
>>> On Sat, 21 Aug 2021 at 03:29, Jarek Potiuk <[email protected]> wrote:
>>>> Agree with Ash here. It's OK to present a different view of the "source" of 
>>>> the DAG once we have parsed the Python code. This can be done, and it could 
>>>> be as easy as 
>>>> 
>>>> a) adding a field to dag to point to a "definition file" if the DAGs are 
>>>> produced by parsing files from source folder
>>>> b) API call/parameter to submit/fetch the dag (providing we implement some 
>>>> form of DAG fetcher/DAG submission) 
>>>> 
>>>> On Fri, Aug 20, 2021 at 10:49 PM Ash Berlin-Taylor <[email protected]> wrote:
>>>>> Changing the code view to show the YAML is now "relatively" easy to 
>>>>> achieve, at least from the webserver point of view, since as of 2.0 it 
>>>>> doesn't read the files from disk, but from the DB.
>>>>> 
>>>>> There are a lot of details, but changing the way these DagCode rows are 
>>>>> written could be achievable whilst still keeping the "there must be a 
>>>>> python file to generate the dag" rule.
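>>>>> 
>>>>> Roughly, the read path is airflow.models.dagcode - a sketch (given some 
>>>>> DAG with a fileloc; method name as of 2.x, so check your version):
>>>>> 
>>>>>     from airflow.models.dagcode import DagCode
>>>>> 
>>>>>     # The webserver renders whatever sits in the dag_code row for the
>>>>>     # DAG's fileloc - so whatever we write into that row is what shows.
>>>>>     source = DagCode.get_code_by_fileloc(dag.fileloc)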
>>>>> 
>>>>> -ash
>>>>> 
>>>>> On Fri, Aug 20 2021 at 20:41:55 +0000, "Shaw, Damian P." 
>>>>> <[email protected]> wrote:
>>>>>> I’d personally find this very useful. There’s usually extra information 
>>>>>> I have about the DAG, and the current “docs_md” is usually not 
>>>>>> sufficient, as it’s poorly placed: if I start adding a lot of info it 
>>>>>> gets in the way of the regular UI. Also, last I tested, the markdown 
>>>>>> formatting didn’t work and neither did the other formatter options.
>>>>>> 
>>>>>> But I’m not sure how much demand other people have for this.
>>>>>> 
>>>>>> Thanks,
>>>>>> Damian
>>>>>> 
>>>>>> *From:* Collin McNulty <[email protected]> 
>>>>>> *Sent:* Friday, August 20, 2021 16:36
>>>>>> *To:* [email protected]
>>>>>> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs and 
>>>>>> dynamic DAGs using JSON/YAML dataformats
>>>>>> 
>>>>>> On the topic of pointing the code view to yaml, would we alternatively 
>>>>>> consider adding a view on the UI that would allow arbitrary text 
>>>>>> content? This could be accomplished by adding an optional parameter to 
>>>>>> the dag object that allowed you to pass text (or a filepath) that would 
>>>>>> then go through a renderer (e.g. markdown). It could be a readme, or 
>>>>>> yaml content or anything the author wanted.
>>>>>> 
>>>>>> Collin
>>>>>> 
>>>>>> On Fri, Aug 20, 2021 at 3:27 PM Shaw, Damian P. 
>>>>>> <[email protected]> wrote:
>>>>>>> FYI this is what I did on one of my past projects for Airflow.
>>>>>>> 
>>>>>>> The users wanted to write their DAGs as YAML files, so my “DAG file” was 
>>>>>>> a Python script that read the YAML files and converted them to DAGs. It 
>>>>>>> was very easy to do and worked because of the flexibility of Airflow.
>>>>>>> 
>>>>>>> The one thing that would have been nice, though, is if I could have 
>>>>>>> easily changed the “code view” in Airflow to point to the relevant YAML 
>>>>>>> file instead of the less useful “DAG file”.
>>>>>>> 
>>>>>>> Damian
>>>>>>> 
>>>>>>> *From:* Jarek Potiuk <[email protected]> 
>>>>>>> *Sent:* Friday, August 20, 2021 16:21
>>>>>>> *To:* [email protected]
>>>>>>> *Cc:* [email protected]
>>>>>>> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs 
>>>>>>> and dynamic DAGs using JSON/YAML dataformats
>>>>>>> 
>>>>>>> Airflow DAGs are Python code. This is a very basic assumption - which is 
>>>>>>> not likely to change. Ever.
>>>>>>> 
>>>>>>> And we are working on making it even more powerful. Writing DAGs in 
>>>>>>> yaml/json makes them less powerful and less flexible. This is fine if 
>>>>>>> you want to build on top of airflow and build a more declarative way of 
>>>>>>> defining dags, using airflow to run them under the hood.
>>>>>>> If you think there is a group of users who can benefit from that - 
>>>>>>> cool. You can publish code to convert those to Airflow DAGs and 
>>>>>>> submit it to our Ecosystem page. There are plenty of tools like "CWL - 
>>>>>>> Common Workflow Language" and others:
>>>>>>> https://airflow.apache.org/ecosystem/#tools-integrating-with-airflow
>>>>>>> 
>>>>>>> J.
>>>>>>> 
>>>>>>> On Fri, Aug 20, 2021 at 2:48 PM Siddharth VP <[email protected]> 
>>>>>>> wrote:
>>>>>>>> Have we considered allowing dags in json/yaml formats before? I came 
>>>>>>>> up with a rather straightforward way to address parametrized and 
>>>>>>>> dynamic DAGs in Airflow, which I think makes dynamic dags work at 
>>>>>>>> scale.
>>>>>>>> 
>>>>>>>> *Background / Current limitations:*
>>>>>>>> 1. Dynamic DAG generation using single-file methods 
>>>>>>>> <https://www.astronomer.io/guides/dynamically-generating-dags#single-file-methods>
>>>>>>>>  can cause scalability issues 
>>>>>>>> <https://www.astronomer.io/guides/dynamically-generating-dags#scalability>
>>>>>>>>  when there are too many active DAGs per file. The 
>>>>>>>> dag_file_processor_timeout is applied to the loader file, so *all* 
>>>>>>>> dynamically generated dags need to be processed in that time. Sure, the 
>>>>>>>> timeout could be increased, but that may be undesirable (what if there 
>>>>>>>> are other static DAGs in the system on which we really want to enforce 
>>>>>>>> a small timeout?) See the sketch after this list for the usual 
>>>>>>>> single-file pattern.
>>>>>>>> 2. Parametrizing DAGs in Airflow is difficult. There is no good way to 
>>>>>>>> have multiple workflows that differ only by choices of some constants. 
>>>>>>>> Using TriggerDagRunOperator to trigger a generic DAG with conf doesn't 
>>>>>>>> give a native-ish experience, as it creates DagRuns of the *triggered* 
>>>>>>>> dag rather than *this* dag - which also means a single scheduler log 
>>>>>>>> file.
>>>>>>>> 
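>>>>>>>> The usual single-file pattern, as a sketch (names made up; Airflow 
>>>>>>>> 2-style imports):
>>>>>>>> 
>>>>>>>>     # dags/generate_dags.py - one loader file emitting many DAGs;
>>>>>>>>     # dag_file_processor_timeout has to cover this entire loop.
>>>>>>>>     from datetime import datetime
>>>>>>>> 
>>>>>>>>     from airflow import DAG
>>>>>>>>     from airflow.operators.bash import BashOperator
>>>>>>>> 
>>>>>>>>     def create_dag(dag_id: str) -> DAG:
>>>>>>>>         dag = DAG(dag_id, start_date=datetime(2021, 1, 1),
>>>>>>>>                   schedule_interval="@daily")
>>>>>>>>         BashOperator(task_id="task", bash_command="echo hi", dag=dag)
>>>>>>>>         return dag
>>>>>>>> 
>>>>>>>>     for i in range(200):  # 200+ DAGs, all parsed in one pass
>>>>>>>>         globals()[f"dyn_dag_{i}"] = create_dag(f"dyn_dag_{i}")
>>>>>>>> 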
>>>>>>>> *Suggested approach:*
>>>>>>>> 1. User writes configuration files in JSON/YAML format. The schema can 
>>>>>>>> be arbitrary except for one condition: it must have a *builder* 
>>>>>>>> parameter with the path to a python file.
>>>>>>>> 2. User writes the "builder" - a python file containing a make_dag 
>>>>>>>> method that receives the parsed json/yaml and returns a DAG object. 
>>>>>>>> (Just a sample strategy; we could instead say the file should contain 
>>>>>>>> a class that extends an abstract DagBuilder class.)
>>>>>>>> 3. Airflow reads JSON/YAML files as well from the dags directory. It 
>>>>>>>> parses the file, imports the builder python file, passes the 
>>>>>>>> parsed json/yaml to it, and collects the generated DAG into the 
>>>>>>>> DagBag. A sketch of 1 and 2 follows.
>>>>>>>> 
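>>>>>>>> A minimal sketch of 1 and 2 (file names, and everything in the YAML 
>>>>>>>> besides *builder*, are made up):
>>>>>>>> 
>>>>>>>>     # dags/my_pipeline.yaml
>>>>>>>>     builder: dags/builders/simple_builder.py
>>>>>>>>     dag_id: my_pipeline
>>>>>>>>     schedule: "@daily"
>>>>>>>> 
>>>>>>>>     # dags/builders/simple_builder.py
>>>>>>>>     from datetime import datetime
>>>>>>>> 
>>>>>>>>     from airflow import DAG
>>>>>>>>     from airflow.operators.bash import BashOperator
>>>>>>>> 
>>>>>>>>     def make_dag(config: dict) -> DAG:
>>>>>>>>         """Receives the parsed json/yaml and returns a DAG object."""
>>>>>>>>         dag = DAG(config["dag_id"], start_date=datetime(2021, 1, 1),
>>>>>>>>                   schedule_interval=config.get("schedule"))
>>>>>>>>         BashOperator(task_id="hello", bash_command="echo hello", dag=dag)
>>>>>>>>         return dag
>>>>>>>> 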
>>>>>>>> *Sample implementation:*
>>>>>>>> See 
>>>>>>>> https://github.com/siddharthvp/airflow/commit/47bad51fc4999737e9a300b134c04bbdbd04c88a;
>>>>>>>>  the only major code change is in dagbag.py.
>>>>>>>> 
>>>>>>>> *Result:*
>>>>>>>> Dag file processor logs show the yaml/json file (instead of the builder 
>>>>>>>> python file). Each dynamically generated dag gets its own scheduler 
>>>>>>>> log file.
>>>>>>>> The configs dag_dir_list_interval, min_file_process_interval, and 
>>>>>>>> file_parsing_sort_mode all directly apply to dag config files.
>>>>>>>> If a json/yaml file fails to parse, it's registered as an import error.
>>>>>>>> 
>>>>>>>> Would like to know your thoughts on this. Thanks!
>>>>>>>> Siddharth VP
>>>>>>> 
>>>>>>> -- 
>>>>>>> +48 660 796 129
>>>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> +48 660 796 129
