Ok. Before I give up on the idea, how so? To me it looks like an entirely new
feature, so I'm not sure what you mean by "changing fileloc like that".

On Sat, 21 Aug 2021 at 22:28, Ash Berlin-Taylor <[email protected]> wrote:

> FYI: Changing fileloc like that will have unintended consequences for
> execution in future versions (i.e. 2.2), so we can't do that.
>
> -ash
>
> On Sat, Aug 21 2021 at 09:50:57 +0530, Siddharth VP <[email protected]>
> wrote:
>
> Yes per my commit linked above, it would be the yaml file that is shown in
> the webserver code view (because the fileloc field points to that).
>
> I've been doing something similar to what Damian said, but with the
> difference that I have to generate the YAMLs programmatically based on
> parameters received at an API endpoint - and implement all the CRUD
> operations on these YAML config files. During POC testing, I saw that when
> there were 200+ unpaused DAGs generated this way, the dag processor timeout
> was being hit. On increasing that timeout to a higher value, I got a
> message on the UI that the last scheduler heartbeat was 1 minute ago (which
> is likely because the scheduler was busy with DAG processing for the whole
> minute).
>
> That's what brings me to this proposal. By embedding the task of parsing
> the yaml/json within Airflow, dynamic dags are supported in a much more
> "native" way, such that all timeouts and intervals apply individually to
> the config files.
>
> This won't replace dag-factory <https://github.com/ajbosco/dag-factory> and
> other ecosystem tools (because we still need to have the "builder" Python
> code); rather, it improves scalability when using such tools and avoids
> compromising scheduler availability (though I may have had this issue
> only because I was using 1.10.12).
>
> Would love to hear feedback on whether the patch is PR-worthy -- because I
> think it is quite simple (doesn't require any schema changes for instance)
> but still addresses a lot of dynamic workflow needs.
>
> On Sat, 21 Aug 2021 at 03:29, Jarek Potiuk <[email protected]> wrote:
>
>> Agree with Ash here. It's OK to present a different view of the "source" of
>> the DAG once we've parsed the Python code. This can be done and it could be as
>> easy as
>>
>> a) adding a field to dag to point to a "definition file" if the DAGs are
>> produced by parsing files from source folder
>> b) API call/parameter to submit/fetch the dag (providing we implement
>> some form of DAG fetcher/DAG submission)
>>
>> On Fri, Aug 20, 2021 at 10:49 PM Ash Berlin-Taylor <[email protected]>
>> wrote:
>>
>>> Changing the code view to show the YAML is now "relatively" easy to
>>> achieve, at least from the webserver's point of view: since 2.0 it doesn't
>>> read the files from disk, but from the DB.
>>>
>>> There are a lot of details, but changing the way these DagCode rows are
>>> written could be achievable whilst still keeping the rule that "there must
>>> be a Python file to generate the DAG".
>>>
>>> -ash
>>>
>>> On Fri, Aug 20 2021 at 20:41:55 +0000, "Shaw, Damian P." <
>>> [email protected]> wrote:
>>>
>>> I’d personally find this very useful. There’s usually extra information
>>> I have about the DAG, and the current “docs_md” is usually not nearly
>>> sufficient, as it’s poorly placed: if I start adding a lot of info it gets
>>> in the way of the regular UI. Also, last I tested, the markdown
>>> formatting didn’t work and neither did the other formatter options.
>>>
>>>
>>>
>>> But I’m not sure how much other people have demand for this.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Damian
>>>
>>>
>>>
>>> *From:* Collin McNulty <[email protected]>
>>> *Sent:* Friday, August 20, 2021 16:36
>>> *To:* [email protected]
>>> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs
>>> and dynamic DAGs using JSON/YAML dataformats
>>>
>>>
>>>
>>> On the topic of pointing the code view to yaml, would we alternatively
>>> consider adding a view on the UI that would allow arbitrary text content?
>>> This could be accomplished by adding an optional parameter to the dag
>>> object that allowed you to pass text (or a filepath) that would then go
>>> through a renderer (e.g. markdown). It could be a readme, or yaml content
>>> or anything the author wanted.
>>>
>>>
>>>
>>> Collin
>>>
>>>
>>>
>>> On Fri, Aug 20, 2021 at 3:27 PM Shaw, Damian P. <
>>> [email protected]> wrote:
>>>
>>> FYI this is what I did on one of my past projects for Airflow.
>>>
>>>
>>>
>>> The users wanted to write their DAGs as YAML files, so my “DAG file” was
>>> a Python script that read the YAML files and converted them to DAGs. It was
>>> very easy to do and worked because of the flexibility of Airflow.
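>>>
>>> A minimal sketch of that kind of loader (illustrative file layout and key
>>> names, not the actual project's code; JSON stands in for YAML so the
>>> example needs only the standard library):

```python
# Sketch of the config-parsing half of a "DAG file" that turns config files
# into DAGs. The key names ("dag_id", "schedule", "tasks") are assumptions.
import json


def load_dag_config(text):
    """Parse one JSON config and normalize the minimal schema."""
    conf = json.loads(text)
    if "dag_id" not in conf:
        raise ValueError("config must define dag_id")
    conf.setdefault("schedule", None)  # default: no schedule
    conf.setdefault("tasks", [])       # default: no tasks
    return conf


# In the real loader (living in the dags/ folder), each parsed config would
# then be turned into a DAG, roughly:
#
#   dag = DAG(dag_id=conf["dag_id"], schedule_interval=conf["schedule"], ...)
#   for t in conf["tasks"]:
#       BashOperator(task_id=t["id"], bash_command=t["command"], dag=dag)
#   globals()[dag.dag_id] = dag  # module-level name so the DagBag finds it
```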
>>>
>>>
>>>
>>> The one thing that would have been nice, though, is if I could have easily
>>> changed the “code view” in Airflow to point to the relevant YAML file
>>> instead of the less useful “DAG file”.
>>>
>>>
>>>
>>> Damian
>>>
>>>
>>>
>>> *From:* Jarek Potiuk <[email protected]>
>>> *Sent:* Friday, August 20, 2021 16:21
>>> *To:* [email protected]
>>> *Cc:* [email protected]
>>> *Subject:* Re: [DISCUSS] Adding better support for parametrized DAGs
>>> and dynamic DAGs using JSON/YAML dataformats
>>>
>>>
>>>
>>> Airflow DAGs are Python code. This is a very basic assumption - which is
>>> not likely to change. Ever.
>>>
>>>
>>>
>>> And we are working on making it even more powerful. Writing DAGs in
>>> YAML/JSON makes them less powerful and less flexible. This is fine if you
>>> want to build on top of Airflow, create a more declarative way of
>>> defining DAGs, and use Airflow to run them under the hood.
>>>
>>> If you think there is a group of users who can benefit from that - cool.
>>> You can publish code to convert those to Airflow DAGs and submit it to
>>> our Ecosystem page. There are plenty of tools like "CWL - Common Workflow
>>> Language" and others:
>>>
>>> https://airflow.apache.org/ecosystem/#tools-integrating-with-airflow
>>>
>>>
>>>
>>> J.
>>>
>>>
>>>
>>> On Fri, Aug 20, 2021 at 2:48 PM Siddharth VP <[email protected]>
>>> wrote:
>>>
>>> Have we considered allowing DAGs in JSON/YAML formats before? I came up
>>> with a rather straightforward way to address parametrized and dynamic DAGs
>>> in Airflow, which I think makes dynamic DAGs work at scale.
>>>
>>>
>>>
>>> *Background / Current limitations:*
>>>
>>> 1. Dynamic DAG generation using single-file methods
>>> <https://www.astronomer.io/guides/dynamically-generating-dags#single-file-methods>
>>> can cause scalability issues
>>> <https://www.astronomer.io/guides/dynamically-generating-dags#scalability>
>>> when there are too many active DAGs per file. The
>>> dag_file_processor_timeout is applied to the loader file, so *all*
>>> dynamically generated DAGs need to be processed in that time. Sure, the
>>> timeout could be increased, but that may be undesirable (what if there are
>>> other static DAGs in the system on which we really want to enforce a small
>>> timeout?)
>>>
>>> 2. Parametrizing DAGs in Airflow is difficult. There is no good way to
>>> have multiple workflows that differ only in the choice of some constants.
>>> Using TriggerDagRunOperator to trigger a generic DAG with conf doesn't give
>>> a native-ish experience, as it creates DagRuns of the *triggered* DAG
>>> rather than *this* DAG - which also means a single scheduler log file.
>>>
>>>
>>>
>>> *Suggested approach:*
>>>
>>> 1. User writes configuration files in JSON/YAML format. The schema can
>>> be arbitrary except for one condition: it must have a *builder* parameter
>>> with the path to a Python file.
>>>
>>> 2. User writes the "builder" - a Python file containing a make_dag method
>>> that receives the parsed JSON/YAML and returns a DAG object. (Just a
>>> sample strategy; we could instead say the file should contain a class that
>>> extends an abstract DagBuilder class.)
>>>
>>> 3. Airflow also reads JSON/YAML files from the dags directory. It parses
>>> each file, imports the builder Python file, passes the parsed JSON/YAML to
>>> it, and collects the generated DAG into the DagBag.
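>>>
>>> The core of the loading step could look roughly like this (make_dag and
>>> the *builder* key follow the sketch above; none of this is an existing
>>> Airflow API, and JSON stands in for YAML to keep the example
>>> dependency-free):

```python
# Rough sketch of what the DagBag-side loading of a config file could do.
import importlib.util
import json


def build_dag_from_config(config_path):
    """Parse a JSON config, import its 'builder' module, call make_dag()."""
    with open(config_path) as f:
        conf = json.load(f)

    # The one required key: path to the Python file that builds the DAG.
    builder_path = conf["builder"]

    spec = importlib.util.spec_from_file_location("_dag_builder", builder_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # make_dag receives the full parsed config and returns a DAG object;
    # the DagBag would collect the result, keyed by the config's fileloc.
    return module.make_dag(conf)
```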
>>>
>>>
>>>
>>> *Sample implementation:*
>>>
>>> See
>>> https://github.com/siddharthvp/airflow/commit/47bad51fc4999737e9a300b134c04bbdbd04c88a;
>>> only major code change is in dagbag.py
>>>
>>>
>>>
>>> *Result:*
>>>
>>> The DAG file processor logs show the YAML/JSON file (instead of the
>>> builder Python file). Each dynamically generated DAG gets its own
>>> scheduler log file.
>>>
>>> The configs dag_dir_list_interval, min_file_process_interval,
>>> file_parsing_sort_mode all directly apply to dag config files.
>>>
>>> If the JSON/YAML fails to parse, it's registered as an import error.
>>>
>>>
>>>
>>> Would like to know your thoughts on this. Thanks!
>>>
>>> Siddharth VP
>>>
>>>
>>>
>>>
>>> --
>>>
>>> +48 660 796 129
>>>
>>>
>>>
>>>
>>> ==============================================================================
>>> Please access the attached hyperlink for an important electronic
>>> communications disclaimer:
>>> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
>>>
>>> ==============================================================================
>>>
>>>
>>>
>>>
>>
>>
>
