FYI: Changing fileloc like that will have unintended consequences for
execution in future versions (i.e. 2.2), so we can't do that.
-ash
On Sat, Aug 21 2021 at 09:50:57 +0530, Siddharth VP
<[email protected]> wrote:
Yes, per my commit linked above, it would be the yaml file that is
shown in the webserver code view (because the fileloc field points to
that).
I've been doing something similar to what Damian said, but with the
difference that I have to generate the YAMLs programmatically based on
parameters received at an API endpoint - and implement all the CRUD
operations on these yaml config files. During POC testing, I saw that
when there were 200+ unpaused DAGs generated this way, the DAG
processor timeout was being hit. On increasing that timeout to a
higher value, I got a message on the UI that the last scheduler
heartbeat was 1 minute ago (which is likely because the scheduler was
busy with DAG processing for the whole minute).
That's what brings me to this proposal. By embedding the task of
parsing the yaml/json within Airflow, dynamic DAGs are supported in a
much more "native" way, such that all timeouts and intervals apply
individually to the config files.
This won't replace dag-factory
<https://github.com/ajbosco/dag-factory> and other ecosystem tools
(because we still need to have the "builder" python code), but rather
improve the scalability when using such tools and avoid compromising
on scheduler availability (though I may have had this issue only
because I was using 1.10.12).
Would love to hear feedback on whether the patch is PR-worthy --
because I think it is quite simple (it doesn't require any schema
changes, for instance) but still addresses a lot of dynamic workflow
needs.
On Sat, 21 Aug 2021 at 03:29, Jarek Potiuk <[email protected]> wrote:
Agree with Ash here. It's OK to present a different view of the
"source" of the DAG once we've parsed the Python code. This can be
done, and it could be as easy as
a) adding a field to the DAG to point to a "definition file" if the
DAGs are produced by parsing files from the source folder
b) an API call/parameter to submit/fetch the dag (provided we
implement some form of DAG fetcher/DAG submission)
On Fri, Aug 20, 2021 at 10:49 PM Ash Berlin-Taylor <[email protected]> wrote:
Changing the code view to show the YAML is now "relatively" easy to
achieve, at least from the webserver's point of view, since as of 2.0
it doesn't read the files from disk, but from the DB.
There are a lot of details, but changing the way these DagCode rows
are written could be achievable whilst still keeping the "there
must be a Python file to generate the DAG" requirement.
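For reference, a minimal sketch of reading that stored source back
from the DB in 2.x, assuming the DagCode model keeps its current
code() classmethod and a hypothetical fileloc (run inside an
initialized Airflow environment):

# Hedged sketch: fetch the source that the webserver's code view
# renders, keyed by the DAG's fileloc, from the dag_code table
# rather than from disk.
from airflow.models.dagcode import DagCode

source = DagCode.code("/opt/airflow/dags/example.py")  # hypothetical fileloc
print(source)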
-ash
On Fri, Aug 20 2021 at 20:41:55 +0000, "Shaw, Damian P."
<[email protected]> wrote:
I'd personally find this very useful. There's usually extra
information I have about the DAG, and the current "doc_md" is
usually not sufficient: it's poorly placed, so if I start adding a
lot of info it gets in the way of the regular UI. Also, last I
tested, the markdown formatting didn't work and neither did the
other formatter options.
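For context, a minimal sketch of the existing DAG-level doc_md
mechanism referred to above, with a hypothetical dag_id (whether the
markdown actually renders is a separate question, as noted):

# Minimal sketch of setting DAG-level docs via the doc_md attribute.
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="example_with_docs",  # hypothetical DAG
    start_date=datetime(2021, 8, 1),
    schedule_interval=None,
) as dag:
    dag.doc_md = """
    ### Example pipeline
    Extra information about the DAG, intended to be rendered as
    markdown in the DAG details view.
    """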
But I'm not sure how much demand other people have for this.

Thanks,
Damian
*From:* Collin McNulty <[email protected]>
*Sent:* Friday, August 20, 2021 16:36
*To:* [email protected]
*Subject:* Re: [DISCUSS] Adding better support for parametrized
DAGs and dynamic DAGs using JSON/YAML dataformats
On the topic of pointing the code view to YAML, would we
alternatively consider adding a view on the UI that would allow
arbitrary text content? This could be accomplished by adding an
optional parameter to the DAG object that allowed you to pass text
(or a filepath) that would then go through a renderer (e.g.
markdown). It could be a readme, YAML content, or anything the
author wanted.

Collin
On Fri, Aug 20, 2021 at 3:27 PM Shaw, Damian P.
<[email protected]> wrote:
FYI this is what I did on one of my past projects for Airflow.

The users wanted to write their DAGs as YAML files, so my "DAG
file" was a Python script that read the YAML files and converted
them to DAGs. It was very easy to do and worked because of the
flexibility of Airflow.
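A minimal sketch of that kind of loader "DAG file", assuming Airflow
2.x import paths, a hypothetical configs/ directory next to it, and a
simple YAML schema with dag_id, schedule, and tasks keys:

# dags/yaml_loader.py - hypothetical single "DAG file" that turns
# YAML configs into DAG objects and registers them in the module's
# globals() so the DAG file processor picks them up.
import glob
import os
from datetime import datetime

import yaml

from airflow import DAG
from airflow.operators.bash import BashOperator

CONFIG_DIR = os.path.join(os.path.dirname(__file__), "configs")  # hypothetical

for path in glob.glob(os.path.join(CONFIG_DIR, "*.yaml")):
    with open(path) as f:
        config = yaml.safe_load(f)

    with DAG(
        dag_id=config["dag_id"],
        start_date=datetime(2021, 8, 1),
        schedule_interval=config.get("schedule"),
        catchup=False,
    ) as dag:
        for task in config.get("tasks", []):
            BashOperator(task_id=task["id"], bash_command=task["command"])

    # A unique module-level variable per DAG is what makes Airflow
    # collect each generated DAG.
    globals()[config["dag_id"]] = dag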
The one thing that would have been nice, though, is if I could have
easily changed the "code view" in Airflow to point to the relevant
YAML file instead of the less useful "DAG file".

Damian
*From:* Jarek Potiuk <[email protected]>
*Sent:* Friday, August 20, 2021 16:21
*To:* [email protected]
*Cc:* [email protected]
*Subject:* Re: [DISCUSS] Adding better support for parametrized
DAGs and dynamic DAGs using JSON/YAML dataformats
Airflow DAGs are Python code. This is a very basic assumption -
which is not likely to change. Ever.

And we are working on making it even more powerful. Writing DAGs
in yaml/json makes them less powerful and less flexible. This is
fine if you want to build on top of Airflow and build a more
declarative way of defining DAGs, using Airflow to run it under
the hood.

If you think there is a group of users who can benefit from that
- cool. You can publish code to convert those to Airflow DAGs
and submit it to our Ecosystem page. There are plenty of tools like
"CWL - Common Workflow Language" and others:
<https://airflow.apache.org/ecosystem/#tools-integrating-with-airflow>

J.
On Fri, Aug 20, 2021 at 2:48 PM Siddharth VP
<[email protected]> wrote:
Have we considered allowing DAGs in json/yaml formats before? I
came up with a rather straightforward way to address parametrized
and dynamic DAGs in Airflow, which I think makes dynamic DAGs work
at scale.

*Background / Current limitations:*
1. Dynamic DAG generation using single-file methods
<https://www.astronomer.io/guides/dynamically-generating-dags#single-file-methods>
can cause scalability issues
<https://www.astronomer.io/guides/dynamically-generating-dags#scalability>
when there are too many active DAGs per file. The
dag_file_processor_timeout is applied to the loader file, so /all/
dynamically generated DAGs need to be processed in that time. Sure,
the timeout could be increased, but that may be undesirable (what if
there are other static DAGs in the system on which we really want to
enforce a small timeout?)
2. Parametrizing DAGs in Airflow is difficult. There is no good way
to have multiple workflows that differ only by choices of some
constants. Using TriggerDagRunOperator to trigger a generic DAG with
conf doesn't give a native-ish experience, as it creates DagRuns of
the /triggered/ DAG rather than /this/ DAG - which also means a
single scheduler log file (see the sketch just below).
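To illustrate point 2, the TriggerDagRunOperator workaround looks
roughly like this (a hedged sketch assuming Airflow 2.x import paths
and a hypothetical generic "process_customer" DAG):

# Hypothetical wrapper DAG for the "generic DAG + conf" workaround
# above. The actual DagRuns land under "process_customer" (the
# triggered DAG), not under this DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="trigger_customer_acme",  # hypothetical
    start_date=datetime(2021, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    TriggerDagRunOperator(
        task_id="trigger",
        trigger_dag_id="process_customer",  # the generic, parametrized DAG
        conf={"customer": "acme", "region": "eu"},
    )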
*Suggested approach:*
1. User writes configuration files in JSON/YAML format. The schema
can be arbitrary, except for one condition: it must have a /builder/
parameter with the path to a Python file.
2. User writes the "builder" - a Python file containing a make_dag
method that receives the parsed json/yaml and returns a DAG object.
(Just a sample strategy; we could instead say the file should contain
a class that extends an abstract DagBuilder class.)
3. Airflow reads JSON/YAML files from the dags directory as well. It
parses the file, imports the builder Python file, passes the parsed
json/yaml to it, and collects the generated DAG into the DagBag.
(A sketch follows below.)
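A rough sketch of what such a config and builder could look like
under this scheme (the file names, the /builder/ key, and the
make_dag signature are illustrative, not an existing Airflow API):

# builders/customer_builder.py - hypothetical builder file.
# A config file like dags/customer_acme.yaml containing
#   builder: builders/customer_builder.py
#   dag_id: customer_acme
#   schedule: "@daily"
#   tables: [orders, invoices]
# would be parsed by Airflow and passed to make_dag() as a dict.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def make_dag(config: dict) -> DAG:
    """Receives the parsed JSON/YAML and returns a DAG object."""
    with DAG(
        dag_id=config["dag_id"],
        start_date=datetime(2021, 8, 1),
        schedule_interval=config.get("schedule", "@daily"),
        catchup=False,
    ) as dag:
        for table in config.get("tables", []):
            BashOperator(
                task_id=f"load_{table}",
                bash_command=f"echo loading {table}",
            )
    return dag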
*Sample implementation:*
See
<https://github.com/siddharthvp/airflow/commit/47bad51fc4999737e9a300b134c04bbdbd04c88a>;
the only major code change is in dagbag.py.
*Result:*
The DAG file processor logs show the yaml/json file (instead of the
builder Python file). Each dynamically generated DAG gets its own
scheduler log file.
The configs dag_dir_list_interval, min_file_process_interval, and
file_parsing_sort_mode all apply directly to the DAG config files.
If the json/yaml fails to parse, it's registered as an import error.

Would like to know your thoughts on this. Thanks!
Siddharth VP