I think this is a great approach in general. You could use files
(stored in the same shared volume as DAGs) for that.

However, I'd also point out one more extension (or a different angle) of
that kind of approach.

Some of our users (my team had the same experience) learned that it is
actually easier to generate not the config files but ... the resulting
DAGs directly. It's surprisingly easy to generate nice-looking, correct
Python code (for example using Jinja templates), and sometimes
(depending on your case) it might be easier to generate the Python code
of the DAGs directly rather than config files that will be read by
pre-defined DAGs. You can even add parsing and validation of the
generated code as part of your automated CI pipeline.
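
To make this concrete, here is a rough sketch of what I mean (the
template, the item names and the paths below are all invented - in
practice the items would come from your slow API call):

from pathlib import Path
from jinja2 import Environment

# Jinja template for one generated DAG file (hypothetical content).
DAG_TEMPLATE = """\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="{{ dag_id }}",
    schedule_interval="@daily",
    start_date=datetime(2021, 6, 1),
    catchup=False,
) as dag:
    BashOperator(task_id="process", bash_command="process.sh {{ item }}")
"""

env = Environment(keep_trailing_newline=True)
template = env.from_string(DAG_TEMPLATE)

# One generated .py file per item - each one ends up as a plain,
# static DAG in the DAGs folder.
for item in ["clickstream", "billing", "crm"]:
    Path(f"dags/generated_{item}.py").write_text(
        template.render(dag_id=f"generated_{item}", item=item)
    )

Each generated file is then an ordinary, static DAG that the scheduler
parses like any other.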

As counter-intuitive as it feels initially, this has very nice
properties. The logic of the DAGs can be more "diverse" (you can, for
example, handle different cases with different templates and choose
them on the fly), the resulting DAG code might be cleaner as it does
not have to handle all the paths, it can be formatted automatically
(with "black", for example), and you can generate a variable number of
DAG files this way. You also do not have to keep DAG code and DAG
config synchronised over time (eventually there is JUST DAG code).
Adding configuration to a DAG is actually halfway to making your
workflows "declarative" (you write imperative code, but somehow you
need to make it follow the "declarative" config). Airflow's premise is
more "imperative" in nature, and generating the code provides a
"shortcut" to its power.
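
And for the validation / "black" part, again just an illustrative
sketch (the file pattern and paths are made up) of what a CI step
could run:

import ast
import subprocess
from pathlib import Path

from airflow.models import DagBag

generated = sorted(Path("dags").glob("generated_*.py"))

# Normalise formatting so generated code looks like hand-written code.
subprocess.run(["black", *(str(p) for p in generated)], check=True)

# Cheap syntax check on each generated file.
for path in generated:
    ast.parse(path.read_text(), filename=str(path))

# Let Airflow itself parse the folder and fail the build on import errors.
dag_bag = DagBag(dag_folder="dags", include_examples=False)
assert not dag_bag.import_errors, dag_bag.import_errors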

Just a thought that you might consider.

J.

On Mon, Jun 21, 2021 at 10:23 PM Daniel Standish <[email protected]> wrote:
>
> The only hurdle to overcome with this approach is getting the file into every 
> running container (depending on your infra setup).  E.g. if worker 1 picks up 
> the "update config" task and updates a config file locally, it would not be 
> accessible in the scheduler container, or worker 2.
>
> Do you have a network drive mounted into every container so that once the 
> config file is updated it is then immediately available to all containers?  
> Or some other solution?
>
> What I have done in this scenario is have the "update config" dag update an 
> airflow variable.  Then the dynamic dag reads from that variable to generate 
> the tasks.  This avoids the file problem I describe above.  It does make a 
> call to the metastore but in practice that does not seem to be a problem.
>
> Another thing I have thought about is to generate the config file during 
> deployments and bake it into the image, but that requires more setup than the 
> variable approach, so I did not go that route.
>
> Having one "config update" dag for all such processes like this seems like a 
> pretty good way to go. But for me right now I update the config variable 
> within the dag that uses the config.
>
> On Mon, Jun 21, 2021 at 12:55 PM Dan Andreescu <[email protected]> 
> wrote:
>>
>> Hi, this is a question about best practices, as we build our Airflow 
>> instance and establish coding conventions.
>>
>> We have a few jobs that follow this pattern:
>>
>> An external API defines a list of items.  Calls to this API are slow, let's 
>> say on the order of minutes.
>> For each item in this list, we want to launch a sequence of tasks.
>>
>> From reading and playing with Airflow so far, we figure this might be a good 
>> approach:
>>
>> A separate "Generator" DAG calls the API and generates a config file with 
>> the list of items.
>> The "Actual" DAG parses at DAG parsing time, reads the config file and 
>> generates a dynamic DAG accordingly.
>>
>> Are there other preferred ways to do this kind of thing?  Thanks in advance!
>>
>>
>> Dan Andreescu
>> Wikimedia Foundation



-- 
+48 660 796 129
