The only hurdle to overcome with this approach is getting the file into every running container (depending on your infra setup). E.g. if worker 1 picks up the "update config" task and updates a config file locally, that file would not be accessible to the scheduler container or to worker 2.
Do you have a network drive mounted into every container so that once the config file is updated it is immediately available to all containers? Or some other solution?

What I have done in this scenario is have the "update config" dag update an Airflow Variable, and then the dynamic dag reads from that Variable to generate its tasks. This avoids the file problem I describe above. It does make a call to the metastore at parse time, but in practice that has not been a problem. Another thing I have thought about is generating the config file during deployments and baking it into the image, but that requires more setup than the Variable approach, so I did not go that route.

Having one "config update" dag for all processes like this seems like a pretty good way to go, but for me right now I update the config Variable within the dag that uses the config. (A rough sketch of the Variable approach is appended below the quoted message.)

On Mon, Jun 21, 2021 at 12:55 PM Dan Andreescu <[email protected]> wrote:

> Hi, this is a question about best practices, as we build our AirFlow
> instance and establish coding conventions.
>
> We have a few jobs that follow this pattern:
>
>    - An external API defines a list of items. Calls to this API are
>      slow, let's say on the order of minutes.
>    - For each item in this list, we want to launch a sequence of tasks.
>
> So far reading and playing with AirFlow, we figure this might be a good
> approach:
>
>    1. A separate "Generator" DAG calls the API and generates a config
>       file with the list of items.
>    2. The "Actual" DAG parses at DAG parsing time, reads the config file
>       and generates a dynamic DAG accordingly.
>
> Are there other preferred ways to do this kind of thing? Thanks in
> advance!
>
>
> Dan Andreescu
> Wikimedia Foundation
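
P.S. In case it helps, here is a rough sketch of the Variable pattern I described above, assuming an Airflow 2.x environment. The dag ids, the Variable name ("item_list"), the schedules, and the stubbed API call are all made-up placeholders, not anything from your setup:

import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def _refresh_item_list(**_):
    # Placeholder for the slow external API call.
    items = ["item_a", "item_b", "item_c"]
    # Store the list in the metastore so every scheduler/worker sees it.
    Variable.set("item_list", json.dumps(items))


# The "update config" dag: refreshes the Variable on whatever cadence
# the external API can tolerate.
with DAG(
    dag_id="update_item_list_config",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as update_dag:
    PythonOperator(
        task_id="refresh_item_list",
        python_callable=_refresh_item_list,
    )


# The dynamic dag: reads the Variable at parse time (one metastore call
# per parse) and generates one task per item.
with DAG(
    dag_id="process_items",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as process_dag:
    items = json.loads(Variable.get("item_list", default_var="[]"))
    for item in items:
        PythonOperator(
            task_id=f"process_{item}",
            python_callable=lambda item=item: print(f"processing {item}"),
        )

The default_var="[]" keeps the dag file parseable before the config dag has run for the first time; the item tasks simply do not exist until the Variable is populated.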
