Hey all, I want to run Airflow on master in a test environment. My thought was simply to:
1. Generate some test DAGs that exercise various features (timeouts, SLAs, pools, start/end dates, etc.).
2. Auto-install master every day, and restart Airflow to run the latest code.
3. Validate that things are working properly.

First, does this sound useful? If so, does my plan of attack sound like the right one?

(3) is what I'm running into trouble with right now. I'm really struggling to figure out the best way to monitor for when things go wrong. Some of the things I want to monitor are:

- All DAGs are executed according to their cron schedule (if set)
- Timeouts are timing out as expected
- All parallelism limits are honored
- SLA misses are being logged appropriately
- Task priority is honored
- There are no DAG import errors
- start_date/end_date are honored properly for DAGs

What I tried to do this morning was write an operator for each one of these, and have a DAG that would do the validation. Some problems with this approach:

1. It doesn't work well for DAGs that change over time (e.g. a new task was added, or the schedule_interval was changed). As soon as the DAG changes, the prior executions appear to be misbehaving; if you change a schedule_interval from daily to hourly, the past looks like it missed a bunch of executions.
2. Some of these are fairly annoying to test. For example, validating that parallelism is honored means looking at the start_date/end_date of every task instance and DAG run, and making sure they didn't overlap in a way that exceeded one of the N parallelism knobs Airflow has.

Other ways I considered testing this:

- Use a cluster policy to somehow attach a checker task at the beginning of each DAG.
- Use a script outside of Airflow to do the checking, something that would snapshot the current state so you could diff the state of the DB over time, to get around (1) above.
- Use StatsD to monitor for errors. This was problematic because things aren't well instrumented, and a large number of the things I want to check for are active checks (wake up and verify), not metric-based.

It feels like I might be thinking about this the wrong way, so I'm looking for thoughtful suggestions. The end goal is to run synthetic DAGs on master and know when things break.

Cheers,
Chris
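
P.S. For concreteness, here's the rough shape of one of the checks I was sketching this morning (import errors plus schedule lag). It's untested, the max_lag threshold is an arbitrary number I made up, and a real version would need to derive the allowed lag from each DAG's schedule_interval, so treat it as a sketch rather than a working operator:

    # Sketch (untested): a callable that a PythonOperator, or a plain
    # cron script, could run to flag DAG import errors and DAGs whose
    # latest run is lagging well behind their schedule.
    from datetime import timedelta

    from airflow.models import DagBag, DagRun
    from airflow.utils import timezone

    def check_dags(max_lag=timedelta(hours=2)):  # arbitrary threshold
        dagbag = DagBag()
        problems = []

        # No DAG import errors.
        for path, err in dagbag.import_errors.items():
            problems.append("import error in %s: %s" % (path, err))

        # Every scheduled DAG has a reasonably recent run.
        now = timezone.utcnow()
        for dag_id, dag in dagbag.dags.items():
            if dag.schedule_interval is None:
                continue  # skip manually triggered DAGs
            runs = DagRun.find(dag_id=dag_id)
            if not runs:
                problems.append("%s: no runs at all" % dag_id)
                continue
            latest = max(run.execution_date for run in runs)
            if now - latest > max_lag:
                problems.append("%s: latest run %s is older than %s"
                                % (dag_id, latest, max_lag))

        if problems:
            raise AssertionError("\n".join(problems))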
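
P.P.S. And the "script outside of Airflow" idea, roughly: periodically dump a per-DAG summary of task instance states out of the metadata DB and diff successive snapshots. Again just a sketch (the output path is made up), but it sidesteps problem (1) because you compare snapshots instead of re-deriving what history should have looked like:

    # Sketch (untested): summarize task instance states per DAG so that
    # consecutive snapshots can be diffed outside of Airflow.
    import json

    from airflow import settings
    from airflow.models import TaskInstance
    from airflow.utils import timezone
    from sqlalchemy import func

    def snapshot(path="/tmp/airflow_snapshot.json"):  # path is arbitrary
        session = settings.Session()
        try:
            rows = (session.query(TaskInstance.dag_id, TaskInstance.state, func.count())
                    .group_by(TaskInstance.dag_id, TaskInstance.state)
                    .all())
        finally:
            session.close()
        counts = dict(("%s/%s" % (dag_id, state), n) for dag_id, state, n in rows)
        with open(path, "w") as f:
            json.dump({"taken_at": timezone.utcnow().isoformat(),
                       "counts": counts}, f, indent=2)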
