Hey all, I want to run Airflow on master in a test environment. My thought was simply to:
1. Generate some test DAGs that exercise various features (timeouts, SLAs, pools, start/end dates, etc.).
2. Auto-install master every day, and restart Airflow to run the latest code.
3. Validate that things are working properly.

First, does this sound useful? If so, does my plan of attack sound like the right one?

(3) is what I'm running into trouble with right now. I'm really struggling to figure out the best way to monitor for when things go wrong. Some of the things I want to monitor are:

- All DAGs are executed according to their cron schedule (if set)
- Timeouts are timing out as expected
- All parallelism limits are honored
- SLA misses are being logged appropriately
- Task priority is honored
- There are no DAG import errors
- start_date/end_date are honored properly for DAGs

What I tried to do this morning was write an operator for each one of these, and have a DAG that would do the validation. Some problems with this approach:

1. It doesn't work well for DAGs that change over time (e.g. a new task was added, or the schedule_interval was changed). As soon as the DAG changes, the prior executions appear to be misbehaving; if you change a schedule_interval from daily to hourly, the past looks like it missed a bunch of executions.
2. Some of these are fairly annoying to test. For example, validating that parallelism is honored means looking at the start_date/end_date of every task instance and DAG run, and making sure they didn't overlap in a way that exceeded one of the N parallelism knobs Airflow has.

Other ways I considered testing this:

- Use a cluster policy to somehow attach a checker task at the beginning of each DAG.
- Use a script outside of Airflow to do the checking, something that would snapshot the current state so you could diff the state of the DB over time, to get around (1) above.
- Use StatsD to monitor for errors. This was problematic because things aren't well instrumented, and a large number of the things I want to check for are active checks (wake up and verify), not metric-based.

It feels like I might be thinking about this the wrong way, so I'm looking for thoughtful suggestions. The end goal is to run synthetic DAGs on master and know when things break.

Cheers,
Chris
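
P.S. For concreteness, here's the rough shape of one of the checks I was sketching this morning (import errors plus schedule lag). It's untested, the max_lag threshold is an arbitrary number I made up, and a real version would need to derive the allowed lag from each DAG's schedule_interval, so treat it as a sketch rather than a working operator:

    # Sketch (untested): a callable that a PythonOperator, or a plain
    # cron script, could run to flag DAG import errors and DAGs whose
    # latest run is lagging well behind their schedule.
    from datetime import timedelta

    from airflow.models import DagBag, DagRun
    from airflow.utils import timezone

    def check_dags(max_lag=timedelta(hours=2)):  # arbitrary threshold
        dagbag = DagBag()
        problems = []

        # No DAG import errors.
        for path, err in dagbag.import_errors.items():
            problems.append("import error in %s: %s" % (path, err))

        # Every scheduled DAG has a reasonably recent run.
        now = timezone.utcnow()
        for dag_id, dag in dagbag.dags.items():
            if dag.schedule_interval is None:
                continue  # skip manually triggered DAGs
            runs = DagRun.find(dag_id=dag_id)
            if not runs:
                problems.append("%s: no runs at all" % dag_id)
                continue
            latest = max(run.execution_date for run in runs)
            if now - latest > max_lag:
                problems.append("%s: latest run %s is older than %s"
                                % (dag_id, latest, max_lag))

        if problems:
            raise AssertionError("\n".join(problems))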
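
P.P.S. And the "script outside of Airflow" idea, roughly: periodically dump a per-DAG summary of task instance states out of the metadata DB and diff successive snapshots. Again just a sketch (the output path is made up), but it sidesteps problem (1) because you compare snapshots instead of re-deriving what history should have looked like:

    # Sketch (untested): summarize task instance states per DAG so that
    # consecutive snapshots can be diffed outside of Airflow.
    import json

    from airflow import settings
    from airflow.models import TaskInstance
    from airflow.utils import timezone
    from sqlalchemy import func

    def snapshot(path="/tmp/airflow_snapshot.json"):  # path is arbitrary
        session = settings.Session()
        try:
            rows = (session.query(TaskInstance.dag_id, TaskInstance.state, func.count())
                    .group_by(TaskInstance.dag_id, TaskInstance.state)
                    .all())
        finally:
            session.close()
        counts = dict(("%s/%s" % (dag_id, state), n) for dag_id, state, n in rows)
        with open(path, "w") as f:
            json.dump({"taken_at": timezone.utcnow().isoformat(),
                       "counts": counts}, f, indent=2)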
