Thanks, Jarek, for your email. I think this completes the reasoning behind
this AIP.
However, I recently discovered a flaw in the design that was previously overlooked.
The problem is related to the state of the test (DAG) and the trigger rules.

When the teardown task is used (e.g. to clean up resources after the test),
it has its trigger rule set to "all_done", which means it will execute every
time, no matter the states of the other tasks. This task influences the test
state (DAG Run) because if any of the tasks fail, the teardown will be
executed as the last task and its state will be propagated to the DAG Run
state (which will almost always be "success"), unexpectedly "shadowing" the
failing tasks.
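To illustrate the problem, here is a rough sketch (with made-up task names,
not the code from the PR) of a DAG where the test task fails but the
teardown, being the only leaf node, succeeds and "wins" the DAG Run state:

# Rough sketch of the problem (illustrative names only).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="example_shadowed_failure",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_resource = BashOperator(task_id="create_resource", bash_command="echo create")
    run_test = BashOperator(task_id="run_test", bash_command="exit 1")  # fails
    # Teardown runs no matter what happened upstream...
    teardown = BashOperator(
        task_id="teardown",
        bash_command="echo cleanup",
        trigger_rule=TriggerRule.ALL_DONE,
    )
    # ...and because it is the only leaf node and it succeeds, the DAG Run
    # ends up as "success" even though run_test failed.
    create_resource >> run_test >> teardown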

The cause of this situation is the fact that the DAG Run status is
determined by the "leaf nodes" (tasks that do not have any children), and
because the teardown task is a leaf node, its status can be propagated to
the whole DAG. I thought a lot about how we can overcome this and discussed
it extensively with Jarek Potiuk. The only solution that solves this problem
and doesn't require any changes to Airflow is to use another task (which I
called the `watcher` task) that watches the status of the other tasks; if
any of them fail, the watcher will also fail and propagate its status to the
DAG Run.

For that to work, I created a simple PythonOperator task with the trigger
rule set to "one_failed" that raises an exception when executed, and
attached it as a downstream task of all the other tasks. Thanks to this, if
any of the tasks in the DAG fails, the watcher task will be triggered and
fail, and because it is a leaf node, it will also fail the whole test (DAG).
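
A minimal sketch of this watcher pattern (again with illustrative names
only; the actual example DAG is in the PR mentioned below) could look like
this:

# Rough sketch of the watcher pattern (illustrative names only).
from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowException
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule


def watcher_callable():
    # Raise so that the watcher itself ends up in the "failed" state when triggered.
    raise AirflowException("Failing the DAG Run because an upstream task failed.")


with DAG(
    dag_id="example_with_watcher",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_resource = BashOperator(task_id="create_resource", bash_command="echo create")
    run_test = BashOperator(task_id="run_test", bash_command="exit 1")  # fails
    teardown = BashOperator(
        task_id="teardown",
        bash_command="echo cleanup",
        trigger_rule=TriggerRule.ALL_DONE,
    )
    # Runs only when at least one upstream task has failed, and then fails itself.
    watcher = PythonOperator(
        task_id="watcher",
        python_callable=watcher_callable,
        trigger_rule=TriggerRule.ONE_FAILED,
    )

    create_resource >> run_test >> teardown
    # Attach the watcher downstream of every task we consider "critical",
    # so it becomes a leaf node whose failure propagates to the DAG Run.
    [create_resource, run_test, teardown] >> watcher

With this setup, the failing run_test triggers the watcher, the watcher
fails, and the DAG Run is marked "failed" even though the teardown
succeeded.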

At first, it may look like an over-complication, but it's actually a clever
solution.
It uses only built-in Airflow features and does not require any changes or
new implementation. It also creates a "separation of concerns", where one
task (teardown) is responsible for cleanup, while the other one (watcher)
is responsible for monitoring the tasks. By using this watcher task, we
also have control over which tasks are "critical" for the test and need
to be watched.

This whole solution would not be needed if we did not have a teardown task
with `trigger_rule="all_done"`, but we don't want to drop this feature,
since not removing resources created for the execution of the test may lead
to a significant loss of credits (cloud computations cost money).
Also, keeping all the temporary structures may interfere with and cause
collisions between executions of system tests.

It's easier to grasp the concept with an example. I have just prepared a PR
that includes an example DAG showing this case. Since I already dug very
deeply into how trigger rules work and how states are assigned to DAG Runs
and tasks, I decided to also extend the documentation. Please take a look
at PR#21382 https://github.com/apache/airflow/pull/21382 and tell me what
you think about it. Please use the PR comment section to discuss the changes
proposed there, but if your comment is related to AIP-47, please continue
by responding to this email.

Kind regards,
Mateusz Nojek


On Sun, Feb 6, 2022 at 17:52 Jarek Potiuk <ja...@potiuk.com> wrote:

> I think Mateusz explained the points very well. I have just a few comments
> to some of the points.
>
> > 3. In general the AIP reads as if it's solved this problem, but it's
> more like it has absolved itself from solving this problem, which is much
> different. I think this approach could possibly make things even worse as
> now there is no contract or interface for how to plumb configuration and
> credentials to the system test dags. The current set of methods and files
> to plumb credentials through aren't great (and as of now are quite Google
> specific) but I think this interface can be simplified and improved rather
> than just exported wholesale for each provider to re-invent a new approach.
>
> We've discussed it extensively with Mateusz (I was also of the opinion
> that we could do some automation here). For example we could write a
> "terraform" script that creates the whole environment - set up all the
> service accounts etc. But Mateusz convinced me it will be very hard to
> "mandate" a common way of doing it for multiple "services" or "groups" of
> services. My proposal is that we should be clear in the AIP/framework that
> we don't solve it in a "common way". But instead we keep a
> "service-specific" way of preparing the environment. We might automate it -
> in a service-specific way, but having it as part of the system tests is I
> think out-of-scope. In a way we currently have it already with our
> "regular" tests. To build the AMI to run our self-hosted runners, we have a
> separate repo: https://github.com/apache/airflow-ci-infra/ - where we
> have "packer" scripts which build our image. We even tried Terraform there,
> but well - packer is "good enough". And we can have separate
> "airflow-system-tests-aws" repo and "airlfow-system-tests-gcp" repo, where
> we will - separately document and possibly automate how to build such a
> "runner"
>
> > 4.  A system that relies on good intentions like "be sure to remember to
> do X otherwise bad thing Y will happen" certainly guarantees that bad thing
> Y will happen, frequently. Humans are fallible and not great at sticking to
> a contract or interface that isn't codified. And this AIP is littered with
> statements like this. We need test infrastructure that's easier to use and
> will also enforce these best practices/requirements which are needed for
> the tests to run.
>
> Here - I wholeheartedly agree with Mateusz. This is a GREAT simplification
> to have one example file doing everything. The previous approach we had was
> extremely complex - you had scripts, pytest tests, example dags and they
> were providing (meta)data to each other and it was not only hard to reason
> about them but also hard to verify that they are "ok". The idea of just
> making it one file is great. And the amount of "be sure" is not only small
> but it actually can be very easily enforced by pre-commits. We could make
> sure that our "example_dags" in a certain provider contain the (few) lines
> that need to be copied among them - having a common "header" and "footer"
> on an example dag is a super-simple pre-commit to write. We also discussed some
> other approaches and I think it is really powerful that the scripts can be
> run as "pytest" tests and as "standalone" scripts with SequentialDebugger.
> The level of control it gives for manual runs and the level of automation
> it provides by tapping into some of the great pytest features
> (parallel runs, status, flakiness, timeouts, and a plethora of other plugins)
> - all of that makes it great to run multiple tests in the CI environment.
> This way it is very easy to run the system test locally even without
> employing pytest - when you need to run it standalone and run Pytest in CI
> or when you want to run multiple tests.
>
>
