Hey Niko,

Very good points to discuss. I think this is something equally needed for AWS/Google but also for other "popular" services we have integrations with (and it eventually fulfils the goal of AIP-4) :)
*Context*: I am a big fan of thinking of CI systems as largely invisible to "regular" users. The best CI system (an almost impossible goal, but it should always be our compass) is a system you are not aware of until you (or someone else) introduces a change that needs an action from someone, i.e. when things get broken. In the case of System Tests the breakage might come from multiple sources: a code change, a library upgrade, a service changing its API, a required permission change, etc. Unlike other types of tests, many of those failures are independent of merged PRs; it is actually more likely that some external change will impact System Tests. Also, System Tests cannot really be run with every PR (they take too long), so there are somewhat different "usage" and "access" needs for the results. This impacts what can be the trigger for such tests. I think it's far more likely that we will run the system tests regularly on a schedule (once a day?) and manually by the release manager, either when we prepare a provider's release to verify that the providers still work, or when we want to test whether we fixed a problem reported by a failing "scheduled" run (from a PR branch). And it's not only "how you access" the results, but also how "failure notifications" are delivered and how the tests are triggered; we should answer all those questions.

*Audience of the solution:* This leads to another question: WHO will be interested in seeing the notifications and fixing the problems? I think this is the most important question to answer. I really think regular contributors (even those who contribute to, say, the Amazon provider) will not be interested and will not regularly monitor failure notifications. Unlike regular "test failure" notifications, these should not go to the users who made the PR but to those who are interested in keeping the provider "green".
It will be rather difficult in a number of cases to (automatically) engage a contributor's attention when such system tests start to fail. But eventually (and I think this split is quite obvious) there will be people interested in monitoring the overall health of a given provider. They don't have to fix everything themselves; they can merely investigate, see that this or that likely caused the problem, and "pull in" the PR contributors. But there is a watchout there: those contributors might not have, or might not want to use, access to run such system tests manually (it might cost money, might need paid accounts, might have some risks involved as data is deleted/recreated, etc.). Those contributors should see the logs/results and should be able to fix the problem, but then they should also be able to trigger system test execution on their PR when they want to check whether the problem is fixed.

*Characteristics of the solution:* I think any solution should have these characteristics:

1) Produce notifications that the executed tests have failed/succeeded. These should go to a dedicated, separate per-provider place (for example a Slack channel; with the "low" frequency of such messages, a Slack channel seems like the best idea). People who want to monitor a given provider could simply subscribe to that channel. Seeing a regular "All tests passed" and an occasional "some tests failed" message there is a great indication of a) whether the tests continue to work in general, and b) when things fail. We need a regular schedule and should notify about successes as well, to create a kind of "heartbeat" which tells the people monitoring for errors that things are not working when the heartbeat is missing.

2) Seeing logs. The notifications should contain links to logs that can be browsed by anyone (read-only). Luckily we have no "secret" information, so it could be a publicly available link. Ideally, it should be a cloud-based one (CloudWatch for AWS).
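To make the notification/heartbeat idea concrete, here is a minimal sketch assuming a per-provider Slack incoming webhook; the webhook URL, channel setup, and message format are all illustrative assumptions, not an existing setup:

```python
import json
import urllib.request

# Hypothetical per-provider webhook -- each provider team would register
# its own Slack channel and incoming-webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_message(provider: str, passed: int, failed: int, log_url: str) -> dict:
    """Build the heartbeat payload. It is posted on success AND failure,
    so a missing message itself signals that the scheduled run is broken."""
    if failed == 0:
        status = "All tests passed"
    else:
        status = f"{failed} test(s) FAILED"
    return {
        "text": f"[{provider} provider] {status} ({passed} passed). "
                f"Read-only logs: {log_url}"
    }


def notify(provider: str, passed: int, failed: int, log_url: str) -> None:
    """Post the heartbeat to the per-provider Slack channel."""
    payload = json.dumps(build_message(provider, passed, failed, log_url))
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget; real code should retry
```

Including the read-only log link directly in the message would cover both points 1) and 2) in a single notification.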
I think access to logs is absolutely crucial for anyone trying to investigate and fix a problem. And I think this is the only thing needed by anyone outside of the group that is interested in keeping the provider "green"; such individual contributors might only need to see a "green" log to compare what has changed in a particular build, but they are not really interested in the historical stats.

3) Triggering the tests. This one is tricky. It should be accessible to everyone contributing a PR, but it should be controlled somehow. I have no idea yet how this can be gated/controlled, but it is something we will need to figure out. Who, when, and how should one trigger such a build for their own PR? There might be various ways - a special comment on the PR plus some conditions (approvals?) that the PR/user should fulfil to be able to trigger it - but this is not something I have a complete proposal for.

4) Dashboard. This one is mostly interesting to the release manager and the people who are interested in keeping a given provider "green". It's OK to make it public, but it does not need to be "beautiful" or anything; it can be very "raw" output.

In this context, answering some of your questions Niko:

* I do not think we need an automated API. The frequency and nature of the "reasons" for failures are low, and I do not see a reason why we would consume it (but this might come in the future as we learn).
* A public Dashboard is fine, but public access to logs is far more important IMHO.

*Who should run the infrastructure?* Looking at the expectations above, I think it would be better for such tests to be run by the Amazon team for Amazon, the Google team for Google, etc. While it would be best if the CloudFormation scripts were published, I think it will be far more efficient to put this in the hands of the stakeholders who are most interested in getting the tests "green". That is a much more scalable solution from the community point of view.
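On the "raw" dashboard side (point 4, and the CloudWatch Dashboard with pass/fail, duration, and execution count proposed further down the thread), the feed could be as simple as the sketch below; the namespace, metric names, and dimension are assumptions for illustration, not an agreed convention:

```python
def build_metric_data(test_id: str, passed: bool, duration_s: float) -> list:
    """Translate one system-test result into CloudWatch metric records:
    pass/fail, duration, and execution count -- the three metrics
    proposed for the dashboard."""
    dims = [{"Name": "SystemTest", "Value": test_id}]
    return [
        {"MetricName": "Success", "Dimensions": dims,
         "Value": 1.0 if passed else 0.0, "Unit": "Count"},
        {"MetricName": "Duration", "Dimensions": dims,
         "Value": duration_s, "Unit": "Seconds"},
        {"MetricName": "Executions", "Dimensions": dims,
         "Value": 1.0, "Unit": "Count"},
    ]


def publish(test_id: str, passed: bool, duration_s: float) -> None:
    """Push one result to CloudWatch; a public dashboard can then graph
    the metrics without exposing any write access."""
    import boto3  # AWS SDK; requires AWS credentials in the environment

    boto3.client("cloudwatch").put_metric_data(
        Namespace="AirflowSystemTests",  # made-up namespace
        MetricData=build_metric_data(test_id, passed, duration_s),
    )
```

Since CloudWatch dashboards can be shared read-only, this would keep write access confined to whoever runs the tests while anyone can view the output.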
We are not going to publish it to our users, and it does not really need to run on our infra. I don't see a particular need for regular community members to even know what infrastructure is used to run the tests; the test execution is pretty standardised, and I think we are really interested in the output rather than the infra used to produce it.

J.

On Fri, Aug 19, 2022 at 2:53 PM Kamil Breguła <dzaku...@gmail.com> wrote:

> I don't think we have to limit ourselves so that only the committers have
> access to the Amazon account managed by the Airflow community. In the past,
> committers were supported by other people whom they trust, e.g. a committer
> asked for help from a co-worker at their company when they needed it.
>
> This means that there are no restrictions on Amazon employees using this
> account and maintaining this environment.
>
> We just have to be careful that non-committers do not have write permission
> to the repository, and that they cannot publish a new version of the
> application that could be seen as officially released by the Apache Foundation.
>
> On Fri, Aug 19, 2022, 01:30 Oliveira, Niko <oniko...@amazon.com.invalid>
> wrote:
>
>> Hey folks,
>>
>> Those of us on the AWS Airflow team (myself, Dennis F, Vincent B, Seyed
>> H) have been working on a few projects over the past few months:
>>
>> 1. Writing example dags/docs for all existing Operators in the AWS
>> Airflow provider package (done)
>>
>> 2. Writing AWS-specific logic in the Airflow codebase to support AIP-47 (done)
>>
>> 3. Converting all example dags to AIP-47 compliant system tests (just
>> over halfway done)
>>
>> All of these ultimately culminate in the goal of us running these
>> system tests at a regular cadence within Amazon (where we have access to
>> funded AWS accounts). We will run these system tests, triggered by updates
>> to airflow:main, at least once a day.
>>
>> I'd like to open a discussion on how we can vend these results back to
>> the community in a way that is most consumable for contributors, release
>> managers and users alike.
>>
>> A quick and easy approach would be to create a publicly viewable
>> CloudWatch Dashboard, with at least the following metrics for each system
>> test over time: pass/fail, duration, and execution count.
>> This would be a human-readable way to consume the current status of the
>> AWS Operators.
>>
>> If a more machine-readable format is required/preferred (e.g. for scripts
>> related to Airflow release management, perhaps) we could also put together a
>> simple API Gateway endpoint that would vend the data in a format such as
>> JSON.
>>
>> Another interesting option would be for us to publish the CloudFormation
>> templates (or the codebase used to generate the templates) for configuring
>> the system test environment and executing the tests. These could be deployed
>> to an AWS account owned and managed by the Airflow community where tests
>> would be run periodically. AWS has provided some credits in the past which
>> could be used to help fund the account. But this introduces a large
>> component that would need ownership and management by folks within the
>> Airflow community who have access to such AWS accounts and credits (likely
>> only committers/release managers?). So it might not be worth the complexity.
>>
>> I'd like to hear what folks think!
>>
>> Cheers,
>> Niko