Hey Niko,

Very good points to discuss. I think this is something equally needed for AWS/Google but also for other "popular" services we have integrations with (and it eventually fulfils the goal of AIP-4) :)
*Context*: I am a big fan of thinking of CI systems as largely invisible to "regular" users. The best CI system (an almost impossible goal, but it should always be our compass) is a system you are not aware of until you (or someone else) introduces a change that needs an action from someone, i.e. when things get broken. In the case of System Tests the breakage might come from multiple sources: a code change, a library upgrade, a service changing its API, a required permission change, etc. Unlike other types of tests, many of those failures are independent of merged PRs; it is actually more likely that some external change will impact System Tests. Also, System Tests cannot really be run with every PR (they take too long), so there are somewhat different "usage" and "access" needs for the results. This impacts what can be the trigger for such tests. I think it's far more likely that we will run the system tests regularly on a schedule (once a day?) and manually by the release manager, either when we prepare a provider's release to verify that the providers still work, or when we want to test whether we fixed a problem reported by a failing "scheduled" run (from a PR branch). And it's not only "how you access" the results, but also how "failure notifications" are delivered and how the tests are triggered; we should answer all those questions.

*Audience of the solution:* This leads to another question: WHO will be interested in seeing the notifications and fixing the problems? I think this is the most important question to answer. I really think regular contributors (even those who contribute to, say, the Amazon provider) will not be interested and will not regularly monitor failure notifications. Unlike regular "test failure" notifications, these should not go to the users who made the PR but to those who are interested in keeping the provider "green".
It will be rather difficult in a number of cases to (automatically) engage a contributor's attention when such system tests start to fail. But eventually (and I think this split is quite obvious) there will be people interested in monitoring the overall health of a given provider. They don't have to fix everything themselves; they can merely investigate, see that this or that likely caused the problem, and "pull in" the PR contributors. But there is a watchout there: those contributors might not have, or might not want to use, access to run such system tests manually (it might cost money, might need paid accounts, might have some risks involved as data is deleted/recreated, etc.). Those contributors should see the logs/results and should be able to fix the problem, but then they should also be able to trigger system test execution on their PR when they want to check whether the problem is fixed.

*Characteristics of the solution:* I think any solution should have these characteristics:

1) Produce notifications that the executed tests have failed/succeeded. These should go to a dedicated, separate per-provider place (for example a Slack channel; with the "low" frequency of such messages, a Slack channel seems like the best idea). People who want to monitor a given provider could simply subscribe to that channel. Seeing a regular "All tests passed" and an occasional "some tests failed" message there is a great indication of a) whether the tests continue to work in general, and b) when things fail. We need a regular schedule and should notify about successes as well, to create a kind of "heartbeat" which tells the people monitoring for errors that things are not working when the heartbeat is missing.

2) Seeing logs. The notifications should contain links to logs that can be browsed by anyone (read-only). Luckily we have no "secret" information, so it could be a publicly available link. Ideally, it should be a cloud-based one (CloudWatch for AWS).
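To make the notification/heartbeat idea concrete, here is a minimal sketch assuming a per-provider Slack incoming webhook; the webhook URL, channel setup, and message format are all illustrative assumptions, not an existing setup:

```python
import json
import urllib.request

# Hypothetical per-provider webhook -- each provider team would register
# its own Slack channel and incoming-webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_message(provider: str, passed: int, failed: int, log_url: str) -> dict:
    """Build the heartbeat payload. It is posted on success AND failure,
    so a missing message itself signals that the scheduled run is broken."""
    if failed == 0:
        status = "All tests passed"
    else:
        status = f"{failed} test(s) FAILED"
    return {
        "text": f"[{provider} provider] {status} ({passed} passed). "
                f"Read-only logs: {log_url}"
    }


def notify(provider: str, passed: int, failed: int, log_url: str) -> None:
    """Post the heartbeat to the per-provider Slack channel."""
    payload = json.dumps(build_message(provider, passed, failed, log_url))
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget; real code should retry
```

Including the read-only log link directly in the message would cover both points 1) and 2) in a single notification.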
I think access to logs is absolutely crucial for anyone trying to investigate and fix a problem. And I think this is the only thing needed by anyone outside of the group that is interested in keeping the provider "green"; such individual contributors might only need to see a "green" log to compare what has changed in a particular build, but they are not really interested in the historical stats.

3) Triggering the tests. This one is tricky. It should be accessible to everyone contributing a PR, but it should be controlled somehow. I have no idea yet how this can be gated/controlled, but it is something we will need to figure out. Who, when, and how should one trigger such a build for their own PR? There might be various ways - a special comment on the PR plus some conditions (approvals?) that the PR/user should fulfil to be able to trigger it - but this is not something I have a complete proposal for.

4) Dashboard. This one is mostly interesting to the release manager and the people who are interested in keeping a given provider "green". It's OK to make it public, but it does not need to be "beautiful" or anything; it can be very "raw" output.

In this context, answering some of your questions Niko:

* I do not think we need an automated API. The frequency and nature of the "reasons" for failures are low, and I do not see a reason why we would consume it (but this might come in the future as we learn).
* A public Dashboard is fine, but public access to logs is far more important IMHO.

*Who should run the infrastructure?* Looking at the expectations above, I think it would be better for such tests to be run by the Amazon team for Amazon, the Google team for Google, etc. While it would be best if the CloudFormation scripts were published, I think it will be far more efficient to put this in the hands of the stakeholders who are most interested in getting the tests "green". That is a much more scalable solution from the community point of view.
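On the "raw" dashboard side (point 4, and the CloudWatch Dashboard with pass/fail, duration, and execution count proposed further down the thread), the feed could be as simple as the sketch below; the namespace, metric names, and dimension are assumptions for illustration, not an agreed convention:

```python
def build_metric_data(test_id: str, passed: bool, duration_s: float) -> list:
    """Translate one system-test result into CloudWatch metric records:
    pass/fail, duration, and execution count -- the three metrics
    proposed for the dashboard."""
    dims = [{"Name": "SystemTest", "Value": test_id}]
    return [
        {"MetricName": "Success", "Dimensions": dims,
         "Value": 1.0 if passed else 0.0, "Unit": "Count"},
        {"MetricName": "Duration", "Dimensions": dims,
         "Value": duration_s, "Unit": "Seconds"},
        {"MetricName": "Executions", "Dimensions": dims,
         "Value": 1.0, "Unit": "Count"},
    ]


def publish(test_id: str, passed: bool, duration_s: float) -> None:
    """Push one result to CloudWatch; a public dashboard can then graph
    the metrics without exposing any write access."""
    import boto3  # AWS SDK; requires AWS credentials in the environment

    boto3.client("cloudwatch").put_metric_data(
        Namespace="AirflowSystemTests",  # made-up namespace
        MetricData=build_metric_data(test_id, passed, duration_s),
    )
```

Since CloudWatch dashboards can be shared read-only, this would keep write access confined to whoever runs the tests while anyone can view the output.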
We are not going to publish it to our users, and it does not really need to run on our infra. I don't see a particular need for regular community members to even know what infrastructure is used to run the tests; the test execution is pretty standardised, and I think we are really interested in the output rather than the infra used to produce it.

J.

On Fri, Aug 19, 2022 at 2:53 PM Kamil Breguła <dzaku...@gmail.com> wrote:

> I don't think we have to limit ourselves so that only the committers have
> access to the Amazon account managed by the Airflow community. In the past,
> committers were supported by other people whom they trust, e.g. a committer
> asked for help from a co-worker at their company when they needed it.
>
> This means that there are no restrictions on Amazon employees using this
> account and maintaining this environment.
>
> We just have to be careful that non-committers do not have write permission
> to the repository, and that they cannot publish a new version of the
> application that could be seen as officially released by the Apache Foundation.
>
> On Fri, Aug 19, 2022, 01:30 Oliveira, Niko <oniko...@amazon.com.invalid>
> wrote:
>
>> Hey folks,
>>
>> Those of us on the AWS Airflow team (myself, Dennis F, Vincent B, Seyed
>> H) have been working on a few projects over the past few months:
>>
>> 1. Writing example dags/docs for all existing Operators in the AWS
>> Airflow provider package (done)
>>
>> 2. Writing AWS-specific logic in the Airflow codebase to support AIP-47 (done)
>>
>> 3. Converting all example dags to AIP-47 compliant system tests (just
>> over halfway done)
>>
>> All of these ultimately culminate in the goal of us running these
>> system tests at a regular cadence within Amazon (where we have access to
>> funded AWS accounts). We will run these system tests, triggered by updates
>> to airflow:main, at least once a day.
>>
>> I'd like to open a discussion on how we can vend these results back to
>> the community in a way that is most consumable for contributors, release
>> managers and users alike.
>>
>> A quick and easy approach would be to create a publicly viewable
>> CloudWatch Dashboard, with at least the following metrics for each system
>> test over time: pass/fail, duration, and execution count.
>> This would be a human-readable way to consume the current status of the
>> AWS Operators.
>>
>> If a more machine-readable format is required/preferred (e.g. for scripts
>> related to Airflow release management, perhaps) we could also put together a
>> simple API Gateway endpoint that would vend the data in a format such as
>> JSON.
>>
>> Another interesting option would be for us to publish the CloudFormation
>> templates (or the codebase used to generate the templates) for configuring
>> the system test environment and executing the tests. These could be deployed
>> to an AWS account owned and managed by the Airflow community where tests
>> would be run periodically. AWS has provided some credits in the past which
>> could be used to help fund the account. But this introduces a large
>> component that would need ownership and management by folks within the
>> Airflow community who have access to such AWS accounts and credits (likely
>> only committers/release managers?). So it might not be worth the complexity.
>>
>> I'd like to hear what folks think!
>>
>> Cheers,
>> Niko