Dev-iL opened a new pull request, #64366: URL: https://github.com/apache/airflow/pull/64366
## Context E2E tests running with 2 parallel workers produce false-positive failures in CI. The test suite runs 88 tests across chromium, firefox, and webkit, and consistently shows 2-4 failures + 2 flaky tests per run -- all caused by server unresponsiveness rather than actual bugs. ### Observed Failure Pattern From CI logs (`2026-03-28` run): - **2 failed**: `requiredAction.spec.ts`, `xcoms.spec.ts` - **2 flaky**: `dag-calendar-tab.spec.ts`, `task-instances.spec.ts` - **83 passed** (16.7m total) - **6 test files already disabled** as too flaky to run ### Failure Analysis All failures trace back to the same infrastructure-level root causes, not test logic bugs: **1. `requiredAction.spec.ts` -- Cascading server overload** ``` RequiredActionsPage.ts:271 - Timeout 120000ms exceeded while waiting on the predicate DagsPage.ts:349 - Timeout 60000ms exceeded (retry #1 and #2) ``` The HITL workflow test triggers a DAG run (heavy operation), then navigates between pages to check task states. With 2 workers, the server becomes overwhelmed during backfill/HITL operations on worker 1, causing all navigations on worker 2 to timeout. Retries also fail because the server hasn't recovered. **2. `xcoms.spec.ts` -- BasePage.navigateTo timeout** ``` BasePage.ts:56 - page.goto: Test timeout of 60000ms exceeded navigating to "http://localhost:8080/xcoms", waiting until "domcontentloaded" ``` Direct `page.goto()` timeout during server overload. The XComs page navigate call in `BasePage.navigateTo` has no server health awareness -- it just calls `page.goto()` and hopes the server responds. **3. `dag-calendar-tab.spec.ts` -- Tooltip interaction failure during slow server** ``` DagCalendarTab.ts:73 - expect(locator).toBeVisible() failed - getByTestId('calendar-tooltip') ``` Calendar cell hover fails because server-side data hasn't loaded in time. **4. `task-instances.spec.ts` -- Table load timeout** ``` TaskInstancesPage.ts:40 - locator.waitFor: Timeout 10000ms exceeded waiting for locator('table, div[role="table"]') to be visible ``` Navigation completes but table never renders because server is still processing requests from other workers. ## Root Cause Analysis Four interconnected root causes identified during investigation: | # | Root Cause | Impact | Affected Tests | |---|-----------|--------|----------------| | 1 | **No server health gating** -- Tests navigate blindly regardless of server state | When server is overwhelmed by heavy tests (HITL, backfill), all concurrent tests fail | ALL failing tests | | 2 | **BasePage.navigateTo lacks resilience** -- Simple `page.goto()` with no retry or health check | Navigation times out within test timeout if server is slow | XComs, TaskInstances, all `navigateTo` consumers | | 3 | **No inter-test recovery** -- No mechanism to wait for server to recover between tests | Heavy test on worker 1 overwhelms server, test on worker 2 fails, retries fail too | Cross-worker failures | | 4 | **Page objects bypass BasePage with fragile `page.goto()` calls** -- 12 direct `page.goto()` calls across 4 page objects bypass any protection in `navigateTo` | Even if `navigateTo` is hardened, these bypass calls remain vulnerable | RequiredActions (7), DagsPage (3), Calendar (1), Events (1) | ## Approach **Infrastructure-only changes to shared page objects and utilities. No spec files modified. No timeouts increased.** The fix is a two-layer health-aware navigation system: ### Layer 1: Health Check Utility (`tests/e2e/utils/health.ts`) New shared utility that polls `/api/v2/monitor/health` (unauthenticated endpoint already used by Breeze for startup checks). Checks HTTP 200 **AND** response time < 2s threshold. Uses exponential backoff intervals `[1s, 2s, 4s, 8s]` with 60s max wait. When the server is healthy (the common case), returns immediately with negligible overhead. ### Layer 2: Health-Aware BasePage Navigation (`BasePage.goto()`) Added a `protected goto()` method to `BasePage` that wraps `waitForServerReady()` + `page.goto()`. All page object subclasses use `this.goto()` instead of `this.page.goto()` directly. The existing `navigateTo()` delegates to `this.goto()`. This ensures every navigation path -- whether through the public `navigateTo` API or custom page object methods -- is health-checked. ### Design Decisions | Decision | Choice | Rationale | |----------|--------|-----------| | Health check location | Navigation-level (not `beforeEach` fixture) | No spec file changes needed. First navigation per test naturally triggers recovery. | | Timeout policy | Strict -- zero timeouts increased | Fix root cause (server unresponsiveness) not symptoms (short timeouts). | | Fix scope | Infrastructure only (page objects + utilities) | Prevents spec file churn. All 88 tests benefit without modification. | | Pattern consolidation | Single `BasePage.goto()` wrapper | Eliminates 12 scattered manual calls. Future page objects inherit protection automatically. | | Configurability | None -- hardcoded sensible defaults | No caller needs custom options. Avoids speculative interface. | ## Changes ### New File - **`tests/e2e/utils/health.ts`** -- `waitForServerReady(page)` function with backoff polling ### Modified Files - **`tests/e2e/pages/BasePage.ts`** -- Added `protected goto()` method; `navigateTo()` delegates to it - **`tests/e2e/pages/DagsPage.ts`** -- 3 direct `page.goto()` calls replaced with `this.goto()` - **`tests/e2e/pages/RequiredActionsPage.ts`** -- 7 direct `page.goto()` calls replaced with `this.goto()` - **`tests/e2e/pages/DagCalendarTab.ts`** -- 1 direct `page.goto()` call replaced with `this.goto()` - **`tests/e2e/pages/EventsPage.ts`** -- 1 direct `page.goto()` call replaced with `this.goto()` *All paths relative to `airflow-core/src/airflow/ui/`* ### Not Modified - **BackfillPage.ts** -- Already uses `navigateTo()` exclusively (only `page.request.*` for API calls) - **LoginPage.ts** -- Runs before health infra; already uses `navigateTo()` - **Zero spec files** -- All changes are in shared infrastructure - **Zero timeout values** -- No existing timeout increased ## Verification ### Automated (all PASS) | Check | Status | |-------|--------| | TypeScript compiles (`tsc --noEmit`) | PASS | | No spec files modified | PASS | | No timeout values increased | PASS | | Code bugs review (opus) | PASS -- 0 bugs found | | Code simplicity review (opus) | PASS -- complexity proportional to problem | | Code maintainability review (opus) | PASS -- praised Template Method pattern | | CLAUDE.md adherence review (opus) | PASS | | Comprehensive `page.goto()` audit | PASS -- all 18 navigation calls across 22 page objects are health-checked | ### Manual (required before merge) - [ ] Run `breeze testing ui-e2e-tests` and compare results against baseline - [ ] Verify health check is a no-op when server is responsive (no latency regression) ## How It Works ``` Test calls this.navigateTo("/dags") -> BasePage.navigateTo() calls this.goto("/dags", { waitUntil: "domcontentloaded" }) -> BasePage.goto() calls waitForServerReady(this.page) -> GET /api/v2/monitor/health -> 200 in < 2s? Return immediately (fast path, ~0ms overhead) -> Slow/failed? Backoff [1s, 2s, 4s, 8s], retry up to 60s -> Still failing after 60s? Throw descriptive error -> this.page.goto(path, options) Test calls this.goto("/dags/my_dag/runs/abc123") (subclass direct call) -> Same flow as above -- health check + goto ``` When the server is healthy (the normal case), `waitForServerReady` completes on the first attempt with negligible overhead. When the server is overloaded (the flaky CI case), the health check waits with backoff until the server recovers, then proceeds with navigation. This eliminates the root cause of cascading failures across parallel workers. <!-- SPDX-License-Identifier: Apache-2.0 https://www.apache.org/licenses/LICENSE-2.0 --> <!-- Thank you for contributing! Please provide above a brief description of the changes made in this pull request. Write a good git commit message following this guide: http://chris.beams.io/posts/git-commit/ Please make sure that your code changes are covered with tests. And in case of new features or big changes remember to adjust the documentation. Feel free to ping (in general) for the review if you do not see reaction for a few days (72 Hours is the minimum reaction time you can expect from volunteers) - we sometimes miss notifications. In case of an existing issue, reference it using one of the following: * closes: #ISSUE * related: #ISSUE --> --- ##### Was generative AI tooling used to co-author this PR? <!-- If generative AI tooling has been used in the process of authoring this PR, please change below checkbox to `[X]` followed by the name of the tool, uncomment the "Generated-by". --> - [x] Yes (please specify the tool below) Generated-by: Claude Opus 4.6 following [the guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions) --- * Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)** for more information. Note: commit author/co-author name and email in commits become permanently public when merged. * For fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed. * When adding dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). * For significant user-facing changes create newsfragment: `{pr_number}.significant.rst`, in [airflow-core/newsfragments](https://github.com/apache/airflow/tree/main/airflow-core/newsfragments). You can add this file in a follow-up commit after the PR is created so you know the PR number. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
