Dev-iL opened a new pull request, #64366:
URL: https://github.com/apache/airflow/pull/64366

   ## Context
   
   E2E tests running with 2 parallel workers produce false-positive failures in 
CI. The test suite runs 88 tests across chromium, firefox, and webkit, and 
consistently shows 2-4 failures + 2 flaky tests per run -- all caused by server 
unresponsiveness rather than actual bugs.
   
   ### Observed Failure Pattern
   
   From CI logs (`2026-03-28` run):
   - **2 failed**: `requiredAction.spec.ts`, `xcoms.spec.ts`
   - **2 flaky**: `dag-calendar-tab.spec.ts`, `task-instances.spec.ts`
   - **83 passed** (16.7m total)
   - **6 test files already disabled** as too flaky to run
   
   ### Failure Analysis
   
   All failures trace back to the same infrastructure-level root causes, not 
test logic bugs:
   
   **1. `requiredAction.spec.ts` -- Cascading server overload**
   ```
   RequiredActionsPage.ts:271 - Timeout 120000ms exceeded while waiting on the 
predicate
   DagsPage.ts:349 - Timeout 60000ms exceeded (retry #1 and #2)
   ```
   The HITL workflow test triggers a DAG run (heavy operation), then navigates 
between pages to check task states. With 2 workers, the server becomes 
overwhelmed during backfill/HITL operations on worker 1, causing all 
navigations on worker 2 to timeout. Retries also fail because the server hasn't 
recovered.
   
   **2. `xcoms.spec.ts` -- BasePage.navigateTo timeout**
   ```
   BasePage.ts:56 - page.goto: Test timeout of 60000ms exceeded
   navigating to "http://localhost:8080/xcoms";, waiting until "domcontentloaded"
   ```
   Direct `page.goto()` timeout during server overload. The XComs page navigate 
call in `BasePage.navigateTo` has no server health awareness -- it just calls 
`page.goto()` and hopes the server responds.
   
   **3. `dag-calendar-tab.spec.ts` -- Tooltip interaction failure during slow 
server**
   ```
   DagCalendarTab.ts:73 - expect(locator).toBeVisible() failed - 
getByTestId('calendar-tooltip')
   ```
   Calendar cell hover fails because server-side data hasn't loaded in time.
   
   **4. `task-instances.spec.ts` -- Table load timeout**
   ```
   TaskInstancesPage.ts:40 - locator.waitFor: Timeout 10000ms exceeded
   waiting for locator('table, div[role="table"]') to be visible
   ```
   Navigation completes but table never renders because server is still 
processing requests from other workers.
   
   ## Root Cause Analysis
   
   Four interconnected root causes identified during investigation:
   
   | # | Root Cause | Impact | Affected Tests |
   |---|-----------|--------|----------------|
   | 1 | **No server health gating** -- Tests navigate blindly regardless of 
server state | When server is overwhelmed by heavy tests (HITL, backfill), all 
concurrent tests fail | ALL failing tests |
   | 2 | **BasePage.navigateTo lacks resilience** -- Simple `page.goto()` with 
no retry or health check | Navigation times out within test timeout if server 
is slow | XComs, TaskInstances, all `navigateTo` consumers |
   | 3 | **No inter-test recovery** -- No mechanism to wait for server to 
recover between tests | Heavy test on worker 1 overwhelms server, test on 
worker 2 fails, retries fail too | Cross-worker failures |
   | 4 | **Page objects bypass BasePage with fragile `page.goto()` calls** -- 
12 direct `page.goto()` calls across 4 page objects bypass any protection in 
`navigateTo` | Even if `navigateTo` is hardened, these bypass calls remain 
vulnerable | RequiredActions (7), DagsPage (3), Calendar (1), Events (1) |
   
   ## Approach
   
   **Infrastructure-only changes to shared page objects and utilities. No spec 
files modified. No timeouts increased.**
   
   The fix is a two-layer health-aware navigation system:
   
   ### Layer 1: Health Check Utility (`tests/e2e/utils/health.ts`)
   New shared utility that polls `/api/v2/monitor/health` (unauthenticated 
endpoint already used by Breeze for startup checks). Checks HTTP 200 **AND** 
response time < 2s threshold. Uses exponential backoff intervals `[1s, 2s, 4s, 
8s]` with 60s max wait. When the server is healthy (the common case), returns 
immediately with negligible overhead.
   
   ### Layer 2: Health-Aware BasePage Navigation (`BasePage.goto()`)
   Added a `protected goto()` method to `BasePage` that wraps 
`waitForServerReady()` + `page.goto()`. All page object subclasses use 
`this.goto()` instead of `this.page.goto()` directly. The existing 
`navigateTo()` delegates to `this.goto()`. This ensures every navigation path 
-- whether through the public `navigateTo` API or custom page object methods -- 
is health-checked.
   
   ### Design Decisions
   
   | Decision | Choice | Rationale |
   |----------|--------|-----------|
   | Health check location | Navigation-level (not `beforeEach` fixture) | No 
spec file changes needed. First navigation per test naturally triggers 
recovery. |
   | Timeout policy | Strict -- zero timeouts increased | Fix root cause 
(server unresponsiveness) not symptoms (short timeouts). |
   | Fix scope | Infrastructure only (page objects + utilities) | Prevents spec 
file churn. All 88 tests benefit without modification. |
   | Pattern consolidation | Single `BasePage.goto()` wrapper | Eliminates 12 
scattered manual calls. Future page objects inherit protection automatically. |
   | Configurability | None -- hardcoded sensible defaults | No caller needs 
custom options. Avoids speculative interface. |
   
   ## Changes
   
   ### New File
   - **`tests/e2e/utils/health.ts`** -- `waitForServerReady(page)` function 
with backoff polling
   
   ### Modified Files
   - **`tests/e2e/pages/BasePage.ts`** -- Added `protected goto()` method; 
`navigateTo()` delegates to it
   - **`tests/e2e/pages/DagsPage.ts`** -- 3 direct `page.goto()` calls replaced 
with `this.goto()`
   - **`tests/e2e/pages/RequiredActionsPage.ts`** -- 7 direct `page.goto()` 
calls replaced with `this.goto()`
   - **`tests/e2e/pages/DagCalendarTab.ts`** -- 1 direct `page.goto()` call 
replaced with `this.goto()`
   - **`tests/e2e/pages/EventsPage.ts`** -- 1 direct `page.goto()` call 
replaced with `this.goto()`
   
   *All paths relative to `airflow-core/src/airflow/ui/`*
   
   ### Not Modified
   - **BackfillPage.ts** -- Already uses `navigateTo()` exclusively (only 
`page.request.*` for API calls)
   - **LoginPage.ts** -- Runs before health infra; already uses `navigateTo()`
   - **Zero spec files** -- All changes are in shared infrastructure
   - **Zero timeout values** -- No existing timeout increased
   
   ## Verification
   
   ### Automated (all PASS)
   | Check | Status |
   |-------|--------|
   | TypeScript compiles (`tsc --noEmit`) | PASS |
   | No spec files modified | PASS |
   | No timeout values increased | PASS |
   | Code bugs review (opus) | PASS -- 0 bugs found |
   | Code simplicity review (opus) | PASS -- complexity proportional to problem 
|
   | Code maintainability review (opus) | PASS -- praised Template Method 
pattern |
   | CLAUDE.md adherence review (opus) | PASS |
   | Comprehensive `page.goto()` audit | PASS -- all 18 navigation calls across 
22 page objects are health-checked |
   
   ### Manual (required before merge)
   - [ ] Run `breeze testing ui-e2e-tests` and compare results against baseline
   - [ ] Verify health check is a no-op when server is responsive (no latency 
regression)
   
   ## How It Works
   
   ```
   Test calls this.navigateTo("/dags")
     -> BasePage.navigateTo() calls this.goto("/dags", { waitUntil: 
"domcontentloaded" })
       -> BasePage.goto() calls waitForServerReady(this.page)
         -> GET /api/v2/monitor/health
           -> 200 in < 2s? Return immediately (fast path, ~0ms overhead)
           -> Slow/failed? Backoff [1s, 2s, 4s, 8s], retry up to 60s
           -> Still failing after 60s? Throw descriptive error
       -> this.page.goto(path, options)
   
   Test calls this.goto("/dags/my_dag/runs/abc123")  (subclass direct call)
     -> Same flow as above -- health check + goto
   ```
   
   When the server is healthy (the normal case), `waitForServerReady` completes 
on the first attempt with negligible overhead. When the server is overloaded 
(the flaky CI case), the health check waits with backoff until the server 
recovers, then proceeds with navigation. This eliminates the root cause of 
cascading failures across parallel workers.
   
   
    <!-- SPDX-License-Identifier: Apache-2.0
         https://www.apache.org/licenses/LICENSE-2.0 -->
   
   <!--
   Thank you for contributing!
   
   Please provide above a brief description of the changes made in this pull 
request.
   Write a good git commit message following this guide: 
http://chris.beams.io/posts/git-commit/
   
   Please make sure that your code changes are covered with tests.
   And in case of new features or big changes remember to adjust the 
documentation.
   
   Feel free to ping (in general) for the review if you do not see reaction for 
a few days
   (72 Hours is the minimum reaction time you can expect from volunteers) - we 
sometimes miss notifications.
   
   In case of an existing issue, reference it using one of the following:
   
   * closes: #ISSUE
   * related: #ISSUE
   -->
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   <!--
   If generative AI tooling has been used in the process of authoring this PR, 
please
   change below checkbox to `[X]` followed by the name of the tool, uncomment 
the "Generated-by".
   -->
   
   - [x] Yes (please specify the tool below)
   
   Generated-by: Claude Opus 4.6 following [the 
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)
   
   ---
   
   * Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)**
 for more information. Note: commit author/co-author name and email in commits 
become permanently public when merged.
   * For fundamental code changes, an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals))
 is needed.
   * When adding dependency, check compliance with the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   * For significant user-facing changes create newsfragment: 
`{pr_number}.significant.rst`, in 
[airflow-core/newsfragments](https://github.com/apache/airflow/tree/main/airflow-core/newsfragments).
 You can add this file in a follow-up commit after the PR is created so you 
know the PR number.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to