andreahlert opened a new pull request, #64590: URL: https://github.com/apache/airflow/pull/64590
## Summary

The auto-triage TUI currently fetches all PR data live from the GitHub API on every run. Author profiles are lost when the process exits, check status is re-queried for unchanged commits, and workflow runs make 4+ REST calls per PR with no caching. This adds up on larger batches and makes the tool slower than it needs to be.

This PR introduces a local vault layer that materializes triage data on disk so it can be reused across sessions. The vault sits between the existing code and the GitHub API: it tries a local lookup first and falls back to the API when data is missing or stale. No new external dependencies.

## Proposed approach

### Phase 1: Author profile persistence

Persist author profiles (account age, PR history, merge rates, contributed repos) to disk using the existing `CacheStore` pattern from `pr_cache.py`. Today these live only in an in-memory dict and are re-fetched every session. With persistence they survive across runs and are only refreshed after a configurable TTL (default: 7 days).

### Phase 2: PR metadata vault

Materialize PR metadata (number, title, author, labels, head_sha, checks state) as structured files when the auto-triage processes them. On subsequent runs, known PRs are loaded from the vault and refreshed only if the head_sha changed. This eliminates redundant GraphQL calls for PRs that haven't been updated.

### Phase 3: Hybrid API/vault lookups

Replace the hottest API call paths with vault-first lookups:

- **Check status counts**: GraphQL per commit SHA, currently no caching. If the vault has results for the same SHA, skip the API call.
- **Workflow runs**: 4+ REST calls per PR, currently no caching. Materialize on first fetch, invalidate on SHA change.

A `--use-vault` flag enables the vault layer (opt-in). Without the flag, behavior is unchanged.
### Phase 4: Directed review questions (future)

Use vault data to generate verification questions for the LLM prompt: version field checks, test coverage detection, large PR warnings, breaking change signals. Instead of the LLM analyzing the diff from scratch, it receives pointed questions based on what the vault already knows about the PR and its context.

## Current status

This is a direction proposal. Looking for feedback on the approach before writing the implementation. Specifically:

- Does the phased approach make sense?
- Any concerns with the `CacheStore`-based persistence for the vault?
- Preference on `--use-vault` opt-in vs. always-on with graceful fallback?

## Test plan

- Phase 1: Unit tests for profile persistence and TTL invalidation
- Phase 2: Unit tests for vault write/read/invalidation
- Phase 3: Integration test showing API call reduction with the vault populated
- Each phase will be a separate commit for easy review

---

##### Was generative AI tooling used to co-author this PR?

- [ ] No

---

* Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)** for more information.
* For fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed.
* When adding a dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

--

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
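As a rough illustration of Phase 4, directed questions could be derived from vault metadata with simple rules before the LLM ever sees the diff. The function and every field name below (`changed_files`, `files`, `author_merged_prs`, etc.) are hypothetical examples of what the vault might store, not an existing schema.

```python
# Hypothetical Phase 4 sketch: turn vault metadata about a PR into pointed
# questions for the LLM prompt. All field names are illustrative assumptions.
def directed_questions(pr: dict) -> list[str]:
    questions: list[str] = []
    files: list[str] = pr.get("files", [])

    # Large PR warning: size thresholds are arbitrary placeholders.
    if pr.get("changed_files", 0) > 20 or pr.get("additions", 0) > 1000:
        questions.append(
            "This PR is large. Could it be split into smaller reviewable pieces?"
        )

    # Test coverage detection: code changed without matching test changes.
    touches_code = any(
        f.endswith(".py") and not f.startswith("tests/") for f in files
    )
    touches_tests = any(f.startswith("tests/") for f in files)
    if touches_code and not touches_tests:
        questions.append(
            "Code files changed but no test files did. Is test coverage missing?"
        )

    # Version field check: flag any file whose name mentions a version.
    if any("version" in f.lower() for f in files):
        questions.append(
            "A version-related file changed. Is the bump intentional and correct?"
        )

    # Context from the author profile cached in Phase 1.
    if pr.get("author_merged_prs", 0) == 0:
        questions.append(
            "The author has no merged PRs yet. Review contributor guidelines compliance."
        )
    return questions
```

The point is that these checks are cheap against materialized vault data, so the LLM prompt can open with specific verification questions instead of an unguided diff analysis.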
