andreahlert opened a new pull request, #64590: URL: https://github.com/apache/airflow/pull/64590
## Summary

The auto-triage TUI currently fetches all PR data live from the GitHub API on every run. Author profiles are lost when the process exits, check status is re-queried for unchanged commits, and workflow runs make 4+ REST calls per PR with no caching. This adds up on larger batches and makes the tool slower than it needs to be.

This PR introduces a local vault layer that materializes triage data on disk so it can be reused across sessions. The vault sits between the existing code and the GitHub API: it tries a local lookup first and falls back to the API when data is missing or stale. No new external dependencies.

## Proposed approach

### Phase 1: Author profile persistence

Persist author profiles (account age, PR history, merge rates, contributed repos) to disk using the existing `CacheStore` pattern from `pr_cache.py`. Today these live only in an in-memory dict and are re-fetched every session. With persistence they survive across runs and are only refreshed after a configurable TTL (default: 7 days).

### Phase 2: PR metadata vault

Materialize PR metadata (number, title, author, labels, head_sha, checks state) as structured files when the auto-triage processes them. On subsequent runs, known PRs are loaded from the vault and refreshed only if the head_sha changed. This eliminates redundant GraphQL calls for PRs that haven't been updated.

### Phase 3: Hybrid API/vault lookups

Replace the hottest API call paths with vault-first lookups:

- **Check status counts**: GraphQL per commit SHA, currently no caching. If the vault has results for the same SHA, skip the API call.
- **Workflow runs**: 4+ REST calls per PR, currently no caching. Materialize on first fetch, invalidate on SHA change.

A `--use-vault` flag enables the vault layer (opt-in). Without the flag, behavior is unchanged.
### Phase 4: Directed review questions (future)

Use vault data to generate verification questions for the LLM prompt: version field checks, test coverage detection, large PR warnings, breaking change signals. Instead of the LLM analyzing the diff from scratch, it receives pointed questions based on what the vault already knows about the PR and its context.

## Current status

This is a direction proposal. Looking for feedback on the approach before writing the implementation. Specifically:

- Does the phased approach make sense?
- Any concerns with the `CacheStore`-based persistence for the vault?
- Preference on `--use-vault` opt-in vs. always-on with graceful fallback?

## Test plan

- Phase 1: Unit tests for profile persistence and TTL invalidation
- Phase 2: Unit tests for vault write/read/invalidation
- Phase 3: Integration test showing API call reduction with the vault populated
- Each phase will be a separate commit for easy review

---

##### Was generative AI tooling used to co-author this PR?

- [ ] No

---

* Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)** for more information.
* For fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed.
* When adding a dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

--

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
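As a rough illustration of Phase 4, directed questions could be derived from vault metadata with simple rules before the LLM ever sees the diff. The function and every field name below (`changed_files`, `files`, `author_merged_prs`, etc.) are hypothetical examples of what the vault might store, not an existing schema.

```python
# Hypothetical Phase 4 sketch: turn vault metadata about a PR into pointed
# questions for the LLM prompt. All field names are illustrative assumptions.
def directed_questions(pr: dict) -> list[str]:
    questions: list[str] = []
    files: list[str] = pr.get("files", [])

    # Large PR warning: size thresholds are arbitrary placeholders.
    if pr.get("changed_files", 0) > 20 or pr.get("additions", 0) > 1000:
        questions.append(
            "This PR is large. Could it be split into smaller reviewable pieces?"
        )

    # Test coverage detection: code changed without matching test changes.
    touches_code = any(
        f.endswith(".py") and not f.startswith("tests/") for f in files
    )
    touches_tests = any(f.startswith("tests/") for f in files)
    if touches_code and not touches_tests:
        questions.append(
            "Code files changed but no test files did. Is test coverage missing?"
        )

    # Version field check: flag any file whose name mentions a version.
    if any("version" in f.lower() for f in files):
        questions.append(
            "A version-related file changed. Is the bump intentional and correct?"
        )

    # Context from the author profile cached in Phase 1.
    if pr.get("author_merged_prs", 0) == 0:
        questions.append(
            "The author has no merged PRs yet. Review contributor guidelines compliance."
        )
    return questions
```

The point is that these checks are cheap against materialized vault data, so the LLM prompt can open with specific verification questions instead of an unguided diff analysis.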
