andreahlert opened a new pull request, #64590:
URL: https://github.com/apache/airflow/pull/64590

   ## Summary
   
   The auto-triage TUI currently fetches all PR data live from the GitHub API 
on every run. Author profiles are lost when the process exits, check status is 
re-queried for unchanged commits, and workflow runs make 4+ REST calls per PR 
with no caching. This adds up on larger batches and makes the tool slower than 
it needs to be.
   
   This PR introduces a local vault layer that materializes triage data on disk 
so it can be reused across sessions. The vault sits between the existing code 
and the GitHub API: it tries a local lookup first and falls back to the API 
when data is missing or stale. No new external dependencies.
   
   ## Proposed approach
   
   ### Phase 1: Author profile persistence
   Persist author profiles (account age, PR history, merge rates, contributed 
repos) to disk using the existing `CacheStore` pattern from `pr_cache.py`. 
Today these live only in an in-memory dict and are re-fetched every session. 
With persistence they survive across runs and only refresh after a configurable 
TTL (default 7 days).
   
   ### Phase 2: PR metadata vault
   Materialize PR metadata (number, title, author, labels, head_sha, checks 
state) as structured files when the auto-triage processes them. On subsequent 
runs, known PRs are loaded from the vault and only refreshed if the head_sha 
changed. This eliminates redundant GraphQL calls for PRs that haven't been 
updated.
   
   ### Phase 3: Hybrid API/vault lookups
   Replace the hottest API call paths with vault-first lookups:
   - **Check status counts**: one GraphQL query per commit SHA, with no caching 
today. If the vault already has results for the same SHA, the API call is 
skipped.
   - **Workflow runs**: 4+ REST calls per PR, with no caching today. Materialize 
on first fetch, invalidate when the head SHA changes.
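Because check results for a given commit SHA never change, the vault-first path reduces to memoization keyed by SHA. A small sketch of the shape (the `fetch_from_api` callable stands in for the real GraphQL call, and the in-memory dict stands in for the on-disk vault):

```python
# Hypothetical vault-first lookup; names are illustrative.


def make_check_status_lookup(fetch_from_api):
    vault = {}  # keyed by commit SHA; results for a SHA never go stale

    def lookup(sha: str):
        if sha not in vault:
            vault[sha] = fetch_from_api(sha)  # hit the API only on a miss
        return vault[sha]

    return lookup
```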
   
   A `--use-vault` flag enables the vault layer (opt-in). Without the flag, 
behavior is unchanged.
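Wiring the opt-in flag is straightforward if the TUI uses `argparse` (an assumption; the sketch below shows only the flag's default-off behavior):

```python
import argparse

# Assumed argparse-based CLI; shown only to illustrate the opt-in default.
parser = argparse.ArgumentParser(prog="auto-triage")
parser.add_argument(
    "--use-vault",
    action="store_true",  # defaults to False, so behavior is unchanged
    help="enable vault-first lookups instead of live API calls",
)
```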
   
   ### Phase 4: Directed review questions (future)
   Use vault data to generate verification questions for the LLM prompt: 
version field checks, test coverage detection, large PR warnings, breaking 
change signals. Instead of the LLM analyzing the diff from scratch, it receives 
pointed questions based on what the vault already knows about the PR and its 
context.
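As a rough illustration of the kind of prompt questions Phase 4 could emit from vault metadata (the thresholds, field names, and wording here are all invented for the sketch):

```python
# Invented sketch of Phase 4 question generation; not a proposed API.


def review_questions(meta: dict) -> list[str]:
    questions = []
    if meta.get("changed_lines", 0) > 1000:
        questions.append("This PR is large; could it be split into smaller PRs?")
    if not meta.get("has_tests", False):
        questions.append("No test files detected; is test coverage needed?")
    if "breaking" in meta.get("labels", []):
        questions.append("Breaking-change label present; is it documented?")
    return questions
```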
   
   ## Current status
   
   This is a direction proposal. Looking for feedback on the approach before 
writing the implementation. Specifically:
   - Does the phased approach make sense?
   - Any concerns with the `CacheStore`-based persistence for the vault?
   - Preference on `--use-vault` opt-in vs always-on with graceful fallback?
   
   ## Test plan
   
   - Phase 1: Unit tests for profile persistence and TTL invalidation
   - Phase 2: Unit tests for vault write/read/invalidation
   - Phase 3: Integration test showing API call reduction with vault populated
   - Each phase will be a separate commit for easy review
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [ ] No
   
   ---
   
   * Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)**
 for more information.
   * For fundamental code changes, an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals))
 is needed.
   * When adding a dependency, check compliance with the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   

