fxgagnon opened a new issue, #8847:
URL: https://github.com/apache/incubator-devlake/issues/8847

   ### Search before asking
   
   - [x] I have searched the 
[issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   When using **GitHub App authentication** (not PAT) on a GitHub connection, 
the connection silently becomes non-functional for administrative operations 
after approximately 1 hour, even when no requests have been made during that 
period.
   
   **Observed symptoms:**
   
   - The private GitHub organization no longer appears when trying to add a new 
scope through the Config UI.
   - The API endpoint used to search/list repositories before creating a scope 
returns empty results, making it impossible to add new scopes.
   - No error is logged in the devlake-lake pod — the failure is completely 
silent.
   - Data collection on already-configured scopes continues to work normally 
during the same period. Blueprints run successfully, data is collected, 
pipelines complete without error.
   - Editing the connection and re-submitting the same GitHub App private key 
restores the listing functionality for another ~1 hour.
   - Restarting the devlake-lake pod (without re-editing the connection) also 
restores functionality for ~1 hour.
   
   **Key discriminating observation:**
   
   Once the connection is in the "zombie" state, REST data collection tasks 
keep succeeding on existing scopes while the remote scopes listing API returns 
empty. This decoupling strongly indicates that two different HTTP clients with 
independent token lifecycles are in use: one is being refreshed correctly, the 
other is not.
   
   ### What do you expect to happen
   
   The GitHub App installation token should be transparently refreshed for all 
HTTP clients used by the GitHub plugin — including the ones backing the remote 
scopes API (org/repo listing) and the Test Connection endpoint — the same way 
it is already refreshed for REST data collection tasks since #8746.
   
   The connection should remain fully usable indefinitely without requiring a 
pod restart or a manual re-edit of the private key.
   
   ### How to reproduce
   
   1. Deploy DevLake v1.0.3-beta10 (Helm chart) on Kubernetes.
   2. Create a GitHub connection using GitHub App authentication (App ID + 
installation ID + private key). Do not use a PAT.
   3. Create a project and add at least one scope pointing to a private 
repository of the installed organization. Verify that data collection works.
   4. Wait ~1 hour without making any request against the connection.
   5. In the Config UI, attempt to add a new scope to the project: the 
organization no longer appears in the repository picker.
   6. Via the API, attempt to search repositories through the remote scopes 
endpoint before creating a scope: the response is empty.
   7. Trigger a blueprint on an existing scope: data collection still succeeds, 
which indicates that the installation token is being refreshed somewhere in the 
process but not on the admin/listing code path.
   8. Restart the devlake-lake pod (no connection edit, no DB change): listing 
works again for another ~1 hour.
   
   ### Anything else
   
   **Root cause hypothesis**
   This appears to be the same architectural pattern as #8788, but on a 
different code path.
   
   PR #8746 introduced NewAppInstallationTokenProvider + RefreshRoundTripper 
and wired them into CreateApiClient so that REST data collection tasks get 
transparent token refresh. Issue #8788 identified that the GraphQL client in 
backend/plugins/github_graphql/impl/impl.go was not updated as part of #8746 
and still uses oauth2.StaticTokenSource, which freezes the token at task start 
time.
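
To make the static-versus-refreshed distinction concrete, here is a minimal, self-contained sketch. It is not DevLake's code: `tokenProvider`, `authTransport`, and `simulate` are hypothetical names, the GitHub API is replaced by a local stub server that rejects tokens older than one hour, and a fake clock stands in for real waiting. Under those assumptions, a client that froze its token at startup gets 401 after the TTL, while a client that asks a provider for the token on every request keeps working.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
	"time"
)

// tokenProvider caches a short-lived token and mints a new one shortly
// before the cached one expires. Hypothetical type, for illustration only.
type tokenProvider struct {
	mu      sync.Mutex
	token   string
	expires time.Time
	mint    func() (string, time.Time) // stands in for the installation-token request
	now     func() time.Time           // injectable clock so the demo needs no sleeping
}

func (p *tokenProvider) Token() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	// Refresh one minute early so a token does not expire while in flight.
	if p.token == "" || !p.now().Before(p.expires.Add(-time.Minute)) {
		p.token, p.expires = p.mint()
	}
	return p.token
}

// authTransport sets the Authorization header from tokenFn on every request.
type authTransport struct {
	base    http.RoundTripper
	tokenFn func() string
}

func (t *authTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	clone := req.Clone(req.Context()) // a RoundTripper must not modify the caller's request
	clone.Header.Set("Authorization", "Bearer "+t.tokenFn())
	return t.base.RoundTrip(clone)
}

// simulate returns the status each client sees two simulated hours after
// startup against a stub server that rejects tokens older than one hour.
func simulate() (staticStatus, refreshingStatus int, err error) {
	var mu sync.Mutex
	now := time.Unix(0, 0)
	expiry := map[string]time.Time{}
	clock := func() time.Time {
		mu.Lock()
		defer mu.Unlock()
		return now
	}
	mint := func() (string, time.Time) {
		mu.Lock()
		defer mu.Unlock()
		tok := fmt.Sprintf("tok-%d", len(expiry)+1)
		expiry[tok] = now.Add(time.Hour)
		return tok, expiry[tok]
	}
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		tok := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		if exp, ok := expiry[tok]; !ok || !now.Before(exp) {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	frozen, _ := mint() // token captured once, e.g. at process start
	static := &http.Client{Transport: &authTransport{
		base: http.DefaultTransport, tokenFn: func() string { return frozen }}}
	provider := &tokenProvider{mint: mint, now: clock}
	refreshing := &http.Client{Transport: &authTransport{
		base: http.DefaultTransport, tokenFn: provider.Token}}

	status := func(c *http.Client) int {
		resp, e := c.Get(srv.URL)
		if e != nil {
			err = e
			return 0
		}
		resp.Body.Close()
		return resp.StatusCode
	}
	status(refreshing) // mint the provider's first token at t=0

	mu.Lock()
	now = now.Add(2 * time.Hour) // every token minted so far is now expired
	mu.Unlock()

	staticStatus = status(static)
	refreshingStatus = status(refreshing)
	return staticStatus, refreshingStatus, err
}

func main() {
	s, r, err := simulate()
	if err != nil {
		panic(err)
	}
	fmt.Println("static client:", s, "refreshing client:", r)
}
```

The point of the design is that the token is resolved at request time rather than at client construction time, so any client built on such a transport inherits the refresh behavior without further changes.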
   
   The present bug strongly suggests that the HTTP client(s) backing the remote 
scopes / connection admin APIs (org listing, repo search, test connection) 
suffer from the same issue: they were not rewired to the RefreshRoundTripper 
machinery in #8746. The installation token captured when the process starts (or 
when the connection was last edited) is cached in memory and never refreshed, so 
once the 1-hour GitHub installation token TTL expires, all admin API calls 
fail silently. The likely reason is that GitHub returns 401 Bad credentials and 
the listing code path treats the empty/errored response as "no orgs available" 
without surfacing an error.
   
   This would explain every observed characteristic:
   
   - Why it is silent: the error is swallowed at the listing layer instead of 
being logged (unlike the GraphQL path in #8788 which panics loudly).
   - Why collection still works: the REST path uses the client correctly 
refreshed by #8746.
   - Why a pod restart fixes it temporarily: the in-memory frozen token is 
replaced by a freshly minted one when the process starts.
   - Why re-editing the connection fixes it temporarily: it likely forces a new 
connection object to be built, including a new client with a fresh token.
   - Why other environments with the same version may not see it: pods that are 
restarted more frequently (rolling deploys, aggressive liveness probes, 
autoscaling) never reach the 1-hour threshold in steady state.
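
Regarding the silent-failure point above, the following sketch shows one way a listing helper could surface a non-200 response as an error instead of returning an empty list. It is an assumption-laden illustration, not the plugin's actual code: `listInstallationRepos` and `probe` are hypothetical names, and a local stub server replaces GitHub. The only real API detail used is GitHub's `GET /installation/repositories` endpoint and its `repositories[].full_name` field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// repoPage mirrors the part of the response body this sketch needs.
type repoPage struct {
	Repositories []struct {
		FullName string `json:"full_name"`
	} `json:"repositories"`
}

// listInstallationRepos returns repository names, or an error for any
// non-200 status, so an expired token cannot be mistaken for "no repos".
func listInstallationRepos(client *http.Client, baseURL string) ([]string, error) {
	resp, err := client.Get(baseURL + "/installation/repositories")
	if err != nil {
		return nil, fmt.Errorf("list repositories: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		// Include a bounded slice of the body so the cause shows up in logs.
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
		return nil, fmt.Errorf("list repositories: status %d: %s", resp.StatusCode, body)
	}
	var page repoPage
	if err := json.NewDecoder(resp.Body).Decode(&page); err != nil {
		return nil, fmt.Errorf("list repositories: decode: %w", err)
	}
	names := make([]string, 0, len(page.Repositories))
	for _, r := range page.Repositories {
		names = append(names, r.FullName)
	}
	return names, nil
}

// probe runs the helper against a stub server that always answers with
// the given status and body.
func probe(status int, body string) ([]string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
		io.WriteString(w, body)
	}))
	defer srv.Close()
	return listInstallationRepos(srv.Client(), srv.URL)
}

func main() {
	_, err := probe(http.StatusUnauthorized, `{"message":"Bad credentials"}`)
	fmt.Println("expired token:", err)
	names, err := probe(http.StatusOK, `{"repositories":[{"full_name":"acme/api"}]}`)
	fmt.Println("valid token:", names, err)
}
```

With an error returned on 401, the caller can log it or show it in the Config UI, which would have made the expired-token state visible rather than presenting an empty repository picker.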
   
   ### Version
   
    v1.0.3-beta10 
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

