fxgagnon opened a new issue, #8847: URL: https://github.com/apache/incubator-devlake/issues/8847
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and found no similar issues.

### What happened

When using **GitHub App authentication** (not PAT) on a GitHub connection, the connection silently becomes non-functional for administrative operations after approximately 1 hour, even when no requests have been made during that period.

**Observed symptoms:**

- The private GitHub organization no longer appears when trying to add a new scope through the Config UI.
- The API endpoint used to search/list repositories before creating a scope returns empty results, making it impossible to add new scopes.
- No error is logged in the devlake-lake pod: the failure is completely silent.
- Data collection on already-configured scopes continues to work normally during the same period. Blueprints run successfully, data is collected, pipelines complete without error.
- Editing the connection and re-submitting the same GitHub App private key restores the listing functionality for another ~1 hour.
- Restarting the devlake-lake pod (without re-editing the connection) also restores functionality for ~1 hour.

**Key discriminating observation:** Once the connection is in the "zombie" state, REST data collection tasks keep succeeding on existing scopes while the remote scopes listing API returns empty. This decoupling strongly indicates that two different HTTP clients with independent token lifecycles are in use: one is being refreshed correctly, the other is not.

### What do you expect to happen

The GitHub App installation token should be transparently refreshed for all HTTP clients used by the GitHub plugin, including the ones backing the remote scopes API (org/repo listing) and the Test Connection endpoint, the same way it is already refreshed for REST data collection tasks since #8746.
The connection should remain fully usable indefinitely without requiring a pod restart or a manual re-edit of the private key.

### How to reproduce

1. Deploy DevLake v1.0.3-beta10 (Helm chart) on Kubernetes.
2. Create a GitHub connection using GitHub App authentication (App ID + installation ID + private key). Do not use a PAT.
3. Create a project and add at least one scope pointing to a private repository of the installed organization. Verify that data collection works.
4. Wait ~1 hour without making any request against the connection.
5. In the Config UI, attempt to add a new scope to the project: the organization no longer appears in the repository picker.
6. Via the API, attempt to search repositories through the remote scopes endpoint before creating a scope: the response is empty.
7. Trigger a blueprint on an existing scope: data collection still succeeds, showing that the installation token is being refreshed somewhere in the process but not on the admin/listing code path.
8. Restart the devlake pod (no connection edit, no DB change): listing works again for another ~1 hour.

### Anything else

**Root cause hypothesis**

This appears to be the same architectural pattern as #8788, but on a different code path. PR #8746 introduced NewAppInstallationTokenProvider + RefreshRoundTripper and wired them into CreateApiClient so that REST data collection tasks get transparent token refresh. Issue #8788 identified that the GraphQL client in backend/plugins/github_graphql/impl/impl.go was not updated as part of #8746 and still uses oauth2.StaticTokenSource, which freezes the token at task start time.

The present bug strongly suggests that the HTTP client(s) backing the remote scopes / connection admin APIs (org listing, repo search, test connection) suffer from the same issue: they were not rewired to the RefreshRoundTripper machinery in #8746.
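
To make the hypothesis concrete, here is a minimal, self-contained Go sketch of the suspected difference between the two kinds of client. It is not DevLake's code: `staticTransport`, `refreshingTransport`, the `mint` hook, the fake GitHub round tripper, and the `github.invalid` URL are all hypothetical names made up for illustration, and the real RefreshRoundTripper from #8746 may work differently. The sketch only shows that a transport holding a token frozen at construction time starts receiving 401 after the 1-hour TTL, while a transport that re-mints on demand keeps working.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"sync"
	"time"
)

// roundTripFunc lets a plain function act as an http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// staticTransport freezes one token at construction time (the suspected bug).
type staticTransport struct {
	token string
	base  http.RoundTripper
}

func (t *staticTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	r := req.Clone(req.Context()) // a RoundTripper must not mutate the caller's request
	r.Header.Set("Authorization", "Bearer "+t.token)
	return t.base.RoundTrip(r)
}

// refreshingTransport re-mints the token shortly before it expires.
type refreshingTransport struct {
	mu        sync.Mutex
	token     string
	expiresAt time.Time
	mint      func() (string, time.Time, error) // hypothetical token minting hook
	now       func() time.Time                  // injectable clock for the simulation
	base      http.RoundTripper
}

func (t *refreshingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.mu.Lock()
	defer t.mu.Unlock() // serializes requests; acceptable for a sketch
	if t.token == "" || !t.now().Before(t.expiresAt.Add(-time.Minute)) {
		tok, exp, err := t.mint()
		if err != nil {
			return nil, err
		}
		t.token, t.expiresAt = tok, exp
	}
	r := req.Clone(req.Context())
	r.Header.Set("Authorization", "Bearer "+t.token)
	return t.base.RoundTrip(r)
}

// simulate builds both clients, advances a fake clock past the 1-hour TTL,
// and returns the HTTP status each client gets from a fake GitHub.
func simulate() (staticStatus, refreshingStatus int, err error) {
	now := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	expiry := map[string]time.Time{} // tokens the fake server knows about
	mint := func() (string, time.Time, error) {
		tok := fmt.Sprintf("tok-%d", len(expiry)+1)
		expiry[tok] = now.Add(time.Hour)
		return tok, expiry[tok], nil
	}
	// Fake GitHub: 200 for a known, unexpired token, 401 otherwise.
	fakeGitHub := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		tok := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		status := http.StatusUnauthorized
		if exp, ok := expiry[tok]; ok && now.Before(exp) {
			status = http.StatusOK
		}
		return &http.Response{StatusCode: status, Body: http.NoBody, Request: r}, nil
	})
	first, _, _ := mint()
	static := &http.Client{Transport: &staticTransport{token: first, base: fakeGitHub}}
	refreshing := &http.Client{Transport: &refreshingTransport{
		mint: mint, now: func() time.Time { return now }, base: fakeGitHub,
	}}
	now = now.Add(61 * time.Minute) // idle for just over the token TTL
	get := func(c *http.Client) (int, error) {
		resp, err := c.Get("https://api.github.invalid/installation/repositories")
		if err != nil {
			return 0, err
		}
		resp.Body.Close()
		return resp.StatusCode, nil
	}
	if staticStatus, err = get(static); err != nil {
		return 0, 0, err
	}
	refreshingStatus, err = get(refreshing)
	return staticStatus, refreshingStatus, err
}

func main() {
	fmt.Println(simulate()) // prints "401 200 <nil>"
}
```

If the admin/listing clients behave like `staticTransport` here, that would match the reported symptoms: the first request after an idle hour already fails, and only rebuilding the client (pod restart or connection edit) picks up a fresh token.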
The installation token captured when the process starts (or when the connection is last edited) is cached in memory and never refreshed, so after the 1-hour GitHub installation token TTL expires, all admin API calls fail silently, likely because GitHub returns 401 Bad credentials and the listing code path interprets the empty/errored response as "no orgs available" without surfacing an error.

This would explain every observed characteristic:

- Why it is silent: the error is swallowed at the listing layer instead of being logged (unlike the GraphQL path in #8788, which panics loudly).
- Why collection still works: the REST path uses the client correctly refreshed by #8746.
- Why a pod restart fixes it temporarily: the in-memory frozen token is replaced by a freshly minted one when the process starts.
- Why re-editing the connection fixes it temporarily: it likely forces a new connection object to be built, including a new client with a fresh token.
- Why other environments with the same version may not see it: pods that are restarted more frequently (rolling deploys, aggressive liveness probes, autoscaling) never reach the 1-hour threshold in steady state.

### Version

v1.0.3-beta10

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
