GitHub user dosubot[bot] edited a comment on the discussion: Transient 502s & Incremental Cursor Gaps: Scaling github_graphql for 300+ Repos
Hi @zaiddialpad! I'm [Dosu](https://go.dosu.dev/dosubot) and I'm helping the incubator-devlake team.

Great questions. I dug into the codebase to give you concrete answers on all three.

**1. Retry Logic: Yes, it exists, but with caveats.**

The `GraphqlAsyncClient` *does* retry HTTP-level errors including 502 and 503 [[1]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/graphql_async_client.go#L158-L189). It uses a **fixed backoff** (not exponential) controlled by two env vars:

| Variable | Default | What it does |
|---|---|---|
| `API_RETRY` | `3` | Max retry attempts per GraphQL request |
| `API_TIMEOUT` | `30s` | Wait duration *between* retries (for the GraphQL client specifically) |

So if you're seeing 502s kill tasks after retries, you may want to bump `API_RETRY` higher (e.g., `5` or `7`) given your scale. Note that GraphQL-level `DataError` responses (schema/validation errors) are *not* retried; only transport-level failures are.

A closely related issue was reported in [#8614](https://github.com/apache/incubator-devlake/issues/8614) [[2]](https://github.com/apache/incubator-devlake/issues/8614) where 502s and stream cancellations were caused by GitHub's server-side response size limits on large repos. The fix (merged in v1.0.3-beta8) introduced `GITHUB_JOB_GRAPHQL_PAGE_SIZE` to reduce page sizes. If you haven't tuned this, it may help reduce the frequency of 502s.

**⚠️ Heads-up on two active bugs that compound this:**

- The GraphQL client's background rate-limit goroutine calls `panic(err)` on *any* error (including a transient 502 during rate-limit polling), which **crashes the entire DevLake process** [[3]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/graphql_async_client.go#L127-L131).
  This is tracked in [#8788](https://github.com/apache/incubator-devlake/issues/8788) [[4]](https://github.com/apache/incubator-devlake/issues/8788).
- If you're authenticating via a GitHub App, the GraphQL client uses a `StaticTokenSource` that **freezes the token at task start**. After ~1 hour when it expires, all GraphQL requests fail with 401 and trigger the panic above. The REST client was already fixed for this, but the GraphQL client was not [[4]](https://github.com/apache/incubator-devlake/issues/8788).

Both are addressed in the open [PR #8791](https://github.com/apache/incubator-devlake/pull/8791) [[5]](https://github.com/apache/incubator-devlake/pull/8791), which is not yet merged. Given your GitHub App auth + 340 repos (likely >1 hour syncs), this is almost certainly hitting you.

**2. Cursor Commitment: Strict all-or-nothing; your concern about gaps is partially valid.**

The `CollectorStateManager.Close()` that persists `LatestSuccessStart` is **only called when all nested collectors succeed** [[6]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/api_collector_stateful.go#L94-L104) [[7]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/collector_state_manager.go#L108-L119). If a subtask fails at any point, `Close()` is never reached, so the cursor stays at its previous position. On the next run, collection restarts from the same `since` timestamp.

Here's the nuance on data gaps:

- **Raw data** collected before the failure **persists** in the `_raw_github_graphql_*` tables (no rollback) [[8]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/graphql_collector.go#L296-L308).
- On the next incremental run, the raw table is **not** cleared, so the re-run will produce **duplicate raw rows** for data already collected in the failed run.
- The GraphQL pagination cursor (`EndCursor`) is **entirely in-memory**: if a subtask fails on page 47 of 100, the next run restarts from page 1.
- **No data should be silently skipped**, because the `since` timestamp doesn't advance on failure. The risk is duplicates, not gaps.

The extractor/converter pipeline should handle deduplication at the domain layer via primary keys, though it's worth verifying this for your specific data models.

**3. Recommended Workarounds at Scale**

- **Increase retries**: Set `API_RETRY=5` or higher and consider increasing `API_TIMEOUT` (the inter-retry wait) to give GitHub more breathing room.
- **Reduce page sizes**: Set `GITHUB_JOB_GRAPHQL_PAGE_SIZE` to a smaller value (e.g., 20 or 30) to reduce the chance of GitHub returning 502 on large responses.
- **Watch for PR #8791**: This is critical for your setup because it fixes the panic crash and the GitHub App token expiry issue for the GraphQL client. You may want to comment on it to signal urgency.
- **Consider splitting blueprints**: Breaking your 346 repos into multiple blueprints (e.g., groups of 50-80) can reduce the blast radius of a single failure and keep individual sync times under the 1-hour token expiry window.

Hope this helps make the pipelines more resilient. The retry mechanism is there but the defaults are conservative for a 340+ repo deployment, and the two bugs in #8788 are likely compounding your failures significantly.
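To make the "duplicates, not gaps" point from section 2 concrete, here is a toy model of the all-or-nothing cursor commit. It is only a sketch and not DevLake code: `SyncState`, `run_sync`, and the `fail_at` parameter are hypothetical names invented for illustration, and the upsert-by-`id` step stands in for the assumption that the domain layer deduplicates on primary keys.

```python
from dataclasses import dataclass, field


@dataclass
class SyncState:
    """Hypothetical stand-in for the persisted collector state."""
    since: int = 0                                  # last committed `since` timestamp
    raw_rows: list = field(default_factory=list)    # append-only raw table, never rolled back
    domain: dict = field(default_factory=dict)      # domain table keyed by primary key


def run_sync(state, upstream, run_start, fail_at=None):
    """Collect every record updated at or after state.since.

    Returns True on success. On a simulated failure the rows collected so
    far stay in raw_rows, but state.since is left untouched.
    """
    pending = [r for r in upstream if r["updated_at"] >= state.since]
    for index, record in enumerate(pending):
        if fail_at is not None and index == fail_at:
            return False  # e.g. a 502 that survived all retries; no commit
        state.raw_rows.append(record)
    # Assumed dedup step: upsert by primary key collapses duplicate raw rows.
    for record in state.raw_rows:
        state.domain[record["id"]] = record
    state.since = run_start  # the cursor advances only after full success
    return True


upstream = [{"id": i, "updated_at": i * 10} for i in range(1, 6)]
state = SyncState()

# First run fails after three records: raw rows persist, cursor does not move.
first = run_sync(state, upstream, run_start=100, fail_at=3)
print(first, state.since, len(state.raw_rows), len(state.domain))   # False 0 3 0

# Second run restarts from the same `since`: 3 + 5 raw rows, 5 domain rows.
second = run_sync(state, upstream, run_start=200)
print(second, state.since, len(state.raw_rows), len(state.domain))  # True 200 8 5
```

In this model the failed run leaves three raw rows behind, the retry re-collects all five records, and the raw table ends up with eight rows while the domain table holds exactly five. Nothing is skipped; the cost is the extra raw rows.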
GitHub link: https://github.com/apache/incubator-devlake/discussions/8821#discussioncomment-16398435
