GitHub user zaiddialpad created a discussion: Collect Pull requests Phase 2 triggers persistent 502s
## Environment
- **DevLake version**: v1.0.3-beta9 (Helm chart)
- **Deployment**: Kubernetes (GKE)
- **Plugin**: `github_graphql`
- **Affected repo**: Large monorepo (~20K+ total PRs, ~1,200 currently OPEN)
## Problem Summary
The "Collect Pull Requests" subtask in the `github_graphql` plugin has been
failing consistently for our largest repository since late March 2026. The
failures are caused by **two separate issues** that compound each other:
1. **Phase 2 of PR collection triggers GitHub 502 Bad Gateway errors** when
batching OPEN PRs in groups of 100
2. **A panic in `updateRateRemaining`** crashes the entire pod when GitHub
returns a 401 during the retry loop
## Detailed Description
### Phase 2 behavior
Looking at `pr_collector.go`, the "Collect Pull Requests" subtask operates in
two phases:
- **Phase 1**: Paginated search for new/updated PRs since the last cursor —
this completes fine every time.
- **Phase 2**: Queries the DB for ALL PRs with `state = 'OPEN'`, then refetches
them from GitHub in batches of 100 using individual `pullRequest(number:
$number)` GraphQL queries.
For our repo with ~1,200 OPEN PRs, Phase 2 generates 12+ batches of 100 PRs
each. These compound GraphQL queries (each requesting 100 individual PRs by
number in a single request) are heavy enough to consistently trigger **502 Bad
Gateway** responses from GitHub's GraphQL API.
The retry logic (`retry #0` through `retry #9`, 120s apart) does not help
because the 502s are not transient for these heavy queries — they fail on every
attempt. After exhausting retries on one batch, it moves to the next batch,
which also fails.
### Observed log pattern
```
[Collect Pull Requests] ended api collection without error   # Phase 1 done
[Collect Pull Requests] start graphql collection             # Phase 2 starts
retry #0 graphql calling after 120s — 502 Bad Gateway        # Batch 1 fails
retry #1 graphql calling after 120s — 502 Bad Gateway
...
retry #9 graphql calling after 120s — 502 Bad Gateway        # All retries exhausted
[Collect Pull Requests] finished records: 1(not exactly)     # Moves to batch 2
retry #0 graphql calling after 120s — 502 Bad Gateway        # Batch 2 also fails
...
```
This pattern repeats for every batch. The task runs for hours exhausting
retries across all batches before finally failing.
### Panic crash (separate but compounding issue)
During the retry loops, the `updateRateRemaining` function calls GitHub's
`/rate_limit` REST endpoint to check remaining quota. If GitHub returns a
non-200 response (e.g., 401 Unauthorized due to a transient token issue during
heavy load), the code **panics** instead of handling the error gracefully:
```
panic: non-200 OK status code: 401 Unauthorized body: "Bad credentials"
goroutine 3660 [running]:
github.com/apache/incubator-devlake/helpers/pluginhelper/api.(*GraphqlAsyncClient).updateRateRemaining.func1()
/app/helpers/pluginhelper/api/graphql_async_client.go:129 +0x166
```
This crashes the pod (our lake pod is now at **42 restarts**). After the pod
restarts, the task is retried from scratch but hits the same 502 pattern again.
### Why this is getting worse over time
The number of OPEN PRs in the DB only grows:
- Phase 2 refetches all OPEN PRs and updates their state from GitHub
- But new PRs are created faster than old ones are closed/merged
- We went from ~1,055 OPEN PRs in late March to ~1,209 now
- More OPEN PRs → more Phase 2 batches → more chances to hit 502s → lower
success rate
We attempted to trim stale OPEN PRs in the DB (setting `state='CLOSED'` for PRs
not updated since Jan 1, 2026), which temporarily reduced the count to 516.
However, **the next successful run re-collected those PRs from GitHub and set
them back to OPEN**, undoing the fix entirely.
### Success rate history
| Date | Pipeline | Collect PRs | Notes |
|---|---|---|---|
| Mar 29 | P53 | ✅ Success | Last success (1,055 OPEN PRs) |
| Mar 31 | P55 | ❌ Failed | 502s on Phase 2 |
| Apr 1 | P56 | ❌ Failed | 502s on Phase 2 |
| Apr 3 | P57 | ❌ Failed | 502s on Phase 2 |
| Apr 6 | P58 (retry) | ✅ Success | Lucky timing — GitHub API was stable |
| Apr 7 | P59 | ❌ Failed | 502s + panic crash from 401 |
## Suggested Fixes
### 1. Reduce `InputStep` for Phase 2 (high impact, easy fix)
In `pr_collector.go`, the Phase 2 `InputStep` is hardcoded to 100:
```go
err = apiCollector.InitGraphQLCollector(api.GraphqlCollectorArgs{
	GraphqlClient: data.GraphqlClient,
	Input:         iterator,
	InputStep:     100, // <-- This creates very heavy compound queries
	...
})
```
Reducing this to 10-20 would create lighter GraphQL queries that are far less
likely to trigger 502s. For our repo, this would mean 60-120 smaller batches
instead of 12 heavy ones. Each individual request would be 5-10x lighter on
GitHub's backend.
Could this be made configurable per-connection or per-scope?
### 2. Fix the panic in `updateRateRemaining` (critical stability fix)
In `graphql_async_client.go:129`, the panic should be replaced with proper
error handling:
```go
// Current (panics):
if err != nil {
	panic(err)
}

// Suggested (graceful):
if err != nil {
	logger.Error(err, "failed to update rate remaining, will retry")
	return
}
```
This panic crashes the entire pod, affecting all running tasks — not just the
one that triggered it.
### 3. Consider making Phase 2 optional or configurable
For repos with very large numbers of OPEN PRs, Phase 2 is the bottleneck. Phase
1 already catches any OPEN PR that gets updated (since updated PRs have a newer
`updatedAt` that Phase 1's cursor picks up). Phase 2 only adds value for PRs
whose state changes without any other activity — which is relatively rare.
An option to disable Phase 2 per-scope, or to cap the number of OPEN PRs
refetched, would help large repos significantly.
## Reproduction
1. Add a GitHub repo with 500+ OPEN PRs to a `github_graphql` connection
2. Run the blueprint — Phase 1 will succeed, Phase 2 will hit 502s
3. During the retry loop, if a transient 401 occurs on `/rate_limit`, the pod
crashes
## Related
- The `updateRateRemaining` panic also exists on the `main` branch (not just
v1.0.3-beta9)
- This may be related to other reports of GraphQL 502 errors on large repos
GitHub link: https://github.com/apache/incubator-devlake/discussions/8824