GitHub user zaiddialpad created a discussion: Collect Pull Requests Phase 2 triggers persistent 502s

## Environment

- **DevLake version**: v1.0.3-beta9 (Helm chart)
- **Deployment**: Kubernetes (GKE)
- **Plugin**: `github_graphql`
- **Affected repo**: Large monorepo (20K+ total PRs, ~1,200 currently OPEN)

## Problem Summary

The "Collect Pull Requests" subtask in the `github_graphql` plugin has been 
failing consistently for our largest repository since late March 2026. The 
failures are caused by **two separate issues** that compound each other:

1. **Phase 2 of PR collection triggers GitHub 502 Bad Gateway errors** when 
batching OPEN PRs in groups of 100
2. **A panic in `updateRateRemaining`** crashes the entire pod when GitHub 
returns a 401 during the retry loop

## Detailed Description

### Phase 2 behavior

Looking at `pr_collector.go`, the "Collect Pull Requests" subtask operates in 
two phases:

- **Phase 1**: Paginated search for new/updated PRs since the last cursor — 
this completes fine every time.
- **Phase 2**: Queries the DB for ALL PRs with `state = 'OPEN'`, then refetches 
them from GitHub in batches of 100, where each batch is a single GraphQL request 
containing 100 individual `pullRequest(number: $number)` lookups.

For our repo with ~1,200 OPEN PRs, Phase 2 generates 12+ batches of 100 PRs 
each. These compound GraphQL queries (each requesting 100 individual PRs by 
number in a single request) are heavy enough to consistently trigger **502 Bad 
Gateway** responses from GitHub's GraphQL API.
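
For illustration only, here is a minimal Go sketch of how such a compound query can 
be composed (this is not the plugin's actual code; the repository name, PR numbers, 
and field selection are made up). Each PR number becomes its own aliased 
`pullRequest` field, so a single HTTP request asks GitHub to resolve up to 100 
lookups at once:

```go
package main

import (
    "fmt"
    "strings"
)

// buildBatchQuery assembles one GraphQL document that fetches many PRs by
// number in a single request, roughly the shape of a Phase 2 batch.
func buildBatchQuery(owner, name string, numbers []int) string {
    var b strings.Builder
    b.WriteString("query {\n")
    b.WriteString(fmt.Sprintf("  repository(owner: %q, name: %q) {\n", owner, name))
    for i, n := range numbers {
        // Each alias is an independent pullRequest lookup that GitHub's
        // backend has to resolve inside the same request.
        b.WriteString(fmt.Sprintf("    pr%d: pullRequest(number: %d) { number state updatedAt }\n", i, n))
    }
    b.WriteString("  }\n}")
    return b.String()
}

func main() {
    fmt.Println(buildBatchQuery("acme", "monorepo", []int{101, 102, 103}))
}
```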

The retry logic (`retry #0` through `retry #9`, 120s apart) does not help 
because the 502s are not transient for these heavy queries — they fail on every 
attempt. After exhausting retries on one batch, it moves to the next batch, 
which also fails.

### Observed log pattern

```
[Collect Pull Requests] ended api collection without error         # Phase 1 done
[Collect Pull Requests] start graphql collection                   # Phase 2 starts
retry #0 graphql calling after 120s — 502 Bad Gateway              # Batch 1 fails
retry #1 graphql calling after 120s — 502 Bad Gateway
...
retry #9 graphql calling after 120s — 502 Bad Gateway              # All retries exhausted
[Collect Pull Requests] finished records: 1(not exactly)           # Moves to batch 2
retry #0 graphql calling after 120s — 502 Bad Gateway              # Batch 2 also fails
...
```

This pattern repeats for every batch. The task runs for hours exhausting 
retries across all batches before finally failing.

### Panic crash (separate but compounding issue)

During the retry loops, the `updateRateRemaining` function calls GitHub's 
`/rate_limit` REST endpoint to check remaining quota. If GitHub returns a 
non-200 response (e.g., 401 Unauthorized due to a transient token issue during 
heavy load), the code **panics** instead of handling the error gracefully:

```
panic: non-200 OK status code: 401 Unauthorized body: "Bad credentials"

goroutine 3660 [running]:
github.com/apache/incubator-devlake/helpers/pluginhelper/api.(*GraphqlAsyncClient).updateRateRemaining.func1()
    /app/helpers/pluginhelper/api/graphql_async_client.go:129 +0x166
```

This crashes the pod (our lake pod is now at **42 restarts**). After the pod 
restarts, the task is retried from scratch but hits the same 502 pattern again.

### Why this is getting worse over time

The number of OPEN PRs in the DB only grows:
- Phase 2 refetches all OPEN PRs and updates their state from GitHub
- But new PRs are created faster than old ones are closed/merged
- We went from ~1,055 OPEN PRs in late March to ~1,209 now
- More OPEN PRs → more Phase 2 batches → more chances to hit 502s → lower 
success rate

We attempted to trim stale OPEN PRs in the DB (setting `state='CLOSED'` for PRs 
not updated since Jan 1, 2026), which temporarily reduced the count to 516. 
However, **the next successful run re-collected those PRs from GitHub and set 
them back to OPEN**, undoing the fix entirely.

### Success rate history

| Date | Pipeline | Collect PRs | Notes |
|---|---|---|---|
| Mar 29 | P53 | ✅ Success | Last success (1,055 OPEN PRs) |
| Mar 31 | P55 | ❌ Failed | 502s on Phase 2 |
| Apr 1 | P56 | ❌ Failed | 502s on Phase 2 |
| Apr 3 | P57 | ❌ Failed | 502s on Phase 2 |
| Apr 6 | P58 (retry) | ✅ Success | Lucky timing — GitHub API was stable |
| Apr 7 | P59 | ❌ Failed | 502s + panic crash from 401 |

## Suggested Fixes

### 1. Reduce `InputStep` for Phase 2 (high impact, easy fix)

In `pr_collector.go`, the Phase 2 `InputStep` is hardcoded to 100:

```go
err = apiCollector.InitGraphQLCollector(api.GraphqlCollectorArgs{
    GraphqlClient: data.GraphqlClient,
    Input:         iterator,
    InputStep:     100,  // <-- This creates very heavy compound queries
    ...
})
```

Reducing this to 10-20 would create lighter GraphQL queries that are far less 
likely to trigger 502s. For our repo, this would mean 60-120 smaller batches 
instead of 12 heavy ones. Each individual request would be 5-10x lighter on 
GitHub's backend.

Could this be made configurable per-connection or per-scope?
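
As a sketch of what that could look like (the `PrBatchSize` option below is 
hypothetical and does not exist in DevLake today):

```go
// Hedged sketch only: PrBatchSize is a hypothetical per-scope option.
inputStep := 100 // current hardcoded value
if data.Options.PrBatchSize > 0 {
    inputStep = data.Options.PrBatchSize // e.g. 10-20 for repos that hit 502s
}

err = apiCollector.InitGraphQLCollector(api.GraphqlCollectorArgs{
    GraphqlClient: data.GraphqlClient,
    Input:         iterator,
    InputStep:     inputStep,
    ...
})
```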

### 2. Fix the panic in `updateRateRemaining` (critical stability fix)

In `graphql_async_client.go:129`, the panic should be replaced with proper 
error handling:

```go
// Current (panics):
if err != nil {
    panic(err)
}

// Suggested (graceful):
if err != nil {
    logger.Error(err, "failed to update rate remaining, will retry")
    return
}
```

This panic crashes the entire pod, affecting all running tasks — not just the 
one that triggered it.
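
To make the blast radius concrete, here is a self-contained toy program (not 
DevLake code) showing why an unrecovered panic in one background goroutine 
terminates every other task in the same process:

```go
package main

import (
    "fmt"
    "time"
)

func main() {
    // Stand-in for an unrelated task running in the same pod.
    go func() {
        for {
            fmt.Println("other task still working...")
            time.Sleep(200 * time.Millisecond)
        }
    }()

    // Stand-in for the rate-limit refresh goroutine: a panic here is not
    // recovered by any caller, so the whole process exits and the "other
    // task" above dies with it.
    go func() {
        time.Sleep(500 * time.Millisecond)
        panic("non-200 OK status code: 401 Unauthorized")
    }()

    time.Sleep(2 * time.Second)
    fmt.Println("never reached")
}
```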

### 3. Consider making Phase 2 optional or configurable

For repos with very large numbers of OPEN PRs, Phase 2 is the bottleneck. Phase 
1 already catches any OPEN PR that gets updated (since updated PRs have a newer 
`updatedAt` that Phase 1's cursor picks up). Phase 2 only adds value for PRs 
whose state changes without any other activity — which is relatively rare.

An option to disable Phase 2 per-scope, or to cap the number of OPEN PRs 
refetched, would help large repos significantly.
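
A rough sketch of how a per-scope switch might look in `pr_collector.go` (the 
`SkipRefetchOpenPrs` option is hypothetical and named here only for illustration):

```go
// Hedged sketch: SkipRefetchOpenPrs does not exist in DevLake today.
if data.Options.SkipRefetchOpenPrs {
    // Rely on Phase 1's updatedAt cursor alone and skip the OPEN-PR refetch.
    return nil
}
// ... existing Phase 2 setup (InitGraphQLCollector) continues here ...
```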

## Reproduction

1. Add a GitHub repo with 500+ OPEN PRs to a `github_graphql` connection
2. Run the blueprint — Phase 1 will succeed, Phase 2 will hit 502s
3. During the retry loop, if a transient 401 occurs on `/rate_limit`, the pod 
crashes

## Related

- The `updateRateRemaining` panic also exists on the `main` branch (not just 
v1.0.3-beta9)
- This may be related to other reports of GraphQL 502 errors on large repos


GitHub link: https://github.com/apache/incubator-devlake/discussions/8824
