GitHub user zaiddialpad created a discussion: Transient 502s & Incremental Cursor Gaps: Scaling github_graphql for 300+ Repos

### Environment
- DevLake Version: v1.0.3-beta9

- Plugin: github_graphql

- Source: GitHub Cloud (authenticated via GitHub App)

- Deployment Scale: 340+ repositories running on a daily sync blueprint

### The Context
First off, a huge thanks to the DevLake contributors; it's been a game-changer 
for our internal metrics. However, as we've scaled our deployment to handle 
346+ repositories, we've run into two interconnected issues that are making 
our production dashboards a bit shaky.

Currently, our daily pipelines experience 1-2 task failures every single run, 
primarily due to how transient errors are handled during the collection phase.

### Issue 1: GitHub GraphQL 502 Bad Gateway (Transient)
We are seeing intermittent `graphql query got error` failures when GitHub 
returns a 502. While these are clearly transient on GitHub's end, the 
github_graphql plugin appears to treat them as fatal.

In a large-scale environment, it’s statistically inevitable that at least one 
repo will hit a 502 during a massive sync. Because there doesn't seem to be a 
built-in retry mechanism for these specific HTTP codes, a single "hiccup" from 
GitHub kills the entire task.
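
To make the ask concrete, here is a minimal sketch of the retry behavior we have in mind. This is a generic Go wrapper, not DevLake's actual client code; the names (`doWithRetry`, `isTransient`) and the retry budget are ours:

```go
package githubgraphql

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// isTransient reports whether a status code is a gateway-level hiccup
// worth retrying rather than failing the whole task.
func isTransient(status int) bool {
	return status == http.StatusBadGateway || // 502
		status == http.StatusServiceUnavailable || // 503
		status == http.StatusGatewayTimeout // 504
}

// doWithRetry is a hypothetical wrapper (not DevLake's actual client):
// it retries transient failures with exponential backoff plus jitter.
func doWithRetry(client *http.Client, req *http.Request, maxRetries int) (*http.Response, error) {
	backoff := time.Second
	for attempt := 0; ; attempt++ {
		resp, err := client.Do(req)
		if err == nil && !isTransient(resp.StatusCode) {
			return resp, nil // success, or a non-transient status for the caller
		}
		if attempt >= maxRetries {
			if err != nil {
				return nil, fmt.Errorf("giving up after %d attempts: %w", attempt+1, err)
			}
			return resp, fmt.Errorf("giving up after %d attempts: HTTP %d", attempt+1, resp.StatusCode)
		}
		if resp != nil {
			resp.Body.Close() // release the connection before the retry
		}
		if req.GetBody != nil {
			req.Body, _ = req.GetBody() // GraphQL POSTs: rebuild the body for the retry
		}
		// Backoff with jitter so 340+ repos don't hammer GitHub in lockstep.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff/2))))
		backoff *= 2
	}
}
```

We suspect even a small fixed retry budget (say, 3 attempts) would absorb nearly all of these nightly failures.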

### Issue 2: The "Silent" Incremental Collection Cursor Gap
When a Collect subtask fails halfway through (due to the 502 mentioned above), 
we’ve observed the following behavior:

- Partial Write: Data collected before the 502 is already committed to the raw 
tables.

- Cursor Stalls: The incremental cursor is not advanced because the subtask 
failed.

- The Gap: On the next scheduled run, the incremental cursor picks up from the 
last successful run.

We are concerned that if the next run skips over the "failed window", or if 
the logic assumes the data was already handled because it exists in the raw 
tables, we end up with silent data gaps in our domain tables. For an 
organization relying on these tables for DORA metrics, missing even a few PRs 
or Issues creates a significant trust issue with the data.
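
For clarity, here is a sketch of the commit semantics we are hoping already hold. All names here (`State`, `Record`, `fetchPage`, `upsertRaw`) are hypothetical stand-ins for whatever DevLake actually uses:

```go
package collection

import "time"

// Record is a hypothetical stand-in for a raw-table row.
type Record struct{ ID string }

// State holds the incremental cursor: the high-water mark of data
// that has been *fully* collected.
type State struct{ Cursor time.Time }

// collectIncrementally sketches the commit semantics we are asking about.
// The cursor advances only after the entire window succeeds, so a
// mid-window 502 leaves it untouched and the next run replays the whole
// window from the old cursor. Replaying is harmless as long as raw
// writes are idempotent upserts keyed on the record ID.
func collectIncrementally(
	state *State,
	now time.Time,
	fetchPage func(since time.Time, page int) (records []Record, hasNext bool, err error),
	upsertRaw func([]Record) error,
	saveState func(*State) error,
) error {
	since := state.Cursor
	for page := 0; ; page++ {
		records, hasNext, err := fetchPage(since, page)
		if err != nil {
			// Crucial: do NOT advance state.Cursor here. Partial raw rows
			// may already exist, but the unchanged cursor guarantees the
			// next run re-collects the window instead of skipping it.
			return err
		}
		if err := upsertRaw(records); err != nil {
			return err
		}
		if !hasNext {
			break
		}
	}
	state.Cursor = now // committed only on total subtask success
	return saveState(state)
}
```

If instead the next run trusts the raw tables' existing rows (or any other "already handled" signal) to skip forward, that is exactly the silent gap we are worried about.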

### Questions for the Maintainers
- Retry Logic: Does the github_graphql plugin currently support (or have plans 
for) configurable retries on transient HTTP errors like 502 or 503?

- Cursor Commitment: Is the incremental cursor only committed upon total 
subtask success? If a subtask is partially successful, is there a risk of the 
next run skipping the "partially collected" data window?

- Workarounds: For those running DevLake at this scale (300+ repos), are there 
recommended configurations to mitigate these transient failures? We are 
currently using Advanced Mode blueprints to surgically exclude problematic 
subtasks, but a more automated "resiliency" setting would be ideal; a sketch 
of the kind of watchdog we have in mind follows below.
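
For reference, the "automated resiliency" we would love to avoid building ourselves looks roughly like this watchdog. Please note the REST paths (`/pipelines`, `/pipelines/{id}/rerun`), the status filter, and the response shape are assumptions on our part, not documented guarantees; they would need to be verified against the API of your DevLake version:

```go
package watchdog

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// baseURL is hypothetical; point it at your DevLake API server.
const baseURL = "http://devlake:8080"

type pipeline struct {
	ID     uint64 `json:"id"`
	Status string `json:"status"`
}

// rerunFailed lists recent pipelines and re-triggers the failed ones.
// CAUTION: the query parameter and the /rerun path are assumptions on
// our part; verify them before relying on this.
func rerunFailed() error {
	resp, err := http.Get(baseURL + "/pipelines?status=TASK_FAILED") // assumed filter
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var body struct {
		Pipelines []pipeline `json:"pipelines"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return err
	}
	for _, p := range body.Pipelines {
		url := fmt.Sprintf("%s/pipelines/%d/rerun", baseURL, p.ID) // assumed endpoint
		if _, err := http.Post(url, "application/json", nil); err != nil {
			return err
		}
	}
	return nil
}
```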

We’d love to hear if others are seeing this or if there’s a specific 
configuration in v1.0.3 we might be missing to make these pipelines more 
self-healing.

GitHub link: https://github.com/apache/incubator-devlake/discussions/8821
