This is an automated email from the ASF dual-hosted git repository.
kaxilnaik pushed a commit to branch fix-registry-incremental-provider-flag
in repository https://gitbox.apache.org/repos/asf/airflow.git
The following commit(s) were added to refs/heads/fix-registry-incremental-provider-flag by this push:
new 35b9324b885 Fix incremental builds overwriting provider connection/parameter data on S3
35b9324b885 is described below
commit 35b9324b885bd3082521dcc9a2b80479bd2efb6f
Author: Kaxil Naik <[email protected]>
AuthorDate: Tue Mar 17 17:42:06 2026 +0000
Fix incremental builds overwriting provider connection/parameter data on S3
Eleventy pagination templates emit empty fallback JSON for every provider,
even when only one provider's data was extracted. A plain `aws s3 sync`
uploads those stubs and overwrites real connection/parameter data.
Changes:
- Exclude per-provider connections.json and parameters.json from the main
S3 sync during incremental builds, then selectively upload only the
target provider's API files
- Filter connections early in extract_connections.py (before the loop)
and support space-separated multi-provider IDs
- Suppress SCARF_ANALYTICS and DO_NOT_TRACK telemetry in CI
- Document the Eleventy pagination limitation in README and AGENTS.md
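The exclude-then-selective-upload strategy described above can be sketched as a pure function — a hypothetical helper for illustration, not part of this patch. Given the optional `PROVIDER` input, it builds the extra `--exclude` flags for the main sync and the per-provider selective sync commands (the `s3://bucket/` destination is a placeholder):

```python
def build_sync_args(provider: str) -> tuple[list[str], list[list[str]]]:
    """Sketch of the S3 sync argument logic (hypothetical helper,
    not the actual workflow). `provider` is the space-separated
    PROVIDER input; empty means a full build with no exclusions."""
    excludes: list[str] = []
    selective: list[list[str]] = []
    if provider:
        # Shield every per-provider API JSON from the main sync...
        for pattern in (
            "api/providers/*/connections.json",
            "api/providers/*/parameters.json",
            "api/providers/*/*/connections.json",
            "api/providers/*/*/parameters.json",
        ):
            excludes += ["--exclude", pattern]
        # ...then upload only the target provider(s)' files afterward.
        for pid in provider.split():
            selective.append([
                "aws", "s3", "sync",
                f"registry/_site/api/providers/{pid}/",
                f"s3://bucket/api/providers/{pid}/",  # placeholder bucket
            ])
    return excludes, selective
```

A full build (`provider=""`) yields no exclusions and no selective syncs, so the behaviour of non-incremental runs is unchanged.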
---
.github/workflows/registry-build.yml | 32 +++++++++++++++++++++++++++++++-
dev/registry/extract_connections.py | 18 ++++++++++--------
registry/AGENTS.md | 5 ++++-
registry/README.md | 20 +++++++++++++++++---
4 files changed, 62 insertions(+), 13 deletions(-)
diff --git a/.github/workflows/registry-build.yml b/.github/workflows/registry-build.yml
index 94b2a4bb9d4..9fc61a9d70c 100644
--- a/.github/workflows/registry-build.yml
+++ b/.github/workflows/registry-build.yml
@@ -98,6 +98,8 @@ jobs:
needs: [build-ci-image]
runs-on: ubuntu-latest
env:
+ SCARF_ANALYTICS: "false"
+ DO_NOT_TRACK: "1"
EXISTING_REGISTRY_DIR: /tmp/existing-registry
REGISTRY_DATA_DIR: dev/registry
REGISTRY_PROVIDERS_JSON: providers.json
@@ -247,10 +249,38 @@ jobs:
- name: "Sync registry to S3"
env:
S3_BUCKET: ${{ steps.destination.outputs.bucket }}
+ PROVIDER: ${{ inputs.provider }}
run: |
+ # Incremental builds only extract connections/parameters for the
+ # target provider(s). The Eleventy site build still emits empty
+ # stub JSON for every other provider. Uploading those stubs would
+ # overwrite real data on S3, so we exclude per-provider API JSON
+ # from the main sync and selectively upload only the target
+ # provider's files afterward.
+ EXCLUDE_PROVIDER_API=()
+ if [[ -n "${PROVIDER}" ]]; then
+ EXCLUDE_PROVIDER_API=(
+ --exclude "api/providers/*/connections.json"
+ --exclude "api/providers/*/parameters.json"
+ --exclude "api/providers/*/*/connections.json"
+ --exclude "api/providers/*/*/parameters.json"
+ )
+ fi
+
aws s3 sync registry/_site/ "${S3_BUCKET}" \
--cache-control "${REGISTRY_CACHE_CONTROL}" \
- --exclude "pagefind/*"
+ --exclude "pagefind/*" \
+ "${EXCLUDE_PROVIDER_API[@]}"
+
+ # For incremental builds, sync only the updated provider's API files.
+ if [[ -n "${PROVIDER}" ]]; then
+ for pid in ${PROVIDER}; do
+ aws s3 sync "registry/_site/api/providers/${pid}/" \
+ "${S3_BUCKET}api/providers/${pid}/" \
+ --cache-control "${REGISTRY_CACHE_CONTROL}"
+ done
+ fi
+
# Pagefind generates content-hashed filenames (e.g. en_181da6f.pf_index).
# Each rebuild produces new hashes, so --delete is needed to remove stale
# index files. This is separate from the main sync which intentionally
diff --git a/dev/registry/extract_connections.py b/dev/registry/extract_connections.py
index ce81cc43a27..ebb269193c3 100644
--- a/dev/registry/extract_connections.py
+++ b/dev/registry/extract_connections.py
@@ -162,7 +162,7 @@ def main():
parser.add_argument(
"--provider",
default=None,
- help="Only output connections for this provider ID (e.g. 'amazon').",
+ help="Only output connections for these provider ID(s) (space-separated, e.g. 'amazon common-io').",
)
parser.add_argument(
"--providers-json",
@@ -212,12 +212,21 @@ def main():
total_with_custom = 0
total_with_ui = 0
+ # Parse space-separated provider filter (matches extract_metadata.py behaviour)
+ provider_filter: set[str] | None = None
+ if args.provider:
+ provider_filter = {pid.strip() for pid in args.provider.split() if pid.strip()}
+ print(f"Filtering to provider(s): {', '.join(sorted(provider_filter))}")
+
for conn_type, hook_info in sorted(hooks.items()):
if hook_info is None or not hook_info.package_name:
continue
provider_id = package_name_to_provider_id(hook_info.package_name)
+ if provider_filter and provider_id not in provider_filter:
+ continue
+
standard_fields = build_standard_fields(field_behaviours.get(conn_type))
custom_fields = build_custom_fields(form_widgets, conn_type)
@@ -244,13 +253,6 @@ def main():
print(f" {total_with_custom} have custom fields")
print(f" {total_with_ui} have UI field customisation")
- # Filter to single provider if requested
- if args.provider:
- provider_connections = {
- pid: conns for pid, conns in provider_connections.items() if pid == args.provider
- }
- print(f"Filtering output to provider: {args.provider}")
-
# Write per-provider files to versions/{pid}/{version}/connections.json
for output_dir in OUTPUT_DIRS:
if not output_dir.parent.exists():
diff --git a/registry/AGENTS.md b/registry/AGENTS.md
index fc7ce31fb0f..b6416ec0244 100644
--- a/registry/AGENTS.md
+++ b/registry/AGENTS.md
@@ -404,7 +404,10 @@ The registry is built in the `apache/airflow` repo and served at `airflow.apache
Supports two modes:
- **Full build** (no `provider` input): extracts all ~99 providers (~12 min)
- **Incremental build** (`provider=amazon`): extracts one provider (~30s), merges
- with existing data from S3 via `merge_registry_data.py`, then builds the full site
+ with existing data from S3 via `merge_registry_data.py`, then builds the full site.
+ The S3 sync step excludes per-provider `connections.json` and `parameters.json`
+ for non-target providers to avoid overwriting real data with Eleventy's empty
+ fallback stubs (Eleventy 3.x `permalink: false` does not work with pagination).
2. **S3 buckets**: `{live|staging}-docs-airflow-apache-org/registry/` (same bucket as docs, different prefix)
3. **Serving**: Apache HTTPD at `airflow.apache.org` rewrites `/registry/*` to CloudFront, which serves from S3
4. **Auto-trigger**: When `publish-docs-to-s3.yml` publishes provider docs, its
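The merge step mentioned above (`merge_registry_data.py` replacing one provider's entries while keeping all others intact) amounts to a keyed replacement. A rough sketch, under the assumption that the global JSON maps provider IDs to records (the real script's data shapes and edge-case handling may differ):

```python
# Rough sketch of the merge described above, assuming the global JSON
# maps provider IDs to records. The real merge_registry_data.py may
# differ in shape and edge-case handling (e.g. first deploy, missing files).
def merge_provider_data(existing: dict, updated: dict) -> dict:
    merged = dict(existing)  # keep all other providers intact
    merged.update(updated)   # replace the updated provider's entries
    return merged
```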
diff --git a/registry/README.md b/registry/README.md
index 2d0d354c59f..6b4df5b78eb 100644
--- a/registry/README.md
+++ b/registry/README.md
@@ -327,10 +327,16 @@ it triggers `registry-build.yml` with the provider ID. The incremental flow:
metadata and PyPI stats; `extract_parameters.py` discovers modules for only the specified provider.
3. **Merge** — `merge_registry_data.py` replaces the updated provider's entries in
- the downloaded JSON while keeping all other providers intact.
+ the downloaded JSON while keeping all other providers intact. Only global files
+ (`providers.json`, `modules.json`) are merged — per-version files like
+ `connections.json` and `parameters.json` are not downloaded from S3.
4. **Build site** — Eleventy builds all pages from the merged data; Pagefind indexes
- all records.
-5. **S3 sync** — only changed pages are uploaded (S3 sync diffs).
+ all records. Because per-version data only exists for the target provider, Eleventy
+ emits empty fallback JSON for other providers' `connections.json` and
+ `parameters.json` API endpoints (see **Known limitation** below).
+5. **S3 sync (selective)** — the main sync excludes per-provider `connections.json`
+ and `parameters.json` to avoid overwriting real data with empty stubs. A second
+ sync uploads only the target provider's API files.
6. **Publish versions** — `publish_versions.py` updates `api/providers/{id}/versions.json`.
The merge script (`dev/registry/merge_registry_data.py`) handles edge cases:
@@ -338,6 +344,14 @@ The merge script (`dev/registry/merge_registry_data.py`) handles edge cases:
- First deploy (no existing data on S3): uses the single-provider output as-is.
- Missing modules file: treated as empty.
+**Known limitation**: Eleventy's pagination templates generate API files for every
+provider in `providers.json`, even when per-version data (connections, parameters) only
+exists for the target provider. The templates emit empty fallback JSON
+(`{"connection_types":[]}`) for providers without data. The S3 sync step works around
+this with `--exclude` patterns during incremental builds. A proper template-level fix
+(skipping file generation) is tracked as a follow-up — `permalink: false` does not work
+with Eleventy 3.x pagination templates.
+
To run an incremental build locally:
```bash