This is an automated email from the ASF dual-hosted git repository.
Yicong-Huang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/texera.git
The following commit(s) were added to refs/heads/main by this push:
new b98c863127 chore(licensing): add --ignore-transitive-version flag and
use it on PR builds (#4693)
b98c863127 is described below
commit b98c863127669dbb8c4a78ce0960c78e3742db5a
Author: Jiadong Bai <[email protected]>
AuthorDate: Sat May 2 15:40:57 2026 -0700
chore(licensing): add --ignore-transitive-version flag and use it on PR
builds (#4693)
### What changes were proposed in this PR?
This PR relaxes the per-PR license-binary check so transitive-only
version bumps no longer block unrelated PRs, while still enforcing the
parts of the check that need legal review. Sub-task implementation for
#4691; broader context in #4688.
**Script (`bin/licensing/check_binary_deps.py`).**
- New flag `--ignore-transitive-version`. Without it, behavior is
unchanged (exact match).
- Added direct-dependency loaders per ecosystem, reading the primary
requirement files:
- `python` → `amber/requirements.txt` and `amber/operator-requirements.txt`
(PEP 503 canonical names)
- `npm` → `frontend/package.json` (`dependencies` + `devDependencies` +
`peerDependencies` + `optionalDependencies`)
- `agent-npm` → `agent-service/package.json`
- `jar` → every `*.sbt` and `Dependencies.scala` in the repo
- Refactored the diff to surface four classes instead of just two:
- `added` (new package not claimed) — **always fails**
- `stale` (claimed but no longer bundled) — **always fails**
- `drift_direct` (claimed direct dep, version changed) — **always
fails** (a version bump can carry a license change)
- `drift_transitive` (claimed transitive dep, version changed) — **fails
by default; informational with `--ignore-transitive-version`**
- For jars, the script bridges sbt-native-packager's
`<groupId>.<artifactId>-<version>.jar` naming and SBT's bare artifactId
by matching the trailing artifact segment after the last `.`, with
Scala-version suffix stripping for `%%`/`%%%` libs.
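
The jar-matching rule in the last bullet can be illustrated with a minimal sketch. It assumes packaged names of the form `<groupId>.<artifactId>` (version already split off) and a hypothetical `declared` set standing in for the artifactIds collected from the SBT files. `is_direct_jar` here is a simplified stand-in for the script's helper, not a copy of it.

```python
import re

# Scala binary-version suffix such as `_2.13` that SBT appends to %% artifacts.
SCALA_SUFFIX = re.compile(r"_\d+(?:\.\d+)+$")


def is_direct_jar(artifact: str, direct_artifacts: set[str]) -> bool:
    """Return True if a packaged artifact name maps to a declared artifactId."""
    # Drop a trailing Scala-version suffix before comparing.
    bare = SCALA_SUFFIX.sub("", artifact)
    if bare in direct_artifacts:
        return True
    # Fall back to the segment after the last dot:
    # `io.netty.netty-buffer` -> `netty-buffer`.
    tail = bare.rsplit(".", 1)[-1]
    return tail in direct_artifacts


# Hypothetical sample of artifactIds declared in SBT build files.
declared = {"netty-buffer", "scala-logging"}

print(is_direct_jar("io.netty.netty-buffer", declared))  # True
print(is_direct_jar("com.typesafe.scala-logging.scala-logging_2.13", declared))  # True
print(is_direct_jar("org.slf4j.slf4j-api", declared))  # False
```

With the suffix pattern shown, only dotted Scala versions such as `_2.13` are stripped; a bare `_3` suffix would be left in place.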
**CI (`.github/workflows/build.yml`).**
- All four `check_binary_deps.py` invocations (frontend npm, jar,
python, agent-npm) now pass `--ignore-transitive-version`.
The exact-match check is preserved as the default and will be reused by
the planned nightly job (sub-task #4692) so transitive drift remains
visible and actionable on `main`.
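
The four classes and the effect of the flag on the exit status can be sketched as below. This is a simplified illustration under stated assumptions, not the script itself: `classify` and `exit_code` are hypothetical names, and the sample package versions are made up.

```python
def classify(claimed: dict[str, str], bundled: dict[str, str], direct: set[str]):
    """Split a name->version comparison into the four classes."""
    added = sorted(bundled.keys() - claimed.keys())   # bundled but not claimed
    stale = sorted(claimed.keys() - bundled.keys())   # claimed but not bundled
    drift_direct, drift_transitive = [], []
    for name in sorted(claimed.keys() & bundled.keys()):
        if claimed[name] != bundled[name]:
            target = drift_direct if name in direct else drift_transitive
            target.append((name, claimed[name], bundled[name]))
    return added, stale, drift_direct, drift_transitive


def exit_code(added, stale, drift_direct, drift_transitive,
              ignore_transitive_version: bool) -> int:
    """added, stale, and direct drift always fail; transitive drift fails
    only when the flag is not set."""
    if added or stale or drift_direct:
        return 1
    if drift_transitive and not ignore_transitive_version:
        return 1
    return 0


# Made-up sample: only the transitive package `tslib` has drifted.
claimed = {"react": "18.2.0", "tslib": "2.6.0"}
bundled = {"react": "18.2.0", "tslib": "2.6.2"}
result = classify(claimed, bundled, direct={"react"})

print(exit_code(*result, ignore_transitive_version=False))  # 1
print(exit_code(*result, ignore_transitive_version=True))   # 0
```

Keeping classification separate from the pass/fail decision means the same diff can serve both the relaxed per-PR check and a strict exact-match run.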
### Any related issues, documentation, discussions?
Resolves #4691. Sibling sub-task: #4692 (nightly exact-match job).
Original report: #4688.
### How was this PR tested?
End-to-end smoke tests run locally against the real combined
LICENSE-binary built via `concat_license_binary.py` (113 python claims,
112 npm claims, 566 jar claims). For each ecosystem, the four behavior
modes were exercised by mutating a synthetic `pip-licenses.csv`,
`3rdpartylicenses.json`, or `lib/` directory.
Verified on the CI job:
https://github.com/apache/texera/actions/runs/25263088078/job/74073452760?pr=4693
### Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-7)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
---
.github/workflows/build.yml | 8 +-
bin/licensing/check_binary_deps.py | 282 +++++++++++++++++++++++++++++++++++--
2 files changed, 271 insertions(+), 19 deletions(-)
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 261920711c..45569e311c 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -109,7 +109,7 @@ jobs:
run: yarn --cwd frontend run build:ci
- name: Check bundled npm packages against per-module LICENSE-binary files
if: matrix.os == 'ubuntu-latest'
- run: ./bin/licensing/check_binary_deps.py npm frontend/dist/3rdpartylicenses.json
+ run: ./bin/licensing/check_binary_deps.py --ignore-transitive-version npm frontend/dist/3rdpartylicenses.json
- name: Run frontend unit tests
run: yarn --cwd frontend run test:ci --code-coverage
- name: Upload frontend coverage to Codecov
@@ -222,7 +222,7 @@ jobs:
)
check_exit=0
- ./bin/licensing/check_binary_deps.py jar "${lib_paths[@]}" || check_exit=$?
+ ./bin/licensing/check_binary_deps.py --ignore-transitive-version jar "${lib_paths[@]}" || check_exit=$?
./bin/licensing/audit_jar_licenses.py "${lib_paths[@]}" || true
exit "$check_exit"
- name: Install dependencies
@@ -297,7 +297,7 @@ jobs:
run: pip-licenses --format=csv --ignore-packages pip-licenses prettytable wcwidth > /tmp/pip-licenses.csv
- name: Check installed Python packages against per-module LICENSE-binary files
if: matrix.python-version == '3.12'
- run: ./bin/licensing/check_binary_deps.py python /tmp/pip-licenses.csv
+ run: ./bin/licensing/check_binary_deps.py --ignore-transitive-version python /tmp/pip-licenses.csv
- name: Create iceberg catalog database
run: psql -h localhost -U postgres -f sql/iceberg_postgres_catalog.sql
env:
@@ -358,7 +358,7 @@ jobs:
bun run bin/collect-licenses.ts > dist/3rdpartylicenses.json
- name: Check bundled agent-service packages against per-module LICENSE-binary files
if: matrix.os == 'ubuntu-latest'
- run: ../bin/licensing/check_binary_deps.py agent-npm dist/3rdpartylicenses.json
+ run: ../bin/licensing/check_binary_deps.py --ignore-transitive-version agent-npm dist/3rdpartylicenses.json
- name: Install development dependencies
run: bun install --frozen-lockfile
- name: Lint with Prettier
diff --git a/bin/licensing/check_binary_deps.py b/bin/licensing/check_binary_deps.py
index 22bbe6acfd..5b55c70e16 100755
--- a/bin/licensing/check_binary_deps.py
+++ b/bin/licensing/check_binary_deps.py
@@ -24,6 +24,13 @@ Usage:
check_binary_deps.py npm <path-to-frontend-3rdpartylicenses.json>
check_binary_deps.py agent-npm <path-to-agent-service-3rdpartylicenses.json>
check_binary_deps.py python <path-to-pip-licenses.csv>
+
+Strictness:
+ Default (exact match): version drift on any package — direct or transitive —
+ fails the run.
+ --ignore-transitive-version: version drift on a *transitive* dep is
+ reported as informational; presence of every claimed library and version
+ agreement on every *direct* dep are still enforced.
"""
from __future__ import annotations
@@ -49,13 +56,25 @@ PER_MODULE_LICENSE_BINARIES: list[str] = [
"agent-service/LICENSE-binary",
]
+# Primary requirement files used to mark deps as "direct" (vs. transitive).
+PRIMARY_REQUIREMENTS = {
+ "python": ["amber/requirements.txt", "amber/operator-requirements.txt"],
+ "npm": ["frontend/package.json"],
+ "agent-npm": ["agent-service/package.json"],
+ # All SBT files in the repo are scanned for libraryDependencies entries.
+}
+
+
+def _repo_root() -> Path:
+ return Path(__file__).resolve().parent.parent.parent
+
def build_default_license_binary() -> Path:
"""Concat all per-module LICENSE-binary files into a temp file using
bin/licensing/concat_license_binary.py and return its path. Used as
the default --license-binary when the caller doesn't pass one."""
here = Path(__file__).resolve().parent
- repo_root = here.parent.parent
+ repo_root = _repo_root()
inputs = [repo_root / p for p in PER_MODULE_LICENSE_BINARIES]
missing = [p for p in inputs if not p.is_file()]
if missing:
@@ -89,6 +108,81 @@ NPM_BULLET = re.compile(r"^\s*-\s+(@?[\w@/.\-]+)@([^\s@]+)\s*$")
# ` - <name>==<version>` — pip form.
PY_BULLET = re.compile(r"^\s*-\s+([\w][\w.\-]*)==(\S+)\s*$")
+# Splits a jar basename like `netty-all-4.1.96.Final.jar` into (artifact, version).
+# The version starts at the first dash followed by a digit.
+JAR_NAME_VERSION = re.compile(r"^(.+?)-(\d[^/]*)\.jar$")
+
+# SBT libraryDependencies syntax: "group" % "artifact" % "version" [...].
+# %% / %%% mark Scala / Scala.js libs whose artifact gains a `_<scalaVer>` suffix
+# at resolution time; we capture the bare artifact and reconstruct both forms.
+SBT_DEP = re.compile(
+ r'"([\w.\-]+)"\s*(%%%?|%)\s*"([\w.\-]+)"\s*%\s*"([\w.\-]+)"'
+)
+
+
+# --- direct-dep loaders ----------------------------------------------------
+
+def load_direct_python() -> set[str]:
+ """PEP 503 canonical names from amber/requirements.txt (and any other
+ file listed in PRIMARY_REQUIREMENTS['python'])."""
+ direct: set[str] = set()
+ for rel in PRIMARY_REQUIREMENTS["python"]:
+ p = _repo_root() / rel
+ if not p.is_file():
+ continue
+ for raw in p.read_text().splitlines():
+ line = raw.strip()
+ if not line or line.startswith("#") or line.startswith("-"):
+ # `-`-prefixed lines are pip flags (--extra-index-url, -r, ...).
+ continue
+ # Strip env markers, extras, and version specifiers; we only
+ # need the package name.
+ name = re.split(r"[<>=!~;\s\[]", line, maxsplit=1)[0]
+ if name:
+ direct.add(canonicalize_python_name(name))
+ return direct
+
+
+def load_direct_npm(rel_path: str) -> set[str]:
+ """Top-level dep names from a package.json's dependencies / devDependencies /
+ peerDependencies / optionalDependencies."""
+ direct: set[str] = set()
+ p = _repo_root() / rel_path
+ if not p.is_file():
+ return direct
+ data = json.loads(p.read_text())
+ for key in ("dependencies", "devDependencies", "peerDependencies", "optionalDependencies"):
+ section = data.get(key) or {}
+ direct.update(section.keys())
+ return direct
+
+
+def load_direct_jar_artifacts() -> set[str]:
+ """ArtifactIds declared in any *.sbt or project/Dependencies.scala file
+ in the repo. For Scala libs (`%%`/`%%%`) the resolved jar gains a
+ `_<scalaVer>` suffix; we add the bare artifact here and let the matcher
+ strip the suffix when comparing."""
+ repo_root = _repo_root()
+ direct: set[str] = set()
+
+ # Scan SBT build files. Walking the tree keeps this resilient to new
+ # subprojects without having to keep an explicit list in sync.
+ skip_dirs = {"node_modules", "target", ".git", ".idea", ".bsp", "dist"}
+ for path in repo_root.rglob("*"):
+ if any(part in skip_dirs for part in path.parts):
+ continue
+ if not path.is_file():
+ continue
+ if path.suffix == ".sbt" or path.name == "Dependencies.scala":
+ try:
+ text = path.read_text(errors="replace")
+ except OSError:
+ continue
+ for m in SBT_DEP.finditer(text):
+ _group, _sep, artifact, _version = m.groups()
+ direct.add(artifact)
+ return direct
+
# --- extracting claims from LICENSE-binary ---------------------------------
@@ -192,9 +286,79 @@ def collect_python(path: Path) -> set[str]:
return result
-# --- matching & reporting --------------------------------------------------
+# --- diff ------------------------------------------------------------------
+
+# Trailing Scala-version suffix that SBT appends to %% artifacts at resolve time.
+SCALA_SUFFIX = re.compile(r"_\d+(?:\.\d+)+$")
+
+
+def _index_npm(items: set[str]) -> dict[str, str]:
+ """{ 'react@18.2.0', '@scope/foo@1.0' } -> { 'react': '18.2.0', '@scope/foo': '1.0' }."""
+ out: dict[str, str] = {}
+ for entry in items:
+ # Last '@' is the version separator; '@' inside scoped names is at index 0.
+ idx = entry.rfind("@")
+ if idx <= 0:
+ continue
+ out[entry[:idx]] = entry[idx + 1:]
+ return out
+
+
+def _index_python(items: set[str]) -> dict[str, str]:
+ """{ 'numpy==2.1.0' } -> { 'numpy': '2.1.0' }."""
+ out: dict[str, str] = {}
+ for entry in items:
+ if "==" not in entry:
+ continue
+ name, _, ver = entry.partition("==")
+ out[name] = ver
+ return out
+
+
+def _index_jar(items: set[str]) -> dict[str, tuple[str, str]]:
+ """{ 'netty-all-4.1.96.Final.jar' } -> { 'netty-all': ('4.1.96.Final', '<basename>') }.
+ Falls back to keying by the full basename when version extraction fails;
+ such entries can still be flagged as added/stale but not as drift."""
+ out: dict[str, tuple[str, str]] = {}
+ for jar in items:
+ m = JAR_NAME_VERSION.match(jar)
+ if m:
+ out[m.group(1)] = (m.group(2), jar)
+ else:
+ out[jar] = ("", jar) # unparseable — version "" sentinel
+ return out
+
-def report(added: list[str], stale: list[str], label: str, kind: str) -> int:
+def _is_direct_jar(artifact: str, direct_artifacts: set[str]) -> bool:
+ """sbt-native-packager's default JavaAppPackaging names dist jars
+ `<groupId>.<artifactId>-<version>.jar`, so the artifactId we extract from
+ the basename is `<groupId>.<artifactId>` (e.g. `io.netty.netty-buffer`).
+ SBT's libraryDependencies record only the bare artifactId, so we match
+ on the segment after the last `.`. Scala libs (`%%`) get a `_<scalaVer>`
+ suffix appended at resolve time which we strip first."""
+ bare = SCALA_SUFFIX.sub("", artifact)
+ if bare in direct_artifacts:
+ return True
+ if "." in bare:
+ tail = bare.rsplit(".", 1)[1]
+ # Re-strip in case the Scala suffix was after the dot.
+ tail = SCALA_SUFFIX.sub("", tail)
+ if tail in direct_artifacts:
+ return True
+ return False
+
+
+# --- reporting -------------------------------------------------------------
+
+def report(
+ added: list[str],
+ stale: list[str],
+ drift_direct: list[tuple[str, str, str]],
+ drift_transitive: list[tuple[str, str, str]],
+ label: str,
+ kind: str,
+ ignore_transitive_version: bool,
+) -> int:
rc = 0
if added:
print(f"NEW {label} not claimed by LICENSE-binary:")
@@ -220,11 +384,88 @@ def report(added: list[str], stale: list[str], label: str, kind: str) -> int:
print()
rc = 1
+ if drift_direct:
+ print(f"DRIFT (direct) {label} — claimed version differs from bundled:")
+ for name, claimed_v, real_v in sorted(drift_direct):
+ print(f" ~ {name}: LICENSE-binary={claimed_v} bundled={real_v}")
+ print()
+ print("ACTION REQUIRED")
+ print(f" Update LICENSE-binary to match the bundled version. Direct deps")
+ print(f" always block CI — a version bump may carry license changes.")
+ print()
+ rc = 1
+
+ if drift_transitive:
+ if ignore_transitive_version:
+ print(f"DRIFT (transitive, informational) {label}:")
+ for name, claimed_v, real_v in sorted(drift_transitive):
+ print(f" ~ {name}: LICENSE-binary={claimed_v} bundled={real_v}")
+ print(f" (--ignore-transitive-version is set; nightly exact-match")
+ print(f" check on main is responsible for refreshing these.)")
+ print()
+ else:
+ print(f"DRIFT (transitive) {label} — claimed version differs from bundled:")
+ for name, claimed_v, real_v in sorted(drift_transitive):
+ print(f" ~ {name}: LICENSE-binary={claimed_v} bundled={real_v}")
+ print()
+ print("ACTION REQUIRED")
+ print(f" Update LICENSE-binary to match the bundled version, or rerun")
+ print(f" with --ignore-transitive-version to treat transitive drift as")
+ print(f" informational.")
+ print()
+ rc = 1
+
return rc
# --- main ------------------------------------------------------------------
+def diff_simple(
+ claim_idx: dict[str, str],
+ real_idx: dict[str, str],
+ direct_names: set[str],
+) -> tuple[list[str], list[str], list[tuple[str, str, str]], list[tuple[str, str, str]]]:
+ """Diff claims vs reality keyed by name. Same shape for npm/python."""
+ added = sorted(real_idx.keys() - claim_idx.keys())
+ stale = sorted(claim_idx.keys() - real_idx.keys())
+ drift_direct: list[tuple[str, str, str]] = []
+ drift_transitive: list[tuple[str, str, str]] = []
+ for name in sorted(claim_idx.keys() & real_idx.keys()):
+ c, r = claim_idx[name], real_idx[name]
+ if c != r:
+ entry = (name, c, r)
+ (drift_direct if name in direct_names else drift_transitive).append(entry)
+ return added, stale, drift_direct, drift_transitive
+
+
+def diff_jars(
+ claim_idx: dict[str, tuple[str, str]],
+ real_idx: dict[str, tuple[str, str]],
+ direct_artifacts: set[str],
+) -> tuple[list[str], list[str], list[tuple[str, str, str]], list[tuple[str, str, str]]]:
+ """Like diff_simple but the index value is (version, jar_basename) so
+ added/stale can be reported using the full basename users will see in
+ LICENSE-binary."""
+ added: list[str] = []
+ stale: list[str] = []
+ drift_direct: list[tuple[str, str, str]] = []
+ drift_transitive: list[tuple[str, str, str]] = []
+ for artifact in sorted(real_idx.keys() - claim_idx.keys()):
+ added.append(real_idx[artifact][1])
+ for artifact in sorted(claim_idx.keys() - real_idx.keys()):
+ stale.append(claim_idx[artifact][1])
+ for artifact in sorted(claim_idx.keys() & real_idx.keys()):
+ cv, _cname = claim_idx[artifact]
+ rv, _rname = real_idx[artifact]
+ if cv != rv:
+ entry = (artifact, cv, rv)
+ if _is_direct_jar(artifact, direct_artifacts):
+ drift_direct.append(entry)
+ else:
+ drift_transitive.append(entry)
+ return added, stale, drift_direct, drift_transitive
+
+
def main() -> int:
ap = argparse.ArgumentParser()
ap.add_argument("kind", choices=["jar", "npm", "agent-npm", "python"])
@@ -238,6 +479,17 @@ def main() -> int:
"per-module files (see PER_MODULE_LICENSE_BINARIES)."
),
)
+ ap.add_argument(
+ "--ignore-transitive-version",
+ action="store_true",
+ help=(
+ "Treat version drift on transitive deps as informational instead "
+ "of failing. Direct deps (declared in primary requirement files: "
+ "build.sbt / package.json / requirements.txt) still block on any "
+ "drift, and any added or removed package still blocks regardless "
+ "of direct/transitive."
+ ),
+ )
args = ap.parse_args()
if args.license_binary is None:
@@ -251,9 +503,9 @@ def main() -> int:
if args.kind == "jar":
claimed = parse_prose(lb, "jar")
reality = collect_jars(args.inputs)
- added = sorted(reality - claimed)
- stale = sorted(claimed - reality)
- rc = report(added, stale, "JVM jars", "jar")
+ direct_artifacts = load_direct_jar_artifacts()
+ added, stale, dd, dt = diff_jars(_index_jar(claimed), _index_jar(reality), direct_artifacts)
+ rc = report(added, stale, dd, dt, "JVM jars", "jar", args.ignore_transitive_version)
if rc == 0:
print(f"OK: {len(reality)} JVM jars match LICENSE-binary.")
return rc
@@ -261,9 +513,9 @@ def main() -> int:
if args.kind == "npm":
claimed = parse_prose(lb, "npm")
reality = collect_npm(Path(args.inputs[0]))
- added = sorted(reality - claimed)
- stale = sorted(claimed - reality)
- rc = report(added, stale, "npm packages", "npm")
+ direct = load_direct_npm("frontend/package.json")
+ added, stale, dd, dt = diff_simple(_index_npm(claimed), _index_npm(reality), direct)
+ rc = report(added, stale, dd, dt, "npm packages", "npm", args.ignore_transitive_version)
if rc == 0:
print(f"OK: {len(reality)} npm packages match LICENSE-binary.")
return rc
@@ -271,9 +523,9 @@ def main() -> int:
if args.kind == "agent-npm":
claimed = parse_prose(lb, "agent-npm")
reality = collect_npm(Path(args.inputs[0]))
- added = sorted(reality - claimed)
- stale = sorted(claimed - reality)
- rc = report(added, stale, "agent-service npm packages", "agent-npm")
+ direct = load_direct_npm("agent-service/package.json")
+ added, stale, dd, dt = diff_simple(_index_npm(claimed), _index_npm(reality), direct)
+ rc = report(added, stale, dd, dt, "agent-service npm packages", "agent-npm", args.ignore_transitive_version)
if rc == 0:
print(f"OK: {len(reality)} agent-service npm packages match LICENSE-binary.")
return rc
@@ -281,9 +533,9 @@ def main() -> int:
if args.kind == "python":
claimed = parse_prose(lb, "python")
reality = collect_python(Path(args.inputs[0]))
- added = sorted(reality - claimed)
- stale = sorted(claimed - reality)
- rc = report(added, stale, "Python packages", "python")
+ direct = load_direct_python()
+ added, stale, dd, dt = diff_simple(_index_python(claimed), _index_python(reality), direct)
+ rc = report(added, stale, dd, dt, "Python packages", "python", args.ignore_transitive_version)
if rc == 0:
print(f"OK: {len(reality)} Python packages match LICENSE-binary.")
return rc