This is an automated email from the ASF dual-hosted git repository.

Yicong-Huang pushed a commit to branch release/v1.1.0-incubating
in repository https://gitbox.apache.org/repos/asf/texera.git


The following commit(s) were added to refs/heads/release/v1.1.0-incubating by this push:
     new 25dfc9e7b5 chore(licensing): add --ignore-transitive-version flag and use it on PR builds (#4693)
25dfc9e7b5 is described below

commit 25dfc9e7b55b1e12ceada82d229f47eb2faba75b
Author: Jiadong Bai <[email protected]>
AuthorDate: Sat May 2 22:41:17 2026 +0000

    chore(licensing): add --ignore-transitive-version flag and use it on PR builds (#4693)
    
    ### What changes were proposed in this PR?
    
    This PR relaxes the per-PR license-binary check so transitive-only
    version bumps no longer block unrelated PRs, while still enforcing the
    parts of the check that need legal review. Sub-task implementation for
    #4691; broader context in #4688.
    
    **Script (`bin/licensing/check_binary_deps.py`).**
    
    - New flag `--ignore-transitive-version`. Without it, behavior is
    unchanged (exact match).
    - Added direct-dependency loaders per ecosystem, reading the primary
    requirement files:
      - `python` → `amber/requirements.txt` and
        `amber/operator-requirements.txt` (PEP 503 canonical names)
      - `npm` → `frontend/package.json` (`dependencies` + `devDependencies` +
        `peer*` + `optional*`)
      - `agent-npm` → `agent-service/package.json`
      - `jar` → every `*.sbt` and `Dependencies.scala` in the repo
    - Refactored the diff to surface four classes instead of just two:
      - `added` (new package not claimed) — **always fails**
      - `stale` (claimed but no longer bundled) — **always fails**
      - `drift_direct` (claimed direct dep, version changed) — **always
        fails** (a version bump can carry a license change)
      - `drift_transitive` (claimed transitive dep, version changed) — **fails
        by default; informational with `--ignore-transitive-version`**
    - For jars, the script bridges sbt-native-packager's
    `<groupId>.<artifactId>-<version>.jar` naming and SBT's bare artifactId
    by matching the trailing artifact segment after the last `.`, with
    Scala-version suffix stripping for `%%`/`%%%` libs.
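    
    A minimal sketch of the jar-name bridging described in the last bullet.
    The two regexes are copied from the diff below; the helper names
    `split_jar` and `is_direct`, and the sample artifact names, are
    illustrative assumptions rather than the script's actual API.

```python
import re

# Patterns as they appear in the diff; everything else here is a sketch.
JAR_NAME_VERSION = re.compile(r"^(.+?)-(\d[^/]*)\.jar$")
SCALA_SUFFIX = re.compile(r"_\d+(?:\.\d+)+$")


def split_jar(basename: str) -> tuple[str, str]:
    """Split '<name>-<version>.jar' at the first dash followed by a digit.
    Returns (basename, '') when no version can be extracted."""
    m = JAR_NAME_VERSION.match(basename)
    if not m:
        return basename, ""
    return m.group(1), m.group(2)


def is_direct(artifact: str, direct: set[str]) -> bool:
    """True if the dist-jar artifact corresponds to a declared artifactId.
    `direct` holds bare artifactIds as written in libraryDependencies."""
    # Drop a trailing Scala-version suffix such as `_2.13`.
    bare = SCALA_SUFFIX.sub("", artifact)
    if bare in direct:
        return True
    # Dist jars are named `<groupId>.<artifactId>`, so compare the segment
    # after the last dot against the declared artifactIds.
    tail = bare.rsplit(".", 1)[-1]
    return SCALA_SUFFIX.sub("", tail) in direct


name, version = split_jar("io.netty.netty-buffer-4.1.96.Final.jar")
print(name, version, is_direct(name, {"netty-buffer"}))
```

    Matching only on the trailing segment means two groups that publish the
    same artifactId are treated alike, which errs toward classifying a jar
    as direct (the stricter class).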
    
    **CI (`.github/workflows/build.yml`).**
    
    - All four `check_binary_deps.py` invocations (frontend npm, jar,
    python, agent-npm) now pass `--ignore-transitive-version`.
    
    The exact-match check is preserved as the default and will be reused by
    the planned nightly job (sub-task #4692) so transitive drift remains
    visible and actionable on `main`.
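    
    The four diff classes and the effect of the flag can be sketched as
    follows. This is a simplified illustration under stated assumptions:
    `classify` and `exit_code` are hypothetical names, the package data is
    made up, and the real script also prints a report for each class.

```python
Drift = tuple[str, str, str]  # (name, claimed version, bundled version)


def classify(
    claimed: dict[str, str],
    bundled: dict[str, str],
    direct: set[str],
) -> tuple[list[str], list[str], list[Drift], list[Drift]]:
    """Split name -> version maps into added / stale / drift classes."""
    added = sorted(bundled.keys() - claimed.keys())
    stale = sorted(claimed.keys() - bundled.keys())
    drift_direct: list[Drift] = []
    drift_transitive: list[Drift] = []
    for name in sorted(claimed.keys() & bundled.keys()):
        if claimed[name] != bundled[name]:
            entry = (name, claimed[name], bundled[name])
            (drift_direct if name in direct else drift_transitive).append(entry)
    return added, stale, drift_direct, drift_transitive


def exit_code(
    added: list[str],
    stale: list[str],
    drift_direct: list[Drift],
    drift_transitive: list[Drift],
    ignore_transitive_version: bool,
) -> int:
    """added, stale and direct drift always fail; transitive drift fails
    only when the flag is not set (the exact-match default)."""
    if added or stale or drift_direct:
        return 1
    if drift_transitive and not ignore_transitive_version:
        return 1
    return 0


# Hypothetical data: only a transitive dep (`six`) has drifted.
claimed = {"numpy": "2.1.0", "six": "1.16.0"}
bundled = {"numpy": "2.1.0", "six": "1.17.0"}
classes = classify(claimed, bundled, direct={"numpy"})
print(exit_code(*classes, ignore_transitive_version=True))   # PR builds
print(exit_code(*classes, ignore_transitive_version=False))  # exact match
```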
    
    ### Any related issues, documentation, discussions?
    
    Resolves #4691. Sibling sub-task: #4692 (nightly exact-match job).
    Original report: #4688.
    
    ### How was this PR tested?
    
    End-to-end smoke tests run locally against the real combined
    LICENSE-binary built via `concat_license_binary.py` (113 python claims,
    112 npm claims, 566 jar claims). For each ecosystem the four behavior
    modes were exercised by mutating a synthetic `pip-licenses.csv` /
    `3rdpartylicenses.json` / `lib/` directory.
    
    Verified on the CI job:
    https://github.com/apache/texera/actions/runs/25263088078/job/74073452760?pr=4693
    
    <img width="1027" height="260" alt="Screenshot 2026-05-02 at 3 19 05 PM"
    src="https://github.com/user-attachments/assets/38bef6c0-9d1b-4869-baf7-5204a93313e7"
    />
    
    ### Was this PR authored or co-authored using generative AI tooling?
    
    Generated-by: Claude Code (claude-opus-4-7)
    
    ---------
    
    Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
    
    (backported from commit b98c863127669dbb8c4a78ce0960c78e3742db5a)
---
 .github/workflows/build.yml        |   8 +-
 bin/licensing/check_binary_deps.py | 282 +++++++++++++++++++++++++++++++++++--
 2 files changed, 271 insertions(+), 19 deletions(-)

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 261920711c..45569e311c 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -109,7 +109,7 @@ jobs:
         run: yarn --cwd frontend run build:ci
       - name: Check bundled npm packages against per-module LICENSE-binary files
         if: matrix.os == 'ubuntu-latest'
-        run: ./bin/licensing/check_binary_deps.py npm frontend/dist/3rdpartylicenses.json
+        run: ./bin/licensing/check_binary_deps.py --ignore-transitive-version npm frontend/dist/3rdpartylicenses.json
       - name: Run frontend unit tests
         run: yarn --cwd frontend run test:ci --code-coverage
       - name: Upload frontend coverage to Codecov
@@ -222,7 +222,7 @@ jobs:
           )
 
           check_exit=0
-          ./bin/licensing/check_binary_deps.py jar "${lib_paths[@]}" || check_exit=$?
+          ./bin/licensing/check_binary_deps.py --ignore-transitive-version jar "${lib_paths[@]}" || check_exit=$?
           ./bin/licensing/audit_jar_licenses.py "${lib_paths[@]}" || true
           exit "$check_exit"
       - name: Install dependencies
@@ -297,7 +297,7 @@ jobs:
         run: pip-licenses --format=csv --ignore-packages pip-licenses prettytable wcwidth > /tmp/pip-licenses.csv
       - name: Check installed Python packages against per-module LICENSE-binary files
         if: matrix.python-version == '3.12'
-        run: ./bin/licensing/check_binary_deps.py python /tmp/pip-licenses.csv
+        run: ./bin/licensing/check_binary_deps.py --ignore-transitive-version python /tmp/pip-licenses.csv
       - name: Create iceberg catalog database
         run: psql -h localhost -U postgres -f sql/iceberg_postgres_catalog.sql
         env:
@@ -358,7 +358,7 @@ jobs:
           bun run bin/collect-licenses.ts > dist/3rdpartylicenses.json
       - name: Check bundled agent-service packages against per-module LICENSE-binary files
         if: matrix.os == 'ubuntu-latest'
-        run: ../bin/licensing/check_binary_deps.py agent-npm dist/3rdpartylicenses.json
+        run: ../bin/licensing/check_binary_deps.py --ignore-transitive-version agent-npm dist/3rdpartylicenses.json
       - name: Install development dependencies
         run: bun install --frozen-lockfile
       - name: Lint with Prettier
diff --git a/bin/licensing/check_binary_deps.py b/bin/licensing/check_binary_deps.py
index 22bbe6acfd..5b55c70e16 100755
--- a/bin/licensing/check_binary_deps.py
+++ b/bin/licensing/check_binary_deps.py
@@ -24,6 +24,13 @@ Usage:
   check_binary_deps.py npm       <path-to-frontend-3rdpartylicenses.json>
   check_binary_deps.py agent-npm <path-to-agent-service-3rdpartylicenses.json>
   check_binary_deps.py python    <path-to-pip-licenses.csv>
+
+Strictness:
+  Default (exact match): version drift on any package — direct or transitive —
+  fails the run.
+  --ignore-transitive-version: version drift on a *transitive* dep is
+  reported as informational; presence of every claimed library and version
+  agreement on every *direct* dep are still enforced.
 """
 from __future__ import annotations
 
@@ -49,13 +56,25 @@ PER_MODULE_LICENSE_BINARIES: list[str] = [
     "agent-service/LICENSE-binary",
 ]
 
+# Primary requirement files used to mark deps as "direct" (vs. transitive).
+PRIMARY_REQUIREMENTS = {
+    "python":    ["amber/requirements.txt", "amber/operator-requirements.txt"],
+    "npm":       ["frontend/package.json"],
+    "agent-npm": ["agent-service/package.json"],
+    # All SBT files in the repo are scanned for libraryDependencies entries.
+}
+
+
+def _repo_root() -> Path:
+    return Path(__file__).resolve().parent.parent.parent
+
 
 def build_default_license_binary() -> Path:
     """Concat all per-module LICENSE-binary files into a temp file using
     bin/licensing/concat_license_binary.py and return its path. Used as
     the default --license-binary when the caller doesn't pass one."""
     here = Path(__file__).resolve().parent
-    repo_root = here.parent.parent
+    repo_root = _repo_root()
     inputs = [repo_root / p for p in PER_MODULE_LICENSE_BINARIES]
     missing = [p for p in inputs if not p.is_file()]
     if missing:
@@ -89,6 +108,81 @@ NPM_BULLET = re.compile(r"^\s*-\s+(@?[\w@/.\-]+)@([^\s@]+)\s*$")
 # `  - <name>==<version>` — pip form.
 PY_BULLET  = re.compile(r"^\s*-\s+([\w][\w.\-]*)==(\S+)\s*$")
 
+# Splits a jar basename like `netty-all-4.1.96.Final.jar` into (artifact, version).
+# The version starts at the first dash followed by a digit.
+JAR_NAME_VERSION = re.compile(r"^(.+?)-(\d[^/]*)\.jar$")
+
+# SBT libraryDependencies syntax: "group" % "artifact" % "version" [...].
+# %% / %%% mark Scala / Scala.js libs whose artifact gains a `_<scalaVer>` suffix
+# at resolution time; we capture the bare artifact and reconstruct both forms.
+SBT_DEP = re.compile(
+    r'"([\w.\-]+)"\s*(%%%?|%)\s*"([\w.\-]+)"\s*%\s*"([\w.\-]+)"'
+)
+
+
+# --- direct-dep loaders ----------------------------------------------------
+
+def load_direct_python() -> set[str]:
+    """PEP 503 canonical names from amber/requirements.txt (and any other
+    file listed in PRIMARY_REQUIREMENTS['python'])."""
+    direct: set[str] = set()
+    for rel in PRIMARY_REQUIREMENTS["python"]:
+        p = _repo_root() / rel
+        if not p.is_file():
+            continue
+        for raw in p.read_text().splitlines():
+            line = raw.strip()
+            if not line or line.startswith("#") or line.startswith("-"):
+                # `-`-prefixed lines are pip flags (--extra-index-url, -r, ...).
+                continue
+            # Strip env markers, extras, and version specifiers; we only
+            # need the package name.
+            name = re.split(r"[<>=!~;\s\[]", line, maxsplit=1)[0]
+            if name:
+                direct.add(canonicalize_python_name(name))
+    return direct
+
+
+def load_direct_npm(rel_path: str) -> set[str]:
+    """Top-level dep names from a package.json's dependencies / devDependencies /
+    peerDependencies / optionalDependencies."""
+    direct: set[str] = set()
+    p = _repo_root() / rel_path
+    if not p.is_file():
+        return direct
+    data = json.loads(p.read_text())
+    for key in ("dependencies", "devDependencies", "peerDependencies", "optionalDependencies"):
+        section = data.get(key) or {}
+        direct.update(section.keys())
+    return direct
+
+
+def load_direct_jar_artifacts() -> set[str]:
+    """ArtifactIds declared in any *.sbt or project/Dependencies.scala file
+    in the repo. For Scala libs (`%%`/`%%%`) the resolved jar gains a
+    `_<scalaVer>` suffix; we add the bare artifact here and let the matcher
+    strip the suffix when comparing."""
+    repo_root = _repo_root()
+    direct: set[str] = set()
+
+    # Scan SBT build files. Walking the tree keeps this resilient to new
+    # subprojects without having to keep an explicit list in sync.
+    skip_dirs = {"node_modules", "target", ".git", ".idea", ".bsp", "dist"}
+    for path in repo_root.rglob("*"):
+        if any(part in skip_dirs for part in path.parts):
+            continue
+        if not path.is_file():
+            continue
+        if path.suffix == ".sbt" or path.name == "Dependencies.scala":
+            try:
+                text = path.read_text(errors="replace")
+            except OSError:
+                continue
+            for m in SBT_DEP.finditer(text):
+                _group, _sep, artifact, _version = m.groups()
+                direct.add(artifact)
+    return direct
+
 
 # --- extracting claims from LICENSE-binary ---------------------------------
 
@@ -192,9 +286,79 @@ def collect_python(path: Path) -> set[str]:
     return result
 
 
-# --- matching & reporting --------------------------------------------------
+# --- diff ------------------------------------------------------------------
+
+# Trailing Scala-version suffix that SBT appends to %% artifacts at resolve time.
+SCALA_SUFFIX = re.compile(r"_\d+(?:\.\d+)+$")
+
+
+def _index_npm(items: set[str]) -> dict[str, str]:
+    """{ 'react@18.2.0', '@scope/foo@1.0' } -> { 'react': '18.2.0', '@scope/foo': '1.0' }."""
+    out: dict[str, str] = {}
+    for entry in items:
+        # Last '@' is the version separator; '@' inside scoped names is at index 0.
+        idx = entry.rfind("@")
+        if idx <= 0:
+            continue
+        out[entry[:idx]] = entry[idx + 1:]
+    return out
+
+
+def _index_python(items: set[str]) -> dict[str, str]:
+    """{ 'numpy==2.1.0' } -> { 'numpy': '2.1.0' }."""
+    out: dict[str, str] = {}
+    for entry in items:
+        if "==" not in entry:
+            continue
+        name, _, ver = entry.partition("==")
+        out[name] = ver
+    return out
+
+
+def _index_jar(items: set[str]) -> dict[str, tuple[str, str]]:
+    """{ 'netty-all-4.1.96.Final.jar' } -> { 'netty-all': ('4.1.96.Final', '<basename>') }.
+    Falls back to keying by the full basename when version extraction fails;
+    such entries can still be flagged as added/stale but not as drift."""
+    out: dict[str, tuple[str, str]] = {}
+    for jar in items:
+        m = JAR_NAME_VERSION.match(jar)
+        if m:
+            out[m.group(1)] = (m.group(2), jar)
+        else:
+            out[jar] = ("", jar)  # unparseable — version "" sentinel
+    return out
+
 
-def report(added: list[str], stale: list[str], label: str, kind: str) -> int:
+def _is_direct_jar(artifact: str, direct_artifacts: set[str]) -> bool:
+    """sbt-native-packager's default JavaAppPackaging names dist jars
+    `<groupId>.<artifactId>-<version>.jar`, so the artifactId we extract from
+    the basename is `<groupId>.<artifactId>` (e.g. `io.netty.netty-buffer`).
+    SBT's libraryDependencies record only the bare artifactId, so we match
+    on the segment after the last `.`. Scala libs (`%%`) get a `_<scalaVer>`
+    suffix appended at resolve time which we strip first."""
+    bare = SCALA_SUFFIX.sub("", artifact)
+    if bare in direct_artifacts:
+        return True
+    if "." in bare:
+        tail = bare.rsplit(".", 1)[1]
+        # Re-strip in case the Scala suffix was after the dot.
+        tail = SCALA_SUFFIX.sub("", tail)
+        if tail in direct_artifacts:
+            return True
+    return False
+
+
+# --- reporting -------------------------------------------------------------
+
+def report(
+    added: list[str],
+    stale: list[str],
+    drift_direct: list[tuple[str, str, str]],
+    drift_transitive: list[tuple[str, str, str]],
+    label: str,
+    kind: str,
+    ignore_transitive_version: bool,
+) -> int:
     rc = 0
     if added:
         print(f"NEW {label} not claimed by LICENSE-binary:")
@@ -220,11 +384,88 @@ def report(added: list[str], stale: list[str], label: str, kind: str) -> int:
         print()
         rc = 1
 
+    if drift_direct:
+        print(f"DRIFT (direct) {label} — claimed version differs from bundled:")
+        for name, claimed_v, real_v in sorted(drift_direct):
+            print(f"  ~ {name}: LICENSE-binary={claimed_v}  bundled={real_v}")
+        print()
+        print("ACTION REQUIRED")
+        print(f"  Update LICENSE-binary to match the bundled version. Direct deps")
+        print(f"  always block CI — a version bump may carry license changes.")
+        print()
+        rc = 1
+
+    if drift_transitive:
+        if ignore_transitive_version:
+            print(f"DRIFT (transitive, informational) {label}:")
+            for name, claimed_v, real_v in sorted(drift_transitive):
+                print(f"  ~ {name}: LICENSE-binary={claimed_v}  bundled={real_v}")
+            print(f"  (--ignore-transitive-version is set; nightly exact-match")
+            print(f"   check on main is responsible for refreshing these.)")
+            print()
+        else:
+            print(f"DRIFT (transitive) {label} — claimed version differs from bundled:")
+            for name, claimed_v, real_v in sorted(drift_transitive):
+                print(f"  ~ {name}: LICENSE-binary={claimed_v}  bundled={real_v}")
+            print()
+            print("ACTION REQUIRED")
+            print(f"  Update LICENSE-binary to match the bundled version, or rerun")
+            print(f"  with --ignore-transitive-version to treat transitive drift as")
+            print(f"  informational.")
+            print()
+            rc = 1
+
     return rc
 
 
 # --- main ------------------------------------------------------------------
 
+def diff_simple(
+    claim_idx: dict[str, str],
+    real_idx: dict[str, str],
+    direct_names: set[str],
+) -> tuple[list[str], list[str], list[tuple[str, str, str]], list[tuple[str, str, str]]]:
+    """Diff claims vs reality keyed by name. Same shape for npm/python."""
+    added = sorted(real_idx.keys() - claim_idx.keys())
+    stale = sorted(claim_idx.keys() - real_idx.keys())
+    drift_direct: list[tuple[str, str, str]] = []
+    drift_transitive: list[tuple[str, str, str]] = []
+    for name in sorted(claim_idx.keys() & real_idx.keys()):
+        c, r = claim_idx[name], real_idx[name]
+        if c != r:
+            entry = (name, c, r)
+            (drift_direct if name in direct_names else drift_transitive).append(entry)
+    return added, stale, drift_direct, drift_transitive
+
+
+def diff_jars(
+    claim_idx: dict[str, tuple[str, str]],
+    real_idx: dict[str, tuple[str, str]],
+    direct_artifacts: set[str],
+) -> tuple[list[str], list[str], list[tuple[str, str, str]], list[tuple[str, str, str]]]:
+    """Like diff_simple but the index value is (version, jar_basename) so
+    added/stale can be reported using the full basename users will see in
+    LICENSE-binary."""
+    added: list[str] = []
+    stale: list[str] = []
+    drift_direct: list[tuple[str, str, str]] = []
+    drift_transitive: list[tuple[str, str, str]] = []
+    for artifact in sorted(real_idx.keys() - claim_idx.keys()):
+        added.append(real_idx[artifact][1])
+    for artifact in sorted(claim_idx.keys() - real_idx.keys()):
+        stale.append(claim_idx[artifact][1])
+    for artifact in sorted(claim_idx.keys() & real_idx.keys()):
+        cv, _cname = claim_idx[artifact]
+        rv, _rname = real_idx[artifact]
+        if cv != rv:
+            entry = (artifact, cv, rv)
+            if _is_direct_jar(artifact, direct_artifacts):
+                drift_direct.append(entry)
+            else:
+                drift_transitive.append(entry)
+    return added, stale, drift_direct, drift_transitive
+
+
 def main() -> int:
     ap = argparse.ArgumentParser()
     ap.add_argument("kind", choices=["jar", "npm", "agent-npm", "python"])
@@ -238,6 +479,17 @@ def main() -> int:
             "per-module files (see PER_MODULE_LICENSE_BINARIES)."
         ),
     )
+    ap.add_argument(
+        "--ignore-transitive-version",
+        action="store_true",
+        help=(
+            "Treat version drift on transitive deps as informational instead "
+            "of failing. Direct deps (declared in primary requirement files: "
+            "build.sbt / package.json / requirements.txt) still block on any "
+            "drift, and any added or removed package still blocks regardless "
+            "of direct/transitive."
+        ),
+    )
     args = ap.parse_args()
 
     if args.license_binary is None:
@@ -251,9 +503,9 @@ def main() -> int:
     if args.kind == "jar":
         claimed = parse_prose(lb, "jar")
         reality = collect_jars(args.inputs)
-        added = sorted(reality - claimed)
-        stale = sorted(claimed - reality)
-        rc = report(added, stale, "JVM jars", "jar")
+        direct_artifacts = load_direct_jar_artifacts()
+        added, stale, dd, dt = diff_jars(_index_jar(claimed), _index_jar(reality), direct_artifacts)
+        rc = report(added, stale, dd, dt, "JVM jars", "jar", args.ignore_transitive_version)
         if rc == 0:
             print(f"OK: {len(reality)} JVM jars match LICENSE-binary.")
         return rc
@@ -261,9 +513,9 @@ def main() -> int:
     if args.kind == "npm":
         claimed = parse_prose(lb, "npm")
         reality = collect_npm(Path(args.inputs[0]))
-        added = sorted(reality - claimed)
-        stale = sorted(claimed - reality)
-        rc = report(added, stale, "npm packages", "npm")
+        direct = load_direct_npm("frontend/package.json")
+        added, stale, dd, dt = diff_simple(_index_npm(claimed), _index_npm(reality), direct)
+        rc = report(added, stale, dd, dt, "npm packages", "npm", args.ignore_transitive_version)
         if rc == 0:
             print(f"OK: {len(reality)} npm packages match LICENSE-binary.")
         return rc
@@ -271,9 +523,9 @@ def main() -> int:
     if args.kind == "agent-npm":
         claimed = parse_prose(lb, "agent-npm")
         reality = collect_npm(Path(args.inputs[0]))
-        added = sorted(reality - claimed)
-        stale = sorted(claimed - reality)
-        rc = report(added, stale, "agent-service npm packages", "agent-npm")
+        direct = load_direct_npm("agent-service/package.json")
+        added, stale, dd, dt = diff_simple(_index_npm(claimed), _index_npm(reality), direct)
+        rc = report(added, stale, dd, dt, "agent-service npm packages", "agent-npm", args.ignore_transitive_version)
         if rc == 0:
             print(f"OK: {len(reality)} agent-service npm packages match LICENSE-binary.")
         return rc
@@ -281,9 +533,9 @@ def main() -> int:
     if args.kind == "python":
         claimed = parse_prose(lb, "python")
         reality = collect_python(Path(args.inputs[0]))
-        added = sorted(reality - claimed)
-        stale = sorted(claimed - reality)
-        rc = report(added, stale, "Python packages", "python")
+        direct = load_direct_python()
+        added, stale, dd, dt = diff_simple(_index_python(claimed), _index_python(reality), direct)
+        rc = report(added, stale, dd, dt, "Python packages", "python", args.ignore_transitive_version)
         if rc == 0:
             print(f"OK: {len(reality)} Python packages match LICENSE-binary.")
         return rc
