bobbai00 opened a new pull request, #4675:
URL: https://github.com/apache/texera/pull/4675

   ### What changes were proposed in this PR?
   
   > **Stacked on top of #4668.** This PR's diff against `main` will reduce 
to a single commit (the auto-generation work) once #4668 is merged. Until then, 
this PR shows all of #4668's commits plus the auto-generation commit.
   
   Replaces the hand-curated per-module `NOTICE-binary` files introduced in 
#4668 with output from a new generator that extracts attribution from each 
module's bundled jars.
   
   **New script**, `bin/licensing/generate_notice_binary.py`:
   - Walks each module's `lib/` dir, opens every `*.jar` (skips 
`org.apache.texera.*`), extracts every `META-INF/NOTICE` (or root-level 
`NOTICE`) file.
   - Dedupes by SHA-1 of normalized content; jars sharing a NOTICE collapse 
into one block.
   - Each block: an 80-dash separator, a project heading derived from a 
hand-curated `PROJECT_NAMES` table (longest-prefix match, e.g. 
`org.apache.hadoop.` → `Apache Hadoop`), another separator, a "Bundled jars" 
listing, then the verbatim upstream NOTICE.
   - Sorted by jar-count desc; hash tiebreaker for stable order.
   - Normalizes CRLF→LF so committed and regenerated outputs match 
byte-for-byte through git.
   - Optional `--extras <file>` appends a verbatim block (used for non-jar 
attributions like aiohttp + Matplotlib).
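
The extract, normalize, dedupe, and sort steps listed above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names, the in-memory `jars` mapping (the real generator walks a `lib/` directory), the exact-name matching of NOTICE entries, and the assumption that first-party jars are recognized by an `org.apache.texera.` file-name prefix are all hypothetical.

```python
import hashlib
import io
import zipfile
from collections import defaultdict

# Assumed exact entry names; the real script may match more variants.
NOTICE_PATHS = ("META-INF/NOTICE", "NOTICE")


def normalize(text: str) -> str:
    """Convert CRLF (and stray CR) line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def read_notices(jar_bytes: bytes) -> list[str]:
    """Return the normalized text of every NOTICE entry found in one jar."""
    found = []
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        names = set(jar.namelist())
        for path in NOTICE_PATHS:
            if path in names:
                raw = jar.read(path).decode("utf-8", errors="replace")
                found.append(normalize(raw))
    return found


def collect_blocks(jars: dict[str, bytes]) -> list[tuple[str, list[str], str]]:
    """Group jars by NOTICE content; return (sha1, jar names, text) tuples."""
    jars_by_hash = defaultdict(set)
    text_by_hash = {}
    for jar_name, data in jars.items():
        if jar_name.startswith("org.apache.texera."):
            continue  # assumed naming: first-party jars are skipped
        for text in read_notices(data):
            digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
            jars_by_hash[digest].add(jar_name)
            text_by_hash[digest] = text
    # Most widely shared NOTICE first; the hash breaks ties deterministically.
    order = sorted(jars_by_hash, key=lambda h: (-len(jars_by_hash[h]), h))
    return [(h, sorted(jars_by_hash[h]), text_by_hash[h]) for h in order]
```

Hashing after normalization is what lets two jars whose NOTICE files differ only in line endings collapse into one block.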
   
   **`amber/NOTICE-binary-extras`** (new): the aiohttp + Matplotlib blocks, 
since those are Python wheels, not jars.
   
   **6 per-module `NOTICE-binary` files regenerated**, replacing the curated 
subsets. Block counts: 24 / 24 / 87 / 92 / 88 / 91 (was 18 / 18 / 25 / 26 / 26 
/ 27 in #4668). The counts are higher because dedup is by exact content rather 
than by hand-grouped upstream project, so e.g. Hadoop sub-artifacts whose 
`META-INF/NOTICE` files differ slightly across versions now show as separate 
blocks. Every distinct attribution actually shipped is preserved verbatim, 
which follows Apache-2.0 §4(d) more closely than the curated subsets did.
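
To make the count difference concrete, the sketch below compares grouping by project heading (longest-prefix lookup) with grouping by content hash. Everything in it is illustrative: the two `PROJECT_NAMES` entries are a hypothetical excerpt, the jar names and NOTICE texts are placeholders, and falling back to the jar name when no prefix matches is an assumption rather than the generator's documented behaviour.

```python
import hashlib

# Hypothetical excerpt; the real PROJECT_NAMES table lives in the generator.
PROJECT_NAMES = {
    "org.apache.": "Apache Software Foundation",
    "org.apache.hadoop.": "Apache Hadoop",
}


def project_name(jar_name: str) -> str:
    """Return the heading for the longest matching prefix (assumed fallback: jar name)."""
    matches = [p for p in PROJECT_NAMES if jar_name.startswith(p)]
    return PROJECT_NAMES[max(matches, key=len)] if matches else jar_name


def content_key(notice: str) -> str:
    """SHA-1 of the NOTICE text after CRLF -> LF normalization."""
    return hashlib.sha1(notice.replace("\r\n", "\n").encode("utf-8")).hexdigest()


# Placeholder jars and NOTICE texts, not taken from any real artifact.
NOTICES = {
    "org.apache.hadoop.hadoop-a.jar": "placeholder NOTICE text, variant 1\n",
    "org.apache.hadoop.hadoop-b.jar": "placeholder NOTICE text, variant 2\n",
    "org.apache.hadoop.hadoop-c.jar": "placeholder NOTICE text, variant 2\r\n",
}

blocks_by_project = {project_name(jar) for jar in NOTICES}
blocks_by_content = {content_key(text) for text in NOTICES.values()}
print(len(blocks_by_project), len(blocks_by_content))  # prints: 1 2
```

Grouping by project yields one block here, while grouping by content yields two, because the two variants differ in wording and only the line-ending difference is normalized away.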
   
   **CI verification** is a new step in `build.yml`'s scala job, placed after 
the existing dist-unzip + license check:
   
   ```
   for each module: regenerate NOTICE-binary against /tmp/dists/<module>-*/lib, 
diff against committed
   fail with a one-line fix-up command if drift
   ```
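
The comparison at the heart of that step might look like the sketch below. It is not the workflow's actual code: `drift_message` is a hypothetical helper, the regenerated bytes are assumed to come from running the generator against the unpacked dist, and the hint omits `--extras`, which some modules would also need.

```python
from pathlib import Path
from typing import Optional


def drift_message(committed: Path, regenerated: bytes, lib_dir: str) -> Optional[str]:
    """Return None when the committed file matches, else a one-line fix-up hint."""
    if committed.exists() and committed.read_bytes() == regenerated:
        return None
    return (
        f"{committed} is out of date; run: "
        f"./bin/licensing/generate_notice_binary.py {committed} {lib_dir}"
    )
```

A byte-for-byte comparison is enough here because the generator already normalizes line endings, so any difference reflects a real change in the bundled jars.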
   
   So the flow for future dep bumps is: bump in `build.sbt` → CI fails on 
NOTICE drift → run 
`./bin/licensing/generate_notice_binary.py <module>/NOTICE-binary <lib-dir> 
[--extras …]` → commit.
   
   ### Any related issues, documentation, discussions?
   
   Closes #4674
   Depends on #4668 (this PR's base will be retargeted, leaving a clean diff, 
once #4668 lands)
   
   ASF guidance: https://infra.apache.org/licensing-howto.html (Apache-2.0 
§4(d))
   
   ### How was this PR tested?
   
   - Generator run locally against jars extracted from 
`ghcr.io/apache/texera-*:61ce334cb` images for all 6 modules; output verified 
line-by-line against current curated NOTICE blocks.
   - CRLF→LF normalization verified: regenerated files produce byte-identical 
output to committed files (no spurious git auto-conversion drift).
   - CI step's logic exercised locally: `generate_notice_binary.py /tmp/foo 
<lib-dir> --extras …` then `diff <module>/NOTICE-binary /tmp/foo` → empty 
(clean).
   - Generator skips `org.apache.texera.*` jars (own first-party content, not 
third-party).
   
   ### Was this PR authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
