Re: [PR] OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) (opennlp)

via GitHub Tue, 23 Jun 2026 06:26:12 -0700


rzo1 commented on PR #1103:
URL: https://github.com/apache/opennlp/pull/1103#issuecomment-4779686304


   @mawiesne on reviewability: even after subtracting the 10k data file, this 
"foundation" PR still welds together two independent concepts, and I think we 
can make it noticeably easier to review by splitting it into two stacked PRs:
   
   - **1a — normalizer engine + data.** 
`CharClass`/`CodePointSet`/`UnicodeWhitespace`/`UnicodeDash`, the rungs (NFC, 
NFKC, dash, quote, digit, German umlaut, confusable skeleton…), `Dimension`, 
the non-aligned `TextNormalizer`, plus `confusables.txt` and all its 
LICENSE/NOTICE/rat-excludes bookkeeping. This is where the license review 
belongs and it's mostly mechanical per-code-point substitution — quick to read.
   - **1b — the offset/alignment layer.** `Alignment`, `AlignedText`, 
`OffsetAwareNormalizer`, `buildAligned()`, and the `*Aligned` `CharClass` 
variants with their tests. This is the conceptually hard part (binary-search 
span mapping, expansion/deletion edge cases) and carries the densest tests — it 
deserves a focused read on its own ~800 lines instead of being skimmed inside a 
15k diff. It was also added after the initial submission, so it splits out 
cleanly along the history.
   
   That keeps each PR well under ~1.5k lines of real code and lets the reviewer 
spend their attention budget where it actually matters. If re-stacking the 
branch is more churn than it's worth, a lighter alternative is to keep it as 
one PR but call out in the description "skip `confusables.txt`; the alignment 
layer is files X/Y/Z" so the review path is obvious.
   
   Curious which you'd prefer, @mawiesne.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) (opennlp)

Reply via email to