rzo1 commented on PR #1103: URL: https://github.com/apache/opennlp/pull/1103#issuecomment-4779712904
Still a huge PR. If we want an actual human-based QM review rather than a rubber stamp, this is far too big as it stands. @mawiesne on reviewability: even after subtracting the 10k data file, this "foundation" PR still welds together two independent concepts, and I think we can make it noticeably easier to review by splitting it into two stacked PRs. - **1a, normalizer engine + data.** `CharClass`/`CodePointSet`/`UnicodeWhitespace`/`UnicodeDash`, the rungs (NFC, NFKC, dash, quote, digit, German umlaut, confusable skeleton, etc.), `Dimension`, the non-aligned `TextNormalizer`, plus `confusables.txt` and all its LICENSE/NOTICE/rat-excludes bookkeeping. This is where the license review belongs and it's mostly mechanical per-code-point substitution, so it's quick to read. - **1b, the offset/alignment layer.** `Alignment`, `AlignedText`, `OffsetAwareNormalizer`, `buildAligned()`, and the `*Aligned` `CharClass` variants with their tests. This is the conceptually hard part (binary-search span mapping, expansion/deletion edge cases) and carries the densest tests, so it deserves a focused read on its own ~800 lines instead of being skimmed inside a 15k diff. It was also added after the initial submission, so it splits out cleanly along the history. That keeps each PR well under ~1.5k lines of real code and lets the reviewer spend their attention budget where it actually matters. If re-stacking the branch is more churn than it's worth, a lighter alternative is to keep it as one PR but call out in the description "skip `confusables.txt`; the alignment layer is files X/Y/Z" so the review path is obvious. Curious which you'd prefer, @mawiesne and @krickert? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
