Re: [PR] OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) (opennlp)

via GitHub Sat, 20 Jun 2026 05:42:57 -0700


krickert commented on PR #1103:
URL: https://github.com/apache/opennlp/pull/1103#issuecomment-4757873602


   @rzo1, I split it up like we discussed. OPENNLP-1850 is now four stacked PRs 
instead of the one big #1101 (closed):
   
   1. **#1103, normalization foundation** (base `main`): the dependency-free 
layer, meaning the `CharClass` engine, the normalizer rungs, `Dimension`, 
`TextNormalizer`, and confusables. Lands first, lowest risk.
   2. **#1104, UAX #29 tokenizer + Term model** (on #1103).
   3. **#1105, DL input normalization** (on #1104): the behavioral change, 
isolated for the focused review you wanted. It depends only on the foundation, 
so once #1103 lands it can be retargeted straight to `main` without waiting on 
the tokenizer.
   4. **#1106, docs** (on #1105).
   
   All your licensing points are addressed, and the fixes ride with whichever 
stack ships the data file: attribution in `NOTICE.template` so it survives 
regen, the full Unicode License V3 text in `LICENSE`, the four paths in 
`rat-excludes`, and the `ExtendedPictographic.txt` wording corrected to 
"filtered subset of emoji-data.txt," not "unmodified."
   
   Each stack builds and tests green on its own. Thanks again for steering 
this; the split is much cleaner to review.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
