Re: [PR] Opennlp 1850 whitespace utf normalizae (opennlp)

via GitHub Fri, 19 Jun 2026 11:17:09 -0700


rzo1 commented on PR #1101:
URL: https://github.com/apache/opennlp/pull/1101#issuecomment-4753845899


   I know it is currently a draft state, but here are some thoughts after an 
intermediate pass, mostly around licensing, architecture and reviewability.
   
   **Licensing / release plumbing**
   
   The bundled Unicode data files (WordBreakProperty.txt, 
ExtendedPictographic.txt, confusables.txt and the WordBreakTest.txt under test) 
are all under the Unicode license, which is ASF Category A, so the content 
itself is fine to ship. A few things still need fixing before a release build 
is happy though:
   
   - The Unicode attribution was added to the generated `NOTICE`. It needs to 
go into `src/license/NOTICE.template` instead, otherwise it gets dropped the 
next time NOTICE is regenerated (same as we did for the stopword lists and the 
spellchecker in OPENNLP-1832).
   - `rat-excludes` is not updated for the four bundled `.txt` files. The 
default build hides this, but the `apache-release` profile runs RAT with the 
excludes, and RAT will not recognize the Unicode header. Please add the four 
paths with an OPENNLP-1850 comment.
   - `LICENSE` should carry the actual Unicode license text, the way it already 
does for the bundled stopword lists. The newer Unicode files only link to 
terms_of_use.html in their header rather than embedding the text, so a URL in 
NOTICE is not enough on its own.
   - `ExtendedPictographic.txt` is described in NOTICE as an unmodified 
`emoji-data.txt`, but it is actually a filtered subset: only the 
`Extended_Pictographic` property is kept (451 lines), and the file was renamed. 
The upstream emoji-data.txt carries six properties. The wording should say it 
is derived from emoji-data.txt by extracting that one property, not 
"unmodified".
   
   **Architecture**
   
   The bigger thing I would like to align on before this freezes as public API: 
there are now several overlapping ways to normalize text living next to each 
other. We already had the `CharSequenceNormalizer` family (Aggregate, Shrink, 
Number, Url, Twitter, Emoji). This PR adds about 14 more of those, plus a 
`TextNormalizer` builder over them, plus a `TextAnalyzer`/`AnalyzedToken` 
offset model in opennlp-api, plus a separate `TermAnalyzer`/`Term`/`Dimension` 
layered model in opennlp-runtime. The same rungs (nfc, nfkc, whitespace, dash, 
case fold, accent fold) end up declared in three places that can drift. 
`TextAnalyzer` and `TermAnalyzer` also do basically the same job (tokenize, 
normalize per token, keep the source span) but with different names and 
different tokenizers. I think we should pick one token-analysis entry point and 
one place that defines the steps before this ships, otherwise unifying it later 
is a breaking change.
   
   Related: the api vs runtime split is worth a second look. The classes placed 
in opennlp-api are concrete algorithms plus Unicode tables, not contracts, and 
the only cross-module consumer is opennlp-dl (which uses CharClass and already 
depends on opennlp-runtime). The feature is currently split across both modules 
for no clear reason.
   
   The DL changes (InferenceOptions normalize flags, AbstractDL.normalizeInput, 
and the NameFinderDL/DocumentCategorizerDL span decode rework) are good, but 
note these are behavioral changes to existing components rather than purely 
additive, so they deserve their own focused review.
   
   **Reviewability**
   
   This is around 70 files and 20k+ lines covering four fairly independent 
subsystems, which makes it hard to review as one unit. Would you be open to 
splitting it into stacked PRs under an OPENNLP-1850 umbrella, roughly: (1) the 
Unicode primitives in opennlp-api, (2) the UAX 29 tokenizer plus its data files 
and the rat/license plumbing, (3) the new CharSequenceNormalizer rungs, (4) the 
Term/Dimension layered model plus confusables, (5) the DL integration, and (6) 
the docs? That would let the foundational pieces go in quickly and give the 
contested parts (the overlapping abstractions, and the DL behavior change) the 
attention they need on their own.
   
   Happy to help with the NOTICE/rat/LICENSE fixes if useful.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Opennlp 1850 whitespace utf normalizae (opennlp)

Reply via email to