rzo1 commented on PR #1101: URL: https://github.com/apache/opennlp/pull/1101#issuecomment-4753845899
I know it is currently a draft state, but here are some thoughts after an intermediate pass, mostly around licensing, architecture and reviewability. **Licensing / release plumbing** The bundled Unicode data files (WordBreakProperty.txt, ExtendedPictographic.txt, confusables.txt and the WordBreakTest.txt under test) are all under the Unicode license, which is ASF Category A, so the content itself is fine to ship. A few things still need fixing before a release build is happy though: - The Unicode attribution was added to the generated `NOTICE`. It needs to go into `src/license/NOTICE.template` instead, otherwise it gets dropped the next time NOTICE is regenerated (same as we did for the stopword lists and the spellchecker in OPENNLP-1832). - `rat-excludes` is not updated for the four bundled `.txt` files. The default build hides this, but the `apache-release` profile runs RAT with the excludes, and RAT will not recognize the Unicode header. Please add the four paths with an OPENNLP-1850 comment. - `LICENSE` should carry the actual Unicode license text, the way it already does for the bundled stopword lists. The newer Unicode files only link to terms_of_use.html in their header rather than embedding the text, so a URL in NOTICE is not enough on its own. - `ExtendedPictographic.txt` is described in NOTICE as an unmodified `emoji-data.txt`, but it is actually a filtered subset: only the `Extended_Pictographic` property is kept (451 lines), and the file was renamed. The upstream emoji-data.txt carries six properties. The wording should say it is derived from emoji-data.txt by extracting that one property, not "unmodified". **Architecture** The bigger thing I would like to align on before this freezes as public API: there are now several overlapping ways to normalize text living next to each other. We already had the `CharSequenceNormalizer` family (Aggregate, Shrink, Number, Url, Twitter, Emoji). This PR adds about 14 more of those, plus a `TextNormalizer` builder over them, plus a `TextAnalyzer`/`AnalyzedToken` offset model in opennlp-api, plus a separate `TermAnalyzer`/`Term`/`Dimension` layered model in opennlp-runtime. The same rungs (nfc, nfkc, whitespace, dash, case fold, accent fold) end up declared in three places that can drift. `TextAnalyzer` and `TermAnalyzer` also do basically the same job (tokenize, normalize per token, keep the source span) but with different names and different tokenizers. I think we should pick one token-analysis entry point and one place that defines the steps before this ships, otherwise unifying it later is a breaking change. Related: the api vs runtime split is worth a second look. The classes placed in opennlp-api are concrete algorithms plus Unicode tables, not contracts, and the only cross-module consumer is opennlp-dl (which uses CharClass and already depends on opennlp-runtime). The feature is currently split across both modules for no clear reason. The DL changes (InferenceOptions normalize flags, AbstractDL.normalizeInput, and the NameFinderDL/DocumentCategorizerDL span decode rework) are good, but note these are behavioral changes to existing components rather than purely additive, so they deserve their own focused review. **Reviewability** This is around 70 files and 20k+ lines covering four fairly independent subsystems, which makes it hard to review as one unit. Would you be open to splitting it into stacked PRs under an OPENNLP-1850 umbrella, roughly: (1) the Unicode primitives in opennlp-api, (2) the UAX 29 tokenizer plus its data files and the rat/license plumbing, (3) the new CharSequenceNormalizer rungs, (4) the Term/Dimension layered model plus confusables, (5) the DL integration, and (6) the docs? That would let the foundational pieces go in quickly and give the contested parts (the overlapping abstractions, and the DL behavior change) the attention they need on their own. Happy to help with the NOTICE/rat/LICENSE fixes if useful. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
