[PR] OPENNLP-1850: Layered Term model — Term, TermAnalyzer (2b/7) (opennlp)

via GitHub Tue, 23 Jun 2026 08:20:20 -0700


krickert opened a new pull request, #1111:
URL: https://github.com/apache/opennlp/pull/1111


   Part **2b** of the OPENNLP-1850 stack: the token-analysis layer, split out of the former tokenizer PR (#1104) at a reviewer's request.
   
   A `Term` is one token projected through the ordered `Dimension` stack (original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma). It keeps its source `Span` and every intermediate form. `TermAnalyzer` segments text with the UAX #29 `WordTokenizer` (from 2a) and applies the configured prefix of the dimension stack to each token. This PR also restores the `{@link Term}`/`{@link TermAnalyzer}` javadoc references in `Dimension`, now that those types exist.
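
To make the layering concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: `TermSketch`, its nested types, and the five-dimension subset are hypothetical names chosen for illustration. It also assumes `java.text.BreakIterator` as a stand-in for the `WordTokenizer` from 2a, and approximates case folding with `toLowerCase(Locale.ROOT)`.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustrative only: hypothetical names, not the OpenNLP API from this PR.
final class TermSketch {

    // A subset of the dimension stack, declared in application order.
    enum Dimension {
        ORIGINAL(s -> s),
        NFC(s -> Normalizer.normalize(s, Normalizer.Form.NFC)),
        NFKC(s -> Normalizer.normalize(s, Normalizer.Form.NFKC)),
        // Assumption: lower-casing approximates Unicode case folding here.
        CASE_FOLD(s -> s.toLowerCase(Locale.ROOT)),
        // Decompose, then drop combining marks.
        ACCENT_FOLD(s -> Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", ""));

        final UnaryOperator<String> step;

        Dimension(UnaryOperator<String> step) {
            this.step = step;
        }
    }

    // Half-open character range [start, end) in the source text.
    record Span(int start, int end) {}

    // One token plus the form it took after each applied dimension.
    record Term(Span span, Map<Dimension, String> forms) {
        String form(Dimension d) {
            return forms.get(d);
        }
    }

    // Segments the text and applies every dimension up to and including `last`.
    static List<Term> analyze(String text, Dimension last) {
        List<Term> terms = new ArrayList<>();
        // Assumption: BreakIterator stands in for the UAX #29 WordTokenizer.
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = text.substring(start, end);
            // Skip segments with no letters or digits (whitespace, punctuation).
            if (token.codePoints().noneMatch(Character::isLetterOrDigit)) {
                continue;
            }
            Map<Dimension, String> forms = new EnumMap<>(Dimension.class);
            String current = token;
            for (Dimension d : Dimension.values()) {
                current = d.step.apply(current); // each layer builds on the previous one
                forms.put(d, current);
                if (d == last) {
                    break; // stop at the end of the configured prefix
                }
            }
            terms.add(new Term(new Span(start, end), forms));
        }
        return terms;
    }

    public static void main(String[] args) {
        for (Term t : analyze("Caf\u00e9 M\u00fcller", Dimension.ACCENT_FOLD)) {
            System.out.println(t.span() + " " + t.forms());
        }
    }
}
```

Because each dimension consumes the output of the one before it, choosing a shorter prefix (for example, stopping at `NFC`) yields a `Term` whose map simply lacks the later forms, while the `Span` still points back at the untouched source text.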
   
   Base: `OPENNLP-1850-2a-tokenizer` (#1110). Stack: 1a → 1b → 2a → **2b (this)** → 2c → DL → docs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
