krickert opened a new pull request, #1111:
URL: https://github.com/apache/opennlp/pull/1111
Part **2b** of the OPENNLP-1850 stack: the token-analysis layer, split out
of the former tokenizer PR (#1104) at a reviewer's request.
A `Term` is one token projected through the ordered `Dimension` stack
(original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable
fold, stem, lemma), keeping its source `Span` and every intermediate form.
`TermAnalyzer` segments with the UAX #29 `WordTokenizer` (from 2a) and applies
the configured dimension prefix. This PR also restores the `{@link Term}` and
`{@link TermAnalyzer}` javadoc references in `Dimension`, now that those types
exist.
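
To illustrate the idea of projecting one token through an ordered stack while keeping its span and every intermediate form, here is a minimal, self-contained sketch. It is not the API added by this PR: `TermSketch`, `Dim`, `project`, and `analyze` are hypothetical names, only five of the ten dimensions are modelled, whitespace splitting stands in for the UAX #29 `WordTokenizer`, and `toLowerCase(Locale.ROOT)` only approximates a real case fold. It also assumes that the "dimension prefix" means the first N entries of the ordered stack.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustrative only: every name here is hypothetical, not the OpenNLP API.
final class TermSketch {

  // A subset of the dimension stack, declared in application order.
  enum Dim {
    ORIGINAL(s -> s),
    NFC(s -> Normalizer.normalize(s, Normalizer.Form.NFC)),
    NFKC(s -> Normalizer.normalize(s, Normalizer.Form.NFKC)),
    // Approximation: lower-casing is not a full Unicode case fold.
    CASE_FOLD(s -> s.toLowerCase(Locale.ROOT)),
    // Decompose, then drop combining marks.
    ACCENT_FOLD(s -> Normalizer.normalize(s, Normalizer.Form.NFD)
        .replaceAll("\\p{M}+", ""));

    final UnaryOperator<String> op;

    Dim(UnaryOperator<String> op) {
      this.op = op;
    }
  }

  // One token: its source offsets plus the form produced at each dimension.
  record Term(int start, int end, Map<Dim, String> forms) {
    String form(Dim d) {
      return forms.get(d); // null if d lies beyond the configured prefix
    }
  }

  // Applies the stack prefix ORIGINAL..upTo, feeding each output into the next step.
  static Term project(String text, int start, int end, Dim upTo) {
    Map<Dim, String> forms = new LinkedHashMap<>();
    String current = text.substring(start, end);
    for (Dim d : Dim.values()) {
      current = d.op.apply(current);
      forms.put(d, current);
      if (d == upTo) {
        break;
      }
    }
    return new Term(start, end, forms);
  }

  // Stand-in segmentation: splits on whitespace instead of UAX #29 word boundaries.
  static List<Term> analyze(String text, Dim upTo) {
    List<Term> terms = new ArrayList<>();
    int i = 0;
    while (i < text.length()) {
      if (Character.isWhitespace(text.charAt(i))) {
        i++;
        continue;
      }
      int start = i;
      while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
        i++;
      }
      terms.add(project(text, start, i, upTo));
    }
    return terms;
  }
}
```

In this sketch, `TermSketch.analyze("Café ﬁle", TermSketch.Dim.ACCENT_FOLD)` yields two terms whose offsets still point into the original string. Each term can be queried at any dimension in the prefix, so a caller can match on the accent-folded form and still highlight the original text.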
Base: `OPENNLP-1850-2a-tokenizer` (#1110).
Stack: 1a → 1b → 2a → **2b (this)** → 2c → DL → docs.