krickert opened a new pull request, #1073: URL: https://github.com/apache/opennlp/pull/1073
## What

- New `opennlp.tools.tokenize.BertTokenizer`: the full BERT tokenization pipeline (basic tokenization / normalization, then wordpiece). Lower casing and accent stripping are on by default for uncased models; cased models opt out via a constructor flag.
- Direct fixes to `WordpieceTokenizer`: per-character Unicode-aware punctuation splitting, whole-word unknown-token replacement for partially matched words (matching the reference implementation), and `tokenizePos` now throws `UnsupportedOperationException` instead of returning `null`.

## Why

See [OPENNLP-1837](https://issues.apache.org/jira/browse/OPENNLP-1837). Without basic tokenization, uncased models (including both models recommended by the opennlp-dl README) receive `[UNK]` for every capitalized or accented word. Measured embedding fidelity vs. the Python reference was cosine 0.09-0.57; with this fix it exceeds 0.999999.

## Recommendation

The opennlp-dl components (`SentenceVectorsDL`, `DocumentCategorizerDL`, `NameFinderDL`) should adopt `BertTokenizer` as their default tokenization in a follow-up, so uncased models work correctly out of the box.

## Validation

All expected token sequences in the new tests were generated with the HuggingFace `tokenizers` reference implementation. `BertTokenizer` was additionally verified byte-identical to the reference on the real bert-base-uncased vocabulary across a corpus covering capitalization, diacritics, punctuation runs, CJK, URLs and mixed whitespace (12/12 sentences).
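For readers unfamiliar with the two stages named under "What", here is a minimal, self-contained sketch of basic tokenization followed by greedy longest-match wordpiece with whole-word `[UNK]` replacement. It is not the code in this PR and does not use the OpenNLP API: the class `BertTokenizeSketch`, its method names, and the toy vocabulary are hypothetical, and the sketch leaves out parts of the reference pipeline such as CJK character splitting and control-character cleanup.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the BertTokenizer added by this PR.
class BertTokenizeSketch {

    static final String UNK = "[UNK]";

    // Basic tokenization: optional lower casing and accent stripping,
    // then a split on whitespace and on each punctuation code point.
    static List<String> basicTokenize(String text, boolean uncased) {
        String s = text;
        if (uncased) {
            // NFD separates base letters from combining marks so the marks can be dropped.
            s = Normalizer.normalize(s.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        }
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (uncased && Character.getType(cp) == Character.NON_SPACING_MARK) {
                continue; // accent stripping
            }
            if (Character.isWhitespace(cp)) {
                flush(cur, out);
            } else if (isPunctuation(cp)) {
                flush(cur, out);
                out.add(new String(Character.toChars(cp))); // each punctuation char is its own token
            } else {
                cur.appendCodePoint(cp);
            }
        }
        flush(cur, out);
        return out;
    }

    private static void flush(StringBuilder cur, List<String> out) {
        if (cur.length() > 0) {
            out.add(cur.toString());
            cur.setLength(0);
        }
    }

    // ASCII symbols plus the Unicode punctuation categories.
    static boolean isPunctuation(int cp) {
        if ((cp >= 33 && cp <= 47) || (cp >= 58 && cp <= 64)
                || (cp >= 91 && cp <= 96) || (cp >= 123 && cp <= 126)) {
            return true;
        }
        int t = Character.getType(cp);
        return t == Character.CONNECTOR_PUNCTUATION || t == Character.DASH_PUNCTUATION
                || t == Character.START_PUNCTUATION || t == Character.END_PUNCTUATION
                || t == Character.INITIAL_QUOTE_PUNCTUATION
                || t == Character.FINAL_QUOTE_PUNCTUATION
                || t == Character.OTHER_PUNCTUATION;
    }

    // Greedy longest-match wordpiece. If any remainder of the word cannot be
    // matched, the whole word becomes a single [UNK], not a partial split.
    static List<String> wordpiece(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            while (end > start) {
                String cand = (start > 0 ? "##" : "") + word.substring(start, end);
                if (vocab.contains(cand)) {
                    match = cand;
                    break;
                }
                end--;
            }
            if (match == null) {
                return List.of(UNK);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    static List<String> tokenize(String text, boolean uncased, Set<String> vocab) {
        List<String> out = new ArrayList<>();
        for (String word : basicTokenize(text, uncased)) {
            out.addAll(wordpiece(word, vocab));
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy vocabulary, made up for this example.
        Set<String> vocab = Set.of("cafe", "hello", ",", "un", "##aff", "##able");
        System.out.println(tokenize("Caf\u00e9, HELLO", true, vocab));
        System.out.println(tokenize("Caf\u00e9, HELLO", false, vocab));
    }
}
```

With an uncased toy vocabulary, running the sketch with `uncased = false` maps the capitalized and accented words to `[UNK]`, which is the failure mode described under "Why"; with `uncased = true` the same input resolves to vocabulary entries.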
