krickert opened a new pull request, #1073:
URL: https://github.com/apache/opennlp/pull/1073

   ## What
   
   - New `opennlp.tools.tokenize.BertTokenizer`: the full BERT tokenization 
pipeline (basic tokenization / normalization, then wordpiece). Lowercasing and 
accent stripping are on by default for uncased models; cased models opt out via 
a constructor flag.
   - Direct fixes to `WordpieceTokenizer`: per-character Unicode-aware 
punctuation splitting, whole-word unknown-token replacement for partially 
matched words (matching the reference implementation), and `tokenizePos` now 
throws `UnsupportedOperationException` instead of returning `null`.
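
The two stages can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: the class `BertPipelineSketch`, its method names, and the toy vocabulary are all hypothetical, and details such as CJK handling, control-character cleanup, and maximum word length are left out.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of "basic tokenization, then wordpiece"; not the PR's BertTokenizer.
class BertPipelineSketch {

    static final String UNK = "[UNK]";

    // Lowercase, decompose to NFD, and drop combining marks (accent stripping).
    static String normalize(String text) {
        String decomposed = Normalizer.normalize(text.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        decomposed.codePoints()
            .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // ASCII symbol ranges plus every Unicode punctuation category.
    static boolean isPunctuation(int cp) {
        if ((cp >= 33 && cp <= 47) || (cp >= 58 && cp <= 64)
                || (cp >= 91 && cp <= 96) || (cp >= 123 && cp <= 126)) {
            return true;
        }
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Split on whitespace; every punctuation character becomes its own token.
    static List<String> basicTokenize(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        normalize(text).codePoints().forEach(cp -> {
            boolean space = Character.isWhitespace(cp);
            if (space || isPunctuation(cp)) {
                if (word.length() > 0) {
                    out.add(word.toString());
                    word.setLength(0);
                }
                if (!space) {
                    out.add(new String(Character.toChars(cp)));
                }
            } else {
                word.appendCodePoint(cp);
            }
        });
        if (word.length() > 0) {
            out.add(word.toString());
        }
        return out;
    }

    // Greedy longest-match-first wordpiece. If any remainder cannot be matched,
    // the whole word is replaced by [UNK], not just the unmatched tail.
    static List<String> wordpiece(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            while (end > start) {
                String candidate = (start > 0 ? "##" : "") + word.substring(start, end);
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                return List.of(UNK);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    static List<String> tokenize(String text, Set<String> vocab) {
        List<String> out = new ArrayList<>();
        for (String word : basicTokenize(text)) {
            out.addAll(wordpiece(word, vocab));
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy vocabulary, made up for this example.
        Set<String> vocab = Set.of("cafe", "token", "##izer", "##s", "!", "play");
        System.out.println(tokenize("Caf\u00e9 tokenizers!!", vocab)); // [cafe, token, ##izer, ##s, !, !]
        System.out.println(tokenize("playxyz", vocab));                // [[UNK]]
    }
}
```

The second call shows the whole-word replacement: `play` matches, but because the remainder `xyz` does not, the entire word becomes `[UNK]` rather than `play` followed by `[UNK]`.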
   
   ## Why
   
   See [OPENNLP-1837](https://issues.apache.org/jira/browse/OPENNLP-1837). 
Without basic tokenization, uncased models (including both models recommended 
by the opennlp-dl README) receive `[UNK]` for every capitalized or accented 
word, because their vocabularies were built from lowercased, accent-stripped 
text. Measured against the Python reference, embedding cosine similarity was 
0.09-0.57 without basic tokenization; with this fix it exceeds 0.999999.
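
For reference, the cosine figure quoted above is the normalized dot product of two embedding vectors. Below is a minimal sketch of that measure. The class and method names are hypothetical, and this is not the measurement harness used for the numbers in this PR.

```java
// Hypothetical helper; illustrates the cosine-similarity measure only.
class CosineSketch {

    // Returns dot(a, b) / (|a| * |b|); 1.0 means the vectors point the same way.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```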
   
   ## Recommendation
   
   The opennlp-dl components (`SentenceVectorsDL`, `DocumentCategorizerDL`, 
`NameFinderDL`) should adopt `BertTokenizer` as their default tokenization in a 
follow-up, so uncased models work correctly out of the box.
   
   ## Validation
   
   All expected token sequences in the new tests were generated with the 
HuggingFace `tokenizers` reference implementation. `BertTokenizer` was 
additionally verified byte-identical to the reference on the real 
bert-base-uncased vocabulary across a corpus covering capitalization, 
diacritics, punctuation runs, CJK, URLs and mixed whitespace (12/12 sentences).

