krickert commented on PR #1073:
URL: https://github.com/apache/opennlp/pull/1073#issuecomment-4680497953

   
   Thanks for the thorough review — all points addressed in the latest commit:
   
   **1. Breaking changes:** Agreed on all three. I've added an explicit "As of 
OpenNLP 3.0.0" paragraph to the `WordpieceTokenizer` class Javadoc listing each 
behavior change (punctuation-run splitting, whole-word `[UNK]` replacement, 
`tokenizePos` throwing). Since this branch targets 3.0.0-SNAPSHOT they land in 
a major release; happy to also tag the JIRA for release notes if that's the 
convention here.
   
   **2. Dependency direction:** Done -  the shared character predicates 
(`isControl`, `isWhitespace`, `isPunctuation`, `isCjk`) and 
`isolatePunctuation` now live in a package-private `BertNormalization` helper 
used by both classes, so `WordpieceTokenizer` no longer depends on 
`BertTokenizer`.
   
   **3. `isControl`:** Good catch. I verified against HuggingFace `tokenizers` 
first (it strips Co/Cn the same way the Python reference does) and widened the 
check to all `C*` categories (Cc, Cf, Cs, Co, Cn), with a new test covering 
U+E000 (private use) and U+FDD0 (noncharacter). The full-vocabulary parity 
harness now passes 13/13, including a sentence with embedded C* characters.
   
   **Minor (lowercase/accent coupling):**  `lowerCase` intentionally mirrors 
the reference default coupling (`strip_accents` follows `do_lower_case` unless 
overridden); this is now noted in the class Javadoc. A decoupled `stripAccents` 
flag is easy to add later without breaking the constructor surface if a 
cased+accent-stripped model ever shows up.
   
   Also addressed the two Copilot notes: `maxTokenLength` is now validated 
(`IllegalArgumentException` on negative values) and the `BertTokenizer` 
constructor null-checks all three special tokens, both with tests.
   
   Follow-up JIRA for switching the `opennlp-dl` components to `BertTokenizer` 
- I'll create the JIRA ticket now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to