Re: [PR] OPENNLP-1838: Adopt BertTokenizer in opennlp-dl components (opennlp)

via GitHub Fri, 12 Jun 2026 09:16:02 -0700


Copilot commented on code in PR #1075:
URL: https://github.com/apache/opennlp/pull/1075#discussion_r3404715497



##########
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -122,6 +123,50 @@ protected WordpieceTokenizer createTokenizer(
     return new WordpieceTokenizer(vocab.keySet());
   }
 
+  /**
+   * Creates a {@link BertTokenizer} that performs the full BERT tokenization
+   * pipeline: basic tokenization (text normalization) followed by wordpiece.
+   * The special tokens are selected based on the vocabulary: if it contains
+   * RoBERTa-style tokens, those are used, otherwise the BERT defaults.
+   *
+   * @param vocab The vocabulary map.
+   * @param lowerCase {@code true} for uncased models (lower casing and accent
+   *     stripping), {@code false} for cased models.
+   * @return A configured {@link BertTokenizer}.
+   */
+  protected BertTokenizer createTokenizer(
+      final Map<String, Integer> vocab, final boolean lowerCase) {
+    if (vocab.containsKey(
+            WordpieceTokenizer.ROBERTA_CLS_TOKEN)
+        && vocab.containsKey(
+            WordpieceTokenizer.ROBERTA_SEP_TOKEN)) {
+      final String unk = vocab.containsKey(
+          WordpieceTokenizer.ROBERTA_UNK_TOKEN)
+          ? WordpieceTokenizer.ROBERTA_UNK_TOKEN
+          : WordpieceTokenizer.BERT_UNK_TOKEN;
+      return new BertTokenizer(
+          vocab.keySet(),
+          lowerCase,
+          WordpieceTokenizer.ROBERTA_CLS_TOKEN,
+          WordpieceTokenizer.ROBERTA_SEP_TOKEN,
+          unk);

Review Comment:
   `createTokenizer(vocab, lowerCase)` falls back to 
`WordpieceTokenizer.BERT_UNK_TOKEN` when the vocabulary has RoBERTa CLS/SEP 
tokens but is missing `ROBERTA_UNK_TOKEN`. If the vocabulary also doesn't 
contain `[UNK]`, this produces a tokenizer that can emit an unknown token not 
present in the vocabulary, failing later during id mapping with a less-direct 
error. Prefer validating that at least one supported UNK token is present and 
throw a clear `IllegalArgumentException` otherwise.



##########
opennlp-eval-tests/src/test/java/opennlp/dl/vectors/SentenceVectorsDLEval.java:
##########
@@ -61,6 +61,13 @@ public void generateVectorsTest() throws Exception {
       Assertions.assertEquals(0.20219636, vectors[1], 0.00001);
       Assertions.assertEquals(0.41306049, vectors[2], 0.00001);
       Assertions.assertEquals(384, vectors.length);
+
+      // The uncased model lower cases during tokenization, so a capitalized
+      // variant must produce the same vectors. Prior to BERT basic
+      // tokenization, every capitalized word was mapped to [UNK].
+      final float[] capitalized = sv.getVectors("George Washington was 
President");
+
+      Assertions.assertArrayEquals(vectors, capitalized);

Review Comment:
   `Assertions.assertArrayEquals(vectors, capitalized)` asserts exact float 
equality, which can be brittle across ONNX Runtime versions/platforms even when 
the inputs are intended to be equivalent. Use the `assertArrayEquals(float[], 
float[], float delta)` overload with a small tolerance (similar to the scalar 
assertions above).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OPENNLP-1838: Adopt BertTokenizer in opennlp-dl components (opennlp)

Reply via email to