Copilot commented on code in PR #1075:
URL: https://github.com/apache/opennlp/pull/1075#discussion_r3404715497
##########
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -122,6 +123,50 @@ protected WordpieceTokenizer createTokenizer(
return new WordpieceTokenizer(vocab.keySet());
}
+ /**
+ * Creates a {@link BertTokenizer} that performs the full BERT tokenization
+ * pipeline: basic tokenization (text normalization) followed by wordpiece.
+ * The special tokens are selected based on the vocabulary: if it contains
+ * RoBERTa-style tokens, those are used, otherwise the BERT defaults.
+ *
+ * @param vocab The vocabulary map.
+ * @param lowerCase {@code true} for uncased models (lower casing and accent
+ * stripping), {@code false} for cased models.
+ * @return A configured {@link BertTokenizer}.
+ */
+ protected BertTokenizer createTokenizer(
+ final Map<String, Integer> vocab, final boolean lowerCase) {
+ if (vocab.containsKey(
+ WordpieceTokenizer.ROBERTA_CLS_TOKEN)
+ && vocab.containsKey(
+ WordpieceTokenizer.ROBERTA_SEP_TOKEN)) {
+ final String unk = vocab.containsKey(
+ WordpieceTokenizer.ROBERTA_UNK_TOKEN)
+ ? WordpieceTokenizer.ROBERTA_UNK_TOKEN
+ : WordpieceTokenizer.BERT_UNK_TOKEN;
+ return new BertTokenizer(
+ vocab.keySet(),
+ lowerCase,
+ WordpieceTokenizer.ROBERTA_CLS_TOKEN,
+ WordpieceTokenizer.ROBERTA_SEP_TOKEN,
+ unk);
Review Comment:
`createTokenizer(vocab, lowerCase)` falls back to
`WordpieceTokenizer.BERT_UNK_TOKEN` when the vocabulary has RoBERTa CLS/SEP
tokens but is missing `ROBERTA_UNK_TOKEN`. If the vocabulary also doesn't
contain `[UNK]`, this produces a tokenizer that can emit an unknown token not
present in the vocabulary, failing later during id mapping with a less-direct
error. Prefer validating that at least one supported UNK token is present and
throw a clear `IllegalArgumentException` otherwise.
##########
opennlp-eval-tests/src/test/java/opennlp/dl/vectors/SentenceVectorsDLEval.java:
##########
@@ -61,6 +61,13 @@ public void generateVectorsTest() throws Exception {
Assertions.assertEquals(0.20219636, vectors[1], 0.00001);
Assertions.assertEquals(0.41306049, vectors[2], 0.00001);
Assertions.assertEquals(384, vectors.length);
+
+ // The uncased model lower cases during tokenization, so a capitalized
+ // variant must produce the same vectors. Prior to BERT basic
+ // tokenization, every capitalized word was mapped to [UNK].
+ final float[] capitalized = sv.getVectors("George Washington was
President");
+
+ Assertions.assertArrayEquals(vectors, capitalized);
Review Comment:
`Assertions.assertArrayEquals(vectors, capitalized)` asserts exact float
equality, which can be brittle across ONNX Runtime versions/platforms even when
the inputs are intended to be equivalent. Use the `assertArrayEquals(float[],
float[], float delta)` overload with a small tolerance (similar to the scalar
assertions above).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]