[PR] OPENNLP-1836 - Fix input encoding in SentenceVectorsDL (opennlp)

via GitHub Wed, 10 Jun 2026 04:59:44 -0700


krickert opened a new pull request, #1072:
URL: https://github.com/apache/opennlp/pull/1072


   See https://issues.apache.org/jira/browse/OPENNLP-1836
   
   SentenceVectorsDL sent an all-zero attention_mask and all-one token_type_ids 
to the ONNX model, so the encoder attended to nothing. This fixes the encoding 
to the standard single-segment BERT convention (mask=1, types=0), consistent 
with DocumentCategorizerDL, and additionally:
   
   - closes the OnnxTensor inputs and the OrtSession.Result, fixing a native memory leak
   - replaces the NPE on a vocabulary miss with a descriptive 
IllegalArgumentException
   - adds a unit test for the encoding (tokenize is now package-private static, 
no ONNX session needed)
   - updates SentenceVectorsDLEval expectations
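
   For illustration only, the corrected input convention and the vocabulary-miss check can be sketched in plain Java as below. This is not the code from the PR: the class name BertInputSketch, its helper methods, and the toy vocabulary ids are all hypothetical, and padding is assumed to be absent, so every position is a real token.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and vocabulary values are assumptions, not OpenNLP code.
class BertInputSketch {

    /** attention_mask: 1 for every real (non-padding) token, so the encoder attends to it. */
    static long[] attentionMask(int length) {
        long[] mask = new long[length];
        Arrays.fill(mask, 1L);
        return mask;
    }

    /** token_type_ids: 0 at every position for single-segment input. */
    static long[] tokenTypeIds(int length) {
        return new long[length]; // Java arrays are zero-initialized
    }

    /** input_ids: vocabulary lookup that fails with a descriptive message on a miss. */
    static long[] inputIds(List<String> tokens, Map<String, Integer> vocab) {
        long[] ids = new long[tokens.size()];
        for (int i = 0; i < ids.length; i++) {
            String token = tokens.get(i);
            Integer id = vocab.get(token);
            if (id == null) {
                // Explicit check instead of letting a null lookup surface as an NPE.
                throw new IllegalArgumentException(
                    "Token '" + token + "' at position " + i + " is not in the vocabulary");
            }
            ids[i] = id;
        }
        return ids;
    }

    public static void main(String[] args) {
        // Toy vocabulary for the example; a real model ships its own vocabulary file.
        Map<String, Integer> vocab = Map.of("[CLS]", 101, "hello", 7592, "[SEP]", 102);
        List<String> tokens = List.of("[CLS]", "hello", "[SEP]");

        System.out.println("input_ids      = " + Arrays.toString(inputIds(tokens, vocab)));
        System.out.println("attention_mask = " + Arrays.toString(attentionMask(tokens.size())));
        System.out.println("token_type_ids = " + Arrays.toString(tokenTypeIds(tokens.size())));
    }
}
```

   The three arrays have the same length and would be wrapped as tensors before being passed to the model. Keeping the encoding in static helpers like these is one way to make it testable without an ONNX session.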
   
   Eval values were verified empirically: the unfixed code reproduces the 
previously pinned values exactly against the public 
sentence-transformers/all-MiniLM-L6-v2 ONNX export, and the corrected encoding 
produces the new pinned values (dimension 384).
   
   Note: this is a behavioral fix - vectors persisted from the old encoding are 
not comparable with the corrected output and should be re-embedded.


