krickert opened a new pull request, #1072: URL: https://github.com/apache/opennlp/pull/1072
See https://issues.apache.org/jira/browse/OPENNLP-1836

SentenceVectorsDL sent an all-zero attention_mask and all-one token_type_ids to the ONNX model, so the encoder attended to nothing. This fixes the encoding to the standard single-segment BERT convention (mask=1, types=0), consistent with DocumentCategorizerDL, and additionally:

- closes the OnnxTensor inputs and OrtSession.Result (native memory leak)
- replaces the NPE on a vocabulary miss with a descriptive IllegalArgumentException
- adds a unit test for the encoding (tokenize is now package-private static, no ONNX session needed)
- updates SentenceVectorsDLEval expectations

Eval values were verified empirically: the unfixed code reproduces the previously pinned values exactly against the public sentence-transformers/all-MiniLM-L6-v2 ONNX export, and the corrected encoding produces the new pinned values (dimension 384).

Note: this is a behavioral fix. Vectors persisted from the old encoding are not comparable with the corrected output and should be re-embedded.
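For readers unfamiliar with the convention, the single-segment BERT encoding described above (mask=1, types=0) can be sketched roughly as follows. This is an illustrative sketch only, not the code from this pull request: the class `SingleSegmentEncoding`, its `encode` method, and the toy vocabulary are hypothetical, and no ONNX Runtime types are involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of OpenNLP.
final class SingleSegmentEncoding {

    /** Holds the three arrays a BERT-style encoder expects for one sequence. */
    static final class Encoded {
        final long[] inputIds;
        final long[] attentionMask;
        final long[] tokenTypeIds;

        Encoded(long[] inputIds, long[] attentionMask, long[] tokenTypeIds) {
            this.inputIds = inputIds;
            this.attentionMask = attentionMask;
            this.tokenTypeIds = tokenTypeIds;
        }
    }

    /**
     * Maps word pieces to ids and builds the mask and segment arrays.
     * The caller is assumed to have already added [CLS] and [SEP].
     */
    static Encoded encode(List<String> wordPieces, Map<String, Integer> vocab) {
        int n = wordPieces.size();
        long[] ids = new long[n];
        for (int i = 0; i < n; i++) {
            String piece = wordPieces.get(i);
            Integer id = vocab.get(piece);
            if (id == null) {
                // A descriptive error instead of an NPE from unboxing null.
                throw new IllegalArgumentException(
                        "Token not found in vocabulary: '" + piece + "'");
            }
            ids[i] = id;
        }

        // Every real (non-padding) token is attended to: mask = 1.
        long[] mask = new long[n];
        Arrays.fill(mask, 1L);

        // Single segment: all token types are 0 (the Java default for long[]).
        long[] types = new long[n];

        return new Encoded(ids, mask, types);
    }

    public static void main(String[] args) {
        // Toy vocabulary; the ids are placeholders for the example.
        Map<String, Integer> vocab = Map.of(
                "[CLS]", 101, "hello", 7592, "world", 2088, "[SEP]", 102);
        Encoded e = encode(List.of("[CLS]", "hello", "world", "[SEP]"), vocab);
        System.out.println("input_ids      = " + Arrays.toString(e.inputIds));
        System.out.println("attention_mask = " + Arrays.toString(e.attentionMask));
        System.out.println("token_type_ids = " + Arrays.toString(e.tokenTypeIds));
    }
}
```

With an all-zero mask, by contrast, every position is treated as padding, which matches the "attended to nothing" symptom described above.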
