[PR] OPENNLP-1839: Fix native memory leak and vocabulary NPE in DocumentCategorizerDL (opennlp)

via GitHub Thu, 11 Jun 2026 20:35:51 -0700


krickert opened a new pull request, #1074:
URL: https://github.com/apache/opennlp/pull/1074


   ## What
   
   - `categorize()` leaked native memory on every call: the `OnnxTensor`
     inputs and the `OrtSession.Result` were never closed. The tensors are now
     released in a `finally` block and the result is closed via
     try-with-resources (`getValue()` copies into Java arrays first, so this
     is safe).
   - A token missing from the vocabulary caused `vocab.get(...)` to auto-unbox
     `null` into an `int`, throwing an opaque `NullPointerException` that the
     broad catch in `categorize()` swallowed, returning an empty score array.
     The mapping loop is now a testable `tokenIds()` helper that throws an
     `IllegalArgumentException` naming the missing token; a missing token
     indicates that the vocabulary file does not match the model.
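
The vocabulary-miss check described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the class name `TokenIdMapper`, the `tokenIds(String[], Map<String, Integer>)` signature, and the exception message are assumptions made for illustration.

```java
import java.util.Map;

// Hypothetical stand-in for the helper described above; the real method's
// signature and message in DocumentCategorizerDL may differ.
final class TokenIdMapper {

  private TokenIdMapper() {
  }

  static long[] tokenIds(String[] tokens, Map<String, Integer> vocab) {
    final long[] ids = new long[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      // Keep the boxed Integer so a miss is detected before any unboxing;
      // assigning vocab.get(...) straight to a primitive would throw an NPE.
      final Integer id = vocab.get(tokens[i]);
      if (id == null) {
        throw new IllegalArgumentException("Token '" + tokens[i]
            + "' is not in the vocabulary; the vocabulary file may not match the model.");
      }
      ids[i] = id;
    }
    return ids;
  }
}
```

Holding the lookup result in an `Integer` and checking it for `null` is what turns the opaque `NullPointerException` into an error that names the offending token.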
   
   ## Why
   
   See [OPENNLP-1839](https://issues.apache.org/jira/browse/OPENNLP-1839). 
Long-running services calling `categorize()` repeatedly accumulate off-heap 
allocations until the process is killed. This applies the same 
resource-management pattern as the `SentenceVectorsDL` fix (OPENNLP-1836, 
#1072).
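
The resource-management pattern can be illustrated without the ONNX Runtime dependency. In the sketch below, `Handle` is a hypothetical stand-in for a native-backed object (an `OnnxTensor` or an `OrtSession.Result` in the PR), `infer` is a fake inference step, and the `OPEN` counter stands in for off-heap allocations. None of these names come from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class NativeResourceSketch {

  // Number of handles currently open; stands in for live off-heap allocations.
  static final AtomicInteger OPEN = new AtomicInteger();

  // Hypothetical stand-in for a native-backed tensor or result.
  static final class Handle implements AutoCloseable {
    private final float[] nativeData;
    private boolean closed;

    Handle(float[] data) {
      this.nativeData = data;
      OPEN.incrementAndGet();
    }

    // Returns a heap copy, so the caller may use it after close().
    float[] copyValue() {
      if (closed) {
        throw new IllegalStateException("handle already closed");
      }
      return nativeData.clone();
    }

    @Override
    public void close() {
      if (!closed) {
        closed = true;
        OPEN.decrementAndGet();
      }
    }
  }

  static float[] run(float[] a, float[] b) {
    final List<Handle> inputs = new ArrayList<>();
    try {
      inputs.add(new Handle(a));
      inputs.add(new Handle(b));
      // The result is closed by try-with-resources; its values are copied
      // to a Java array before the close happens.
      try (Handle result = infer(inputs)) {
        return result.copyValue();
      }
    } finally {
      // Inputs are released whether inference succeeded or threw.
      for (Handle input : inputs) {
        input.close();
      }
    }
  }

  // Fake inference: element-wise sum of the two inputs.
  private static Handle infer(List<Handle> inputs) {
    final float[] x = inputs.get(0).copyValue();
    final float[] y = inputs.get(1).copyValue();
    if (x.length != y.length) {
      throw new IllegalArgumentException("input lengths differ");
    }
    final float[] sum = new float[x.length];
    for (int i = 0; i < x.length; i++) {
      sum[i] = x[i] + y[i];
    }
    return new Handle(sum);
  }
}
```

After `run` returns or throws, `OPEN` is back at zero. That is the property a long-running service needs from each `categorize()` call.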
   
   ## Validation
   
   New `DocumentCategorizerDLTest` covers the token-id mapping and the 
vocabulary-miss error. All existing `opennlp-dl` tests pass.


