krickert commented on PR #1105:
URL: https://github.com/apache/opennlp/pull/1105#issuecomment-4763363290
**opennlp-dl now compile-depends on all of opennlp-runtime.**
Fixed at the root. The minimal normalization engine (`CharClass`,
`CodePointSet`, `UnicodeWhitespace`, `UnicodeDash`) moved to `opennlp-api`,
which already ships small concrete implementations (e.g. `BertTokenizer`,
`WhitespaceTokenizer`). `opennlp-dl` now compiles against `opennlp-api` plus
onnxruntime only; `opennlp-runtime` is back to a test-only dependency.
**@Deprecated on find(String[]) is too broad.**
Fixed. The `@Deprecated` is removed. `find`'s javadoc instead documents the
single case it diverges: whitespace folding is length-preserving, so `find` is
only affected when `normalizeDashes` is enabled and a non-BMP dash is present,
in which case `findInOriginal` gives the exact original mapping. Every other
caller is correct and no longer sees a warning. (This also resolves
`NameFinderDLEval` calling a deprecated method.)
**findInOriginal is only on the concrete type, not on TokenNameFinder.**
Added it as a capability interface: `OffsetMappingNameFinder extends
TokenNameFinder` in `opennlp-api` declares `findInOriginal`, and `NameFinderDL`
implements it, so an interface-typed caller (UIMA, eval harness) reaches the
offset-correct path with `finder instanceof OffsetMappingNameFinder` — a plain
type check, no reflection. I went with a separate interface rather than a
default method on `TokenNameFinder` for a concrete reason:
`TokenNameFinder.find` is documented to return *token* spans (indices into the
token array, as `NameFinderME` does), while `findInOriginal` returns
*character* offsets in the original input, so a `default findInOriginal(t) {
return find(t); }` would hand back token indices where the method promises
character offsets. The only mismatch-free way to put it directly on
`TokenNameFinder` is a default that throws `UnsupportedOperationException`,
which callers have to guard against anyway, so it buys nothing over the
capability check. (For full honesty, `N
ameFinderDL.find` itself already returns character offsets rather than token
spans, so the DL coordinate story is a pre-existing deviation from the
interface contract.) It's your interface, so if you'd still rather it live on
`TokenNameFinder` in some form, say the word and I'll switch it.
**Missing end-to-end test.**
Added a real one. `NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash`
runs the actual `dslim/bert-base-NER` ONNX model with `normalizeDashes` enabled
and a non-BMP dash (Yezidi hyphen, U+10EAD) before the entity, and asserts the
`findInOriginal` span covers "George Washington" at its true original offset
rather than the one-unit-shorter normalized offset. It exercises the full
normalize/tokenize/infer/decode/map-back path with no production test seam, and
it passes. The offset-mapping math is additionally unit-tested model-free in
`AbstractDLChunkingTest`. This also moots the "`NameFinderDLEval` still calls
deprecated `find()`" point, since `find()` is no longer deprecated.
**Nit: README/javadoc say offset-preserving "for BMP".**
Fixed. The README now states the stronger guarantee: whitespace folding
never moves offsets, and `findInOriginal` maps spans back through the
`Alignment` so they stay correct in the original input even for non-BMP dashes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]