rzo1 commented on PR #1103:
URL: https://github.com/apache/opennlp/pull/1103#issuecomment-4762676928
Thx for the PR. Here are some suggestions:
- Dimension javadoc forward-references Term/TermAnalyzer, which only arrive
in #1104. Standalone javadoc on this branch emits unresolved-reference
warnings. Either downgrade those {@link} to {@code}, or confirm the stack
always merges as a unit.
- Offset mapping isn't reachable through the builder.
normalizeMapped/OffsetMap exist only directly on CharClass, while
TextNormalizer.build() composes plain CharSequenceNormalizers and discards
offsets, so the NormalizedText/OffsetMap capability can't be reached via
TextNormalizer. Is that intentional for now?
- OffsetMap buffer growth uses Arrays.copyOf(buffer, buffer.length * 2),
which overflows to NegativeArraySizeException past ~2^30 chars instead of a
clean OOM. Overflow-aware growth would be tidier. Edge case, per-document use.
- Confusables.load() has no per-line guard, so a malformed bundled line
surfaces as ExceptionInInitializerError, whereas the user-facing
CodePointSet.fromFile reports the offending line number. Minor inconsistency. A
checksum or version assertion would also back the "unmodified upstream 17.0.0"
NOTICE claim.
- Nit: serialVersionUID = 1L on the new rungs vs random longs on the legacy
ones, and TextNormalizer.builder() returns a TextNormalizer that is itself the
mutable builder (a nested Builder would read cleaner).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]