THausherr commented on PR #2769:
URL: https://github.com/apache/tika/pull/2769#issuecomment-4258694078
Here's what copilot says after told it that adjusting the test is the wrong
priority:
Adding a globally registered parser changes the set/order of parsers that
AutoDetectParser discovers. ServiceLoader iteration order is not guaranteed,
and changes in classpath/jar ordering in CI can affect parser
selection/behavior. Even if the new parser isn’t intended for ODT, it can still
perturb overall parser discovery and embedded parsing flow, which in turn
changes the emitted SAX events and thus the extracted phone-number order.
Additionally, EncodeOCRParser currently advertises support for some
non-ocr-* image types (image/jp2, image/jpx, image/x-portable-pixmap), which
makes it “more eligible” than intended and increases the chance of unintended
participation.
Recommended fix (don’t weaken the test; fix the regression)
1. Restrict EncodeOCRParser supported types to only image/ocr-* (opt-in
via override), removing the non-ocr- image types. That keeps it from being
considered for generic image parsing and reduces collateral changes in parse
output.
2. If the intent is truly opt-in-only: consider removing the ServiceLoader
registration and requiring explicit inclusion via tika-config.xml. That
completely avoids global side effects on unrelated parsing/tests.
(I'm just posting this, I have no opinon myself on this)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]