THausherr commented on PR #2769:
URL: https://github.com/apache/tika/pull/2769#issuecomment-4258694078

   Here's what Copilot says after I told it that adjusting the test is the wrong 
priority:
   
   Adding a globally registered parser changes the set/order of parsers that 
AutoDetectParser discovers. ServiceLoader iteration order is not guaranteed, 
and changes in classpath/jar ordering in CI can affect parser 
selection/behavior. Even if the new parser isn’t intended for ODT, it can still 
perturb overall parser discovery and embedded parsing flow, which in turn 
changes the emitted SAX events and thus the extracted phone-number order.
   
   Additionally, EncodeOCRParser currently advertises support for some 
non-ocr-* image types (image/jp2, image/jpx, image/x-portable-pixmap), which 
makes it “more eligible” than intended and increases the chance of unintended 
participation.
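
To make the ordering concern concrete, here is a toy model in Java. It is not Tika's actual code: it assumes a registry that maps each media type to a parser name and lets a later registration overwrite an earlier one. `ToyParser`, `register`, and the parser name `GenericImageParser` are hypothetical names used only for illustration. Under that assumption, once two parsers advertise the same type, which one ends up handling image/jp2 depends purely on discovery order.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DiscoveryOrderDemo {

    // Hypothetical stand-in for a parser: a name plus the media types it advertises.
    public static final class ToyParser {
        final String name;
        final List<String> types;

        public ToyParser(String name, List<String> types) {
            this.name = name;
            this.types = types;
        }
    }

    // Build a type -> parser-name map. Assumed rule: a parser discovered
    // later overwrites an earlier one for any type they both advertise.
    public static Map<String, String> register(List<ToyParser> discovered) {
        Map<String, String> byType = new LinkedHashMap<>();
        for (ToyParser parser : discovered) {
            for (String type : parser.types) {
                byType.put(type, parser.name);
            }
        }
        return byType;
    }

    public static void main(String[] args) {
        ToyParser generic = new ToyParser("GenericImageParser", List.of("image/jp2"));
        // "image/ocr-png" is an assumed example of an image/ocr-* type.
        ToyParser ocr = new ToyParser("EncodeOCRParser", List.of("image/ocr-png", "image/jp2"));

        // Same two parsers, two discovery orders, two different winners.
        System.out.println(register(List.of(generic, ocr)).get("image/jp2")); // EncodeOCRParser
        System.out.println(register(List.of(ocr, generic)).get("image/jp2")); // GenericImageParser
    }
}
```

If the second parser advertised only image/ocr-* types, both orders would give the same result for image/jp2, which is the motivation for the first recommendation below.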
   
   Recommended fix (don’t weaken the test; fix the regression)
   
   1. Restrict EncodeOCRParser supported types to only image/ocr-* (opt-in 
via override), removing the non-ocr- image types. That keeps it from being 
considered for generic image parsing and reduces collateral changes in parse 
output.
   
   2. If the intent is truly opt-in-only: consider removing the ServiceLoader 
registration and requiring explicit inclusion via tika-config.xml. That 
completely avoids global side effects on unrelated parsing/tests.
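
A minimal sketch of the first recommendation, written against plain strings so it runs without Tika on the classpath. The class `OcrTypeFilter` and the method `restrictToOcrTypes` are hypothetical, and `image/ocr-png` is an assumed example of an image/ocr-* type. In a real Tika parser the corresponding change would presumably go into whatever set `getSupportedTypes(ParseContext)` returns, expressed as `MediaType` values rather than strings.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class OcrTypeFilter {

    // Keep only the opt-in OCR types, i.e. media types of the form image/ocr-*.
    public static Set<String> restrictToOcrTypes(Set<String> advertised) {
        Set<String> kept = new LinkedHashSet<>();
        for (String type : advertised) {
            if (type.startsWith("image/ocr-")) {
                kept.add(type);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The three non-ocr types are the ones named in the comment above;
        // image/ocr-png is an assumed example of an opt-in type.
        Set<String> advertised = new LinkedHashSet<>(List.of(
                "image/ocr-png",
                "image/jp2",
                "image/jpx",
                "image/x-portable-pixmap"));

        System.out.println(restrictToOcrTypes(advertised)); // [image/ocr-png]
    }
}
```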
   
   (I'm just posting this; I have no opinion on it myself.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
