[ https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566951#comment-17566951 ]
Hudson commented on TIKA-3812:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #685 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/685/])
TIKA-3812 -- add unit tests to confirm parser order with >= 2.4.1 (tallison: [https://github.com/apache/tika/commit/19b0337d60d91778d6837f88e62d151586888a79])
* (add) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/resources/2.4.0-no-tesseract.txt
* (add) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/resources/2.4.0-tesseract.txt
* (edit) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/pom.xml
* (add) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/java/org/apache/tika/parser/scientific/integration/TestParsers.java
* (add) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/resources/2.4.1-no-tesseract.txt
* (add) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/resources/2.4.1-tesseract.txt

> Parser Order: image get parsed by GDALParser instead of TesseractOCRParser
> --------------------------------------------------------------------------
>
>                 Key: TIKA-3812
>                 URL: https://issues.apache.org/jira/browse/TIKA-3812
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.1
>            Reporter: Eugen Caruntu
>            Priority: Minor
>         Attachments: parser-diffs.tgz
>
>
> The selected parser seems to be different in 2.4.1. For example sending an
> image (jpg/png) that was previously (2.4.0) processed by TesseractOCRParser,
> now gets parsed by GDALParser.
> Seems that when multiple parsers support same file types, the selected parser
> depends on the order in which they get loaded.
> For example the GDALParser, ImageParser and TesseractOCRParser all support
> image/jpeg, image/png, image/gif ...
> A recent change is reversing the parser order (TIKA-3750).
> Re-configuring the GDALParser by excluding the image mime types might work,
> but there could be other duplicated parsers.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
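The behaviour described in the quoted report (several parsers claiming the same media types, with the winner decided by load order) can be illustrated with a small self-contained sketch. This is a toy model only, not Tika's actual code: the class ParserOrderSketch, the FakeParser type, and the helpers buildLookup and excluding are all hypothetical names, and the "last registered parser wins" rule is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical, not Tika classes): shows how a type -> parser
// lookup built from an ordered parser list depends on that order.
class ParserOrderSketch {

    // Stand-in for a parser: a name plus the media types it claims.
    static final class FakeParser {
        final String name;
        final List<String> types;

        FakeParser(String name, String... types) {
            this.name = name;
            this.types = Arrays.asList(types);
        }
    }

    // Assumption for this sketch: a later parser overwrites an earlier one
    // that registered the same media type ("last one wins").
    static Map<String, String> buildLookup(List<FakeParser> loadOrder) {
        Map<String, String> lookup = new LinkedHashMap<>();
        for (FakeParser parser : loadOrder) {
            for (String type : parser.types) {
                lookup.put(type, parser.name);
            }
        }
        return lookup;
    }

    // Returns a copy of the parser that no longer claims the excluded types,
    // modelling the "exclude the image mime types" workaround.
    static FakeParser excluding(FakeParser parser, String... excluded) {
        List<String> excludedTypes = Arrays.asList(excluded);
        List<String> kept = new ArrayList<>();
        for (String type : parser.types) {
            if (!excludedTypes.contains(type)) {
                kept.add(type);
            }
        }
        return new FakeParser(parser.name, kept.toArray(new String[0]));
    }

    public static void main(String[] args) {
        FakeParser gdal = new FakeParser("GDALParser",
                "image/jpeg", "image/png", "image/gif");
        FakeParser tesseract = new FakeParser("TesseractOCRParser",
                "image/jpeg", "image/png", "image/gif");

        // Same parsers, opposite load order, different winner.
        System.out.println(buildLookup(Arrays.asList(gdal, tesseract)).get("image/png"));
        System.out.println(buildLookup(Arrays.asList(tesseract, gdal)).get("image/png"));

        // With the image types excluded from the GDAL stand-in, order no longer matters.
        FakeParser gdalNoImages = excluding(gdal, "image/jpeg", "image/png", "image/gif");
        System.out.println(buildLookup(Arrays.asList(tesseract, gdalNoImages)).get("image/png"));
    }
}
```

In this model, reversing the list flips the parser chosen for image/png, while removing the overlapping types from one parser makes the result independent of order. As the report notes, that only addresses the overlaps that are explicitly excluded; any other pair of parsers sharing a type would still be order-dependent.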