[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-07-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563221#comment-17563221
 ] 

Tim Allison commented on TIKA-3812:
---

Ugh.  Thank you.  I'll look into this.  I don't think we have any integration 
tests with the scientific parsers.  That type of testing would have caught 
this.  Let me see what else may have changed.

I'm not sure of the path forward.  The fix in TIKA-3750 was indeed a fix, and 
I'm hesitant to revert that.

> Parser Order: image get parsed by GDALParser instead of TesseractOCRParser
> --
>
> Key: TIKA-3812
> URL: https://issues.apache.org/jira/browse/TIKA-3812
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.1
>Reporter: Eugen Caruntu
>Priority: Minor
> Fix For: 2.4.2
>
>
> The selected parser seems to be different in 2.4.1. For example sending an 
> image (jpg/png) that was previously (2.4.0) processed by TesseractOCRParser, 
> now gets parsed by GDALParser.
> Seems that when multiple parsers support same file types, the selected parser 
> depends on the order in which they get loaded.
> For example the GDALParser, ImageParser and TesseractOCRParser all support 
> image/jpeg, image/png, image/gif ...
> A recent change is reversing the parser order (TIKA-3750).
> Re-configuring the GDALParser by excluding the image mime types might work, 
> but there could be other duplicated parsers.
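
To illustrate the report above, here is a minimal sketch of how load order can decide which parser wins when several claim the same MIME type. It is a toy model, not Tika's code: the ToyParser class, the resolve method, the sample load order and the "later registration overwrites earlier" rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only; this is not Tika's CompositeParser code.
final class ParserOrderSketch {

    // Hypothetical stand-in for a parser: a name plus the MIME types it claims.
    static final class ToyParser {
        final String name;
        final List<String> types;

        ToyParser(String name, String... types) {
            this.name = name;
            this.types = Arrays.asList(types);
        }
    }

    // Assumed rule: parsers are visited in list order, and a later parser
    // overwrites an earlier one that claimed the same MIME type.
    static Map<String, String> resolve(List<ToyParser> parsers) {
        Map<String, String> chosen = new LinkedHashMap<>();
        for (ToyParser parser : parsers) {
            for (String type : parser.types) {
                chosen.put(type, parser.name);
            }
        }
        return chosen;
    }

    // Load order invented for illustration; the report does not state the real one.
    static List<ToyParser> sampleOrder() {
        return new ArrayList<>(Arrays.asList(
                new ToyParser("GDALParser", "image/jpeg", "image/png", "image/gif"),
                new ToyParser("ImageParser", "image/png", "image/gif"),
                new ToyParser("TesseractOCRParser", "image/jpeg", "image/png")));
    }

    public static void main(String[] args) {
        List<ToyParser> parsers = sampleOrder();
        System.out.println("original order: " + resolve(parsers).get("image/png"));
        // Reversing the load order changes the winner for the shared type.
        Collections.reverse(parsers);
        System.out.println("reversed order: " + resolve(parsers).get("image/png"));
    }
}
```

With this sample order the winner for image/png is TesseractOCRParser; after the list is reversed it is GDALParser, which mirrors the kind of change described in the report.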



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-07-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563243#comment-17563243
 ] 

Tim Allison commented on TIKA-3812:
---

These are the diffs when tesseract is not installed:
{noformat}
application/x-hdf 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.hdf.HDFParser
application/x-netcdf 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.netcdf.NetCDFParser
image/bmp 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
image/gif 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
image/jpeg 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.JpegParser
image/png 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
video/mp4 2.4.1: class org.apache.tika.parser.external.CompositeExternalParser 2.4.0: class org.apache.tika.parser.mp4.MP4Parser
{noformat}

These are the diffs when tesseract is installed:
{noformat}
application/x-hdf 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.hdf.HDFParser
application/x-netcdf 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.netcdf.NetCDFParser
image/bmp 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
image/gif 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
image/jp2 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.ocr.TesseractOCRParser
image/jpeg 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.JpegParser
image/png 2.4.1: class org.apache.tika.parser.gdal.GDALParser 2.4.0: class org.apache.tika.parser.image.ImageParser
video/mp4 2.4.1: class org.apache.tika.parser.external.CompositeExternalParser 2.4.0: class org.apache.tika.parser.mp4.MP4Parser
{noformat}
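
A comparison like the listings above can be sketched as a diff between two maps from MIME type to selected parser class. This is only an illustration of the idea and not the code that produced those listings: the class and method names are hypothetical, the sample maps hold just two of the entries shown above, and the text/plain entry with org.example.UnchangedParser is an invented placeholder for a type that did not change.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the comparison only; not the code used to produce the listings.
final class ParserDiffSketch {

    // Returns one line per MIME type whose selected parser differs between
    // the two versions. A type missing from the 2.4.0 map is reported with
    // "class null" on the 2.4.0 side.
    static List<String> diff(Map<String, String> v240, Map<String, String> v241) {
        List<String> lines = new ArrayList<>();
        // TreeMap iteration keeps the report sorted by MIME type.
        for (Map.Entry<String, String> e : new TreeMap<>(v241).entrySet()) {
            String before = v240.get(e.getKey());
            if (!e.getValue().equals(before)) {
                lines.add(e.getKey() + " 2.4.1: class " + e.getValue()
                        + " 2.4.0: class " + before);
            }
        }
        return lines;
    }

    // Two entries taken from the thread plus one invented, unchanged entry.
    static Map<String, String> sample240() {
        Map<String, String> m = new HashMap<>();
        m.put("image/png", "org.apache.tika.parser.image.ImageParser");
        m.put("video/mp4", "org.apache.tika.parser.mp4.MP4Parser");
        m.put("text/plain", "org.example.UnchangedParser"); // placeholder
        return m;
    }

    static Map<String, String> sample241() {
        Map<String, String> m = new HashMap<>();
        m.put("image/png", "org.apache.tika.parser.gdal.GDALParser");
        m.put("video/mp4", "org.apache.tika.parser.external.CompositeExternalParser");
        m.put("text/plain", "org.example.UnchangedParser"); // placeholder
        return m;
    }

    public static void main(String[] args) {
        for (String line : diff(sample240(), sample241())) {
            System.out.println(line);
        }
    }
}
```

Only the image/png and video/mp4 lines are printed, because the placeholder entry maps to the same class in both versions.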



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-07-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563250#comment-17563250
 ] 

Tim Allison commented on TIKA-3812:
---

I think the above behavior is actually an improvement in 2.4.1. If you have 
{{tika-parser-scientific-package}} on your class path, I think you'd want that 
to run instead of the ImageParser and Tesseract, no?  Or, are you interested in 
other parsers in the scientific-package and do not want GDAL?

Further, if you have exiftool installed, that is now getting triggered on mp4, 
which should be the desired behavior.

What do you think?

> Parser Order: image get parsed by GDALParser instead of TesseractOCRParser
> --
>
> Key: TIKA-3812
> URL: https://issues.apache.org/jira/browse/TIKA-3812
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.1
>Reporter: Eugen Caruntu
>Priority: Minor
> Fix For: 2.4.2
>
> Attachments: parser-diffs.tgz
>
>
> The selected parser seems to be different in 2.4.1. For example sending an 
> image (jpg/png) that was previously (2.4.0) processed by TesseractOCRParser, 
> now gets parsed by GDALParser.
> Seems that when multiple parsers support same file types, the selected parser 
> depends on the order in which they get loaded.
> For example the GDALParser, ImageParser and TesseractOCRParser all support 
> image/jpeg, image/png, image/gif ...
> A recent change is reversing the parser order (TIKA-3750).
> Re-configuring the GDALParser by excluding the image mime types might work, 
> but there could be other duplicated parsers.





[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-07-06 Thread Eugen Caruntu (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563260#comment-17563260
 ] 

Eugen Caruntu commented on TIKA-3812:
-

Thank you for confirming the overlapping parser/mime types.

We can handle this in the configuration file with a mime-exclude for GDALParser 
for those image types, since the overlap seems mostly limited to them.

I understand that parsers added beyond the standard ones will take precedence, 
and that seems fair.

> Parser Order: image get parsed by GDALParser instead of TesseractOCRParser
> --
>
> Key: TIKA-3812
> URL: https://issues.apache.org/jira/browse/TIKA-3812
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.1
>Reporter: Eugen Caruntu
>Priority: Minor
> Attachments: parser-diffs.tgz
>
>
> The selected parser seems to be different in 2.4.1. For example sending an 
> image (jpg/png) that was previously (2.4.0) processed by TesseractOCRParser, 
> now gets parsed by GDALParser.
> Seems that when multiple parsers support same file types, the selected parser 
> depends on the order in which they get loaded.
> For example the GDALParser, ImageParser and TesseractOCRParser all support 
> image/jpeg, image/png, image/gif ...
> A recent change is reversing the parser order (TIKA-3750).
> Re-configuring the GDALParser by excluding the image mime types might work, 
> but there could be other duplicated parsers.





[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-07-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563269#comment-17563269
 ] 

Tim Allison commented on TIKA-3812:
---

I'm sorry for the surprise.  Surprises are not intended and not good.



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-07-14 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566943#comment-17566943
 ] 

Tilman Hausherr commented on TIKA-3812:
---

Build fails on my machine (W10):
{noformat}
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.979 s <<< FAILURE! - in org.apache.tika.parser.scientific.integration.TestParsers
[ERROR] org.apache.tika.parser.scientific.integration.TestParsers.testDiffsFrom241  Time elapsed: 0.951 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: expected:  but was: 
        at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
        at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
        at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182)
        at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177)
        at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1141)
        at org.apache.tika.parser.scientific.integration.TestParsers.testDiffsFrom241(TestParsers.java:66)
 {noformat}



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-07-14 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566946#comment-17566946
 ] 

Tim Allison commented on TIKA-3812:
---

Sorry.  Just pushed a fix.  Tracking to see whether that fixes it.



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-07-14 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566951#comment-17566951
 ] 

Hudson commented on TIKA-3812:
--

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #685 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/685/])
TIKA-3812 -- add unit tests to confirm parser order with >= 2.4.1 (tallison: 
[https://github.com/apache/tika/commit/19b0337d60d91778d6837f88e62d151586888a79])
* (add) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/resources/2.4.0-no-tesseract.txt
* (add) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/resources/2.4.0-tesseract.txt
* (edit) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/pom.xml
* (add) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/java/org/apache/tika/parser/scientific/integration/TestParsers.java
* (add) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/resources/2.4.1-no-tesseract.txt
* (add) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/resources/2.4.1-tesseract.txt




[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-07-14 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566954#comment-17566954
 ] 

Tilman Hausherr commented on TIKA-3812:
---

Thanks, it works now.



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-07-14 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566974#comment-17566974
 ] 

Hudson commented on TIKA-3812:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #686 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/686/])
TIKA-3812 -- test should work whether or not ffmpeg and exiftool are installed 
or not. (tallison: 
[https://github.com/apache/tika/commit/9dc592f95200885af75671f0e0381770d8b8298f])
* (edit) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/resources/2.4.1-tesseract.txt
* (edit) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/resources/2.4.1-no-tesseract.txt
* (edit) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/src/test/java/org/apache/tika/parser/scientific/integration/TestParsers.java




[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-04 Thread David Pilato (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612617#comment-17612617
 ] 

David Pilato commented on TIKA-3812:


I'm still having issues with 2.5.0.

My OCR tests work as expected with 2.4.0, but they break in 2.4.1 and 2.5.0.

I'm trying to run OCR on a PNG image, which is available at 
[https://github.com/dadoonet/fscrawler/blob/master/test-documents/src/main/resources/documents/test-ocr.png]

I don't see any metadata extracted with 2.5.0. I have no idea what is happening 
or whether I need to change something on my end.

> Parser Order: image get parsed by GDALParser instead of TesseractOCRParser
> --
>
> Key: TIKA-3812
> URL: https://issues.apache.org/jira/browse/TIKA-3812
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.1
>Reporter: Eugen Caruntu
>Priority: Minor
> Fix For: 2.5.0
>
> Attachments: parser-diffs.tgz
>
>
> The selected parser seems to be different in 2.4.1. For example sending an 
> image (jpg/png) that was previously (2.4.0) processed by TesseractOCRParser, 
> now gets parsed by GDALParser.
> Seems that when multiple parsers support same file types, the selected parser 
> depends on the order in which they get loaded.
> For example the GDALParser, ImageParser and TesseractOCRParser all support 
> image/jpeg, image/png, image/gif ...
> A recent change is reversing the parser order (TIKA-3750).
> Re-configuring the GDALParser by excluding the image mime types might work, 
> but there could be other duplicated parsers.





[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612638#comment-17612638
 ] 

Tim Allison commented on TIKA-3812:
---

Ugh [~dadoonet]...sorry.  Are you using tika-scientific-parsers?  Can you share 
the contents of the metadata object after the test -- which parsers touched the 
file?



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-04 Thread David Pilato (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612645#comment-17612645
 ] 

David Pilato commented on TIKA-3812:


[~tallison] I have always had {{tika-parser-scientific-package}} in my 
{{pom.xml}}.

The {{metadata}} object seems to be {{null}}, which looks weird to me. Maybe 
I did something stupid that I can't see...

I will try to reproduce this with a small piece of test code.



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612656#comment-17612656
 ] 

Tim Allison commented on TIKA-3812:
---

I added two unit tests (thank you for the png!): 
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/src/test/java/org/apache/tika/parser/ocr/TestOCR.java





[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612664#comment-17612664
 ] 

Tim Allison commented on TIKA-3812:
---

Yes, sorry.  That test actually shows that gdal is being called once I get the 
imports correct.  

We could modify gdal to extend AbstractImageParser, which would call 
tesseract on those file formats that tesseract handles.  This, however, would 
skip the metadata extraction by the "regular" image parsers.

So, the way to get the legacy behavior is to have gdal skip the "regular" image 
formats through configuration, or we could modify the GDAL parser to skip 
jpg/png etc as its new behavior.

No good options.





[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612665#comment-17612665
 ] 

Tim Allison commented on TIKA-3812:
---

In 1.x, the regular image parsers took precedence over gdal so it was never 
called on png/jpeg, etc.



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612666#comment-17612666
 ] 

Tim Allison commented on TIKA-3812:
---

So, the proposal would be to remove png, jpeg and gif from the file formats 
that GDAL processes.  Users could configure their GDAL parsers to parse those 
formats if they wanted.  This would be closer to the older behavior.
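
The proposal can be pictured with a small sketch: a parser's effective type set is its full set minus a default skip list, and a user can opt individual types back in. This is a toy model under stated assumptions and not how GDALParser is implemented or configured: the class name, the effectiveTypes method and the opt-in parameter are hypothetical, and the full type list holds only types mentioned in this thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model of "skip common image types by default, allow opt-in".
final class GdalTypesSketch {

    // Types the proposal would drop from GDAL by default.
    static final Set<String> SKIPPED_BY_DEFAULT = Collections.unmodifiableSet(
            new LinkedHashSet<>(Arrays.asList("image/png", "image/jpeg", "image/gif")));

    // Hypothetical helper: everything in allTypes except the default skips,
    // unless the user explicitly opted a skipped type back in.
    static Set<String> effectiveTypes(Set<String> allTypes, Set<String> optIn) {
        Set<String> skipped = new LinkedHashSet<>(SKIPPED_BY_DEFAULT);
        skipped.removeAll(optIn);
        Set<String> result = new LinkedHashSet<>(allTypes);
        result.removeAll(skipped);
        return result;
    }

    // A few GDAL-handled types named in this thread; not the full list.
    static Set<String> sampleTypes() {
        return new LinkedHashSet<>(Arrays.asList(
                "image/png", "image/jpeg", "image/gif",
                "application/x-netcdf", "application/x-hdf"));
    }

    public static void main(String[] args) {
        // Default: the common image types are left to the other parsers.
        System.out.println(effectiveTypes(sampleTypes(), Collections.<String>emptySet()));
        // Opt-in: the user asks GDAL to handle PNG again.
        System.out.println(effectiveTypes(sampleTypes(), Collections.singleton("image/png")));
    }
}
```

With no opt-in, only application/x-netcdf and application/x-hdf remain in the sample; opting in image/png restores that one type and leaves jpeg and gif skipped.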



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-04 Thread David Pilato (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612667#comment-17612667
 ] 

David Pilato commented on TIKA-3812:


I'm totally fine modifying the code on my side to make this work. But it's 
unclear to me what I need to do ;) 

 

Here is how I'm building the parser and the context. What should I change there?

[https://github.com/dadoonet/fscrawler/blob/master/tika/src/main/java/fr/pilato/elasticsearch/crawler/fs/tika/TikaInstance.java#L84-L178]

 



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612671#comment-17612671
 ] 

Tim Allison commented on TIKA-3812:
---

https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/src/test/resources/config/tika-config-restricted-gdal.xml



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-04 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612695#comment-17612695
 ] 

Hudson commented on TIKA-3812:
--

FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #830 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/830/])
TIKA-3812 -- add unit test to confirm plain png and jpeg work (tallison: 
[https://github.com/apache/tika/commit/f69c0ba5d976a72f8075fbd81d732d6ca74a188d])
* (add) tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/src/test/resources/test-documents/testOCR.jpg
* (edit) tika-integration-tests/tika-resource-loading-tests/src/test/java/org/apache/custom/parser/CustomParserTest.java
* (add) tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/src/test/resources/test-documents/testOCR.png
* (edit) tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/pom.xml
* (add) tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/src/test/java/org/apache/tika/parser/ocr/TestOCR.java




[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-04 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612732#comment-17612732
 ] 

Hudson commented on TIKA-3812:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #831 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/831/])
TIKA-3812 -- fix unit test to confirm plain png and jpeg work with config file 
(tallison: 
[https://github.com/apache/tika/commit/634f9191f1a1f3cd21a5ff4311af249663567716])
* (edit) tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/pom.xml
* (edit) tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/src/test/java/org/apache/tika/parser/ocr/TestOCR.java
* (add) tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/src/test/resources/config/tika-config-restricted-gdal.xml




[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-05 Thread David Pilato (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612937#comment-17612937
 ] 

David Pilato commented on TIKA-3812:


When excluding {{GDALParser}} from the {{DefaultParser}}, I'm able to get 
the same behavior as before.

But I don't know how to add the {{GDALParser}} back, with exclusions on the 
mime types you showed, using Java code. Which method should I use to mimic the 
following?

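{{    <parser class="org.apache.tika.parser.gdal.GDALParser">}}
{{      <mime-exclude>image/jpeg</mime-exclude>}}
{{      <mime-exclude>image/png</mime-exclude>}}
{{      <mime-exclude>image/jp2</mime-exclude>}}
{{      <mime-exclude>image/gif</mime-exclude>}}
{{    </parser>}}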



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612939#comment-17612939
 ] 

Tim Allison commented on TIKA-3812:
---

I'd frankly have a default tika-config.xml and just load it from there.  Let me 
try some things.



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612941#comment-17612941
 ] 

Tim Allison commented on TIKA-3812:
---

Added a unit test.  It is UGLY.

https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/src/test/java/org/apache/tika/parser/ocr/TestOCR.java#L68



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612942#comment-17612942
 ] 

Tim Allison commented on TIKA-3812:
---

The benefit of having a default tika-config.xml is that users can look at it 
and make changes as necessary.  With hard-coded configuration, that gets a bit 
more difficult.

If you need to inject the tesseract config stuff, you might consider a 
tika-config template like we have in our unit tests?  See, for example, 
{{"{ATTACHMENT_STRATEGY}"}}: 
https://github.com/apache/tika/blob/main/tika-integration-tests/tika-pipes-solr-integration-tests/src/test/resources/tika-config-solr-urls.xml
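
The template idea can be sketched in plain Java as a string substitution. This is only an illustration, not how Tika's tests do it: the `fill` helper, the `example-setting` element, and the value `EXAMPLE_VALUE` are hypothetical, and only the `ATTACHMENT_STRATEGY` placeholder name comes from the comment above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ConfigTemplateSketch {

    /** Replaces every {KEY} placeholder in the template with its value. */
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            result = result.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical template; a real one would be read from a resource file.
        String template =
                "<properties>\n"
              + "  <example-setting>{ATTACHMENT_STRATEGY}</example-setting>\n"
              + "</properties>\n";

        Map<String, String> values = new LinkedHashMap<>();
        values.put("ATTACHMENT_STRATEGY", "EXAMPLE_VALUE"); // hypothetical value

        // The filled-in text could then be written to a temp file or wrapped in a stream
        // and handed to whatever loads the configuration.
        System.out.print(fill(template, values));
    }
}
```

Values are inserted verbatim, so anything containing `<`, `>` or `&` would need XML escaping before substitution.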

But I don't want to put more work on you/break what works!



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-05 Thread David Pilato (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612949#comment-17612949
 ] 

David Pilato commented on TIKA-3812:


Amazing! That helps!

I definitely want to read the settings from an XML file in the future so I 
can expose all settings instead of doing that programmatically. There is 
actually support for this already, but it's not the default in FSCrawler and 
it needs a lot of documentation updates ;) 

 

 



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612985#comment-17612985
 ] 

Tim Allison commented on TIKA-3812:
---

Documentation?!  But you have unit tests! :P LOL

If documentation's your thing, we could certainly use more help! :D

Let us know if you have any questions!



[jira] [Commented] (TIKA-3812) Parser Order: image get parsed by GDALParser instead of TesseractOCRParser

2022-10-05 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612994#comment-17612994
 ] 

Hudson commented on TIKA-3812:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #833 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/833/])
TIKA-3812 add example of how to configure gdal programmatically (tallison: 
[https://github.com/apache/tika/commit/33c21a3a4c9b4805908600083786eb11c127fd94])
* (edit) tika-parsers/tika-parsers-extended/tika-parsers-extended-integration-tests/src/test/java/org/apache/tika/parser/ocr/TestOCR.java

