[
https://issues.apache.org/jira/browse/NIFI-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lars Francke updated NIFI-15098:
--------------------------------
Description:
Running TestExtractMediaMetadata on a system that has Tesseract installed but
NOT the English Tesseract data files fails:
{noformat}
[pool-3-thread-1] INFO org.apache.tika.parser.ocr.TesseractOCRParser -
Tesseract is installed and is being invoked. This can add greatly to processing
time. If you do not want tesseract to be applied to your files see:
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
[pool-3-thread-1] ERROR org.apache.nifi.processors.media.ExtractMediaMetadata -
ExtractMediaMetadata[id=f5217b8a-4ac0-4876-83e0-179346ad855c] Failed to extract
media metadata from FlowFile[0,16color-10x10.bmp,198B]:
org.apache.nifi.processor.exception.ProcessException: java.io.IOException:
org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1
err msg: Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your
"tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
org.apache.nifi.processor.exception.ProcessException:
java.io.IOException: org.apache.tika.exception.TikaException:
TesseractOCRParser bad exit value 1 err msg: Error opening data file
/usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your
"tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
[snip]
at
org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:176)
at org.apache.nifi.util.MockProcessSession.read(MockProcessSession.java:633)
[snip]
at
org.apache.tika.parser.ocr.TesseractOCRParser.runOCRProcess(TesseractOCRParser.java:493)
at
org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:447)
at
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:334)
at
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:276)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
at
org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:106)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:284)
at
org.apache.nifi.processors.media.ExtractMediaMetadata.tika_parse(ExtractMediaMetadata.java:200)
at
org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:173)
... 93 more
org.opentest4j.AssertionFailedError: Expected all Transferred FlowFiles to go to
success but 1 were routed to failure
[snip] {noformat}
I see multiple options to handle this:
* Ignore
* At least document the behavior for the tests in question (testBmp and
testJpg)
* Disable OCR entirely, assuming OCR was never intended to be part of metadata
extraction here (my recommendation)
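To tell whether a given machine is in the failing state, a small check along these lines might help. This is only a sketch: the class and method names (TessdataCheck, resolveTessdataDir, hasLanguageData) are hypothetical, the default directory is simply the path from the error message above, and it assumes TESSDATA_PREFIX points directly at the "tessdata" directory as that message suggests. Tesseract's actual lookup rules may differ between versions.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of NiFi or Tika.
class TessdataCheck {

    // Assumption: default directory copied from the error message; other systems may use another path.
    static final String DEFAULT_TESSDATA_DIR = "/usr/share/tessdata";

    /** Returns the directory named by the given TESSDATA_PREFIX value, or the default if it is unset or empty. */
    static Path resolveTessdataDir(String tessdataPrefix) {
        if (tessdataPrefix == null || tessdataPrefix.isEmpty()) {
            return Paths.get(DEFAULT_TESSDATA_DIR);
        }
        return Paths.get(tessdataPrefix);
    }

    /** Returns true if <tessdataDir>/<language>.traineddata exists as a regular file. */
    static boolean hasLanguageData(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        Path dir = resolveTessdataDir(System.getenv("TESSDATA_PREFIX"));
        System.out.println("eng.traineddata present in " + dir + ": " + hasLanguageData(dir, "eng"));
    }
}
```

If this reports that the file is missing while Tesseract itself is installed, the environment probably matches the one described in this issue.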
h2. Disabling OCR
This is mentioned in the error message as well:
[https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr]
That page has this snippet:
{code:java}
TesseractOCRConfig config = new TesseractOCRConfig();
config.setSkipOcr(true);
ParseContext context = new ParseContext();
context.set(TesseractOCRConfig.class, config);
Parser parser = new AutoDetectParser();
parser.parse(inputStream, handler, metadata, context);
{code}
I tried this snippet and it makes the tests green even without Tesseract data
files installed.
As the tests actually check the extracted metadata, OCR does not seem to be
needed to get it. I assume disabling OCR will give a nice speed boost as well,
so this is my preferred solution. If you agree, I can put up a PR.
was:
Running the {color:#000000}TestExtractMediaMetadata on a system which does have
Tesseract installed but NOT the english tesseract data files fails:{color}
{noformat}
[pool-3-thread-1] INFO org.apache.tika.parser.ocr.TesseractOCRParser -
Tesseract is installed and is being invoked. This can add greatly to processing
time. If you do not want tesseract to be applied to your files see:
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
[pool-3-thread-1] ERROR org.apache.nifi.processors.media.ExtractMediaMetadata -
ExtractMediaMetadata[id=f5217b8a-4ac0-4876-83e0-179346ad855c] Failed to extract
media metadata from FlowFile[0,16color-10x10.bmp,198B]:
org.apache.nifi.processor.exception.ProcessException: java.io.IOException:
org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1
err msg: Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your
"tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize
tesseract.org.apache.nifi.processor.exception.ProcessException:
java.io.IOException: org.apache.tika.exception.TikaException:
TesseractOCRParser bad exit value 1 err msg: Error opening data file
/usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your
"tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
[snip]
at
org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:176)
at org.apache.nifi.util.MockProcessSession.read(MockProcessSession.java:633)
[snip]
at
org.apache.tika.parser.ocr.TesseractOCRParser.runOCRProcess(TesseractOCRParser.java:493)
at
org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:447)
at
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:334)
at
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:276)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
at
org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:106)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:284)
at
org.apache.nifi.processors.media.ExtractMediaMetadata.tika_parse(ExtractMediaMetadata.java:200)
at
org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:173)
... 93 moreorg.opentest4j.AssertionFailedError: Expected all Transferred
FlowFiles to go to success but 1 were routed to failure
[snip] {noformat}
I see multiple options to handle this:
* Ignore
* At least document the behavior for the tests in question (testBmp and
testJpg)
* Assuming that OCR is not even intended for this to extract metadata we can
disable OCR entirely (my recommendation)
h2. Disabling OCR
This is mentioned in the error message as well:
[https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr]
That has this snippet
{code:java}
TesseractOCRConfig config = new TesseractOCRConfig();
config.setSkipOcr(true);
ParseContext context = new ParseContext();
context.set(TesseractOCRConfig.class, config);
Parser parser = new AutoDetectParser();
parser.parse(inputStream, handler, metadata, context); {code}
I tried this snippet and it makes the tests green even without Tesseract data
files installed.
As the tests actually check for the extracted metadata OCR does not seem to be
needed. As I assume this'll give a nice speed boost as well I believe this'd be
my favorite solution. If you agree I can put up a PR.
> TestExtractMediaMetadata fails when Tesseract ENG data is missing
> -----------------------------------------------------------------
>
> Key: NIFI-15098
> URL: https://issues.apache.org/jira/browse/NIFI-15098
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Lars Francke
> Assignee: Lars Francke
> Priority: Minor
>
> Running TestExtractMediaMetadata on a system that has Tesseract installed but
> NOT the English Tesseract data files fails:
> {noformat}
> [pool-3-thread-1] INFO org.apache.tika.parser.ocr.TesseractOCRParser -
> Tesseract is installed and is being invoked. This can add greatly to
> processing time. If you do not want tesseract to be applied to your files
> see:
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
> [pool-3-thread-1] ERROR org.apache.nifi.processors.media.ExtractMediaMetadata
> - ExtractMediaMetadata[id=f5217b8a-4ac0-4876-83e0-179346ad855c] Failed to
> extract media metadata from FlowFile[0,16color-10x10.bmp,198B]:
> org.apache.nifi.processor.exception.ProcessException: java.io.IOException:
> org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1
> err msg: Error opening data file /usr/share/tessdata/eng.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to your
> "tessdata" directory.
> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
> org.apache.nifi.processor.exception.ProcessException:
> java.io.IOException: org.apache.tika.exception.TikaException:
> TesseractOCRParser bad exit value 1 err msg: Error opening data file
> /usr/share/tessdata/eng.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to your
> "tessdata" directory.
> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
> [snip]
> at
> org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:176)
> at
> org.apache.nifi.util.MockProcessSession.read(MockProcessSession.java:633)
> [snip]
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.runOCRProcess(TesseractOCRParser.java:493)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:447)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:334)
> at
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:276)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
> at
> org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:106)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
> at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:284)
> at
> org.apache.nifi.processors.media.ExtractMediaMetadata.tika_parse(ExtractMediaMetadata.java:200)
> at
> org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:173)
> ... 93 more
> org.opentest4j.AssertionFailedError: Expected all Transferred FlowFiles to go
> to success but 1 were routed to failure
> [snip] {noformat}
>
> I see multiple options to handle this:
> * Ignore
> * At least document the behavior for the tests in question (testBmp and
> testJpg)
> * Disable OCR entirely, assuming OCR was never intended to be part of metadata
> extraction here (my recommendation)
> h2. Disabling OCR
> This is mentioned in the error message as well:
> [https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr]
>
> That page has this snippet:
> {code:java}
> TesseractOCRConfig config = new TesseractOCRConfig();
> config.setSkipOcr(true);
> ParseContext context = new ParseContext();
> context.set(TesseractOCRConfig.class, config);
>
> Parser parser = new AutoDetectParser();
> parser.parse(inputStream, handler, metadata, context);
> {code}
> I tried this snippet and it makes the tests green even without Tesseract data
> files installed.
> As the tests actually check the extracted metadata, OCR does not seem to be
> needed to get it. I assume disabling OCR will give a nice speed boost as well,
> so this is my preferred solution. If you agree, I can put up a PR.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)