[ 
https://issues.apache.org/jira/browse/NIFI-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Francke updated NIFI-15098:
--------------------------------
    Description: 
Running TestExtractMediaMetadata on a system that has Tesseract installed but 
NOT the English Tesseract data files fails:
{noformat}
[pool-3-thread-1] INFO org.apache.tika.parser.ocr.TesseractOCRParser - Tesseract is installed and is being invoked. This can add greatly to processing time.  If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
[pool-3-thread-1] ERROR org.apache.nifi.processors.media.ExtractMediaMetadata - ExtractMediaMetadata[id=f5217b8a-4ac0-4876-83e0-179346ad855c] Failed to extract media metadata from FlowFile[0,16color-10x10.bmp,198B]: org.apache.nifi.processor.exception.ProcessException: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1 err msg: Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
org.apache.nifi.processor.exception.ProcessException: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1 err msg: Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

[snip]

    at org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:176)
    at org.apache.nifi.util.MockProcessSession.read(MockProcessSession.java:633)

[snip]

    at org.apache.tika.parser.ocr.TesseractOCRParser.runOCRProcess(TesseractOCRParser.java:493)
    at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:447)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:334)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:276)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:106)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:284)
    at org.apache.nifi.processors.media.ExtractMediaMetadata.tika_parse(ExtractMediaMetadata.java:200)
    at org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:173)
    ... 93 more
org.opentest4j.AssertionFailedError: Expected all Transferred FlowFiles to go to success but 1 were routed to failure

[snip] {noformat}
 

I see multiple options to handle this:
 * Ignore
 * At least document the behavior for the tests in question (testBmp and 
testJpg)
 * Assuming that OCR is not needed here to extract metadata, disable OCR 
entirely (my recommendation)
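
For the "document the behavior" option, the condition that triggers the failure could also be checked up front. The following is only a minimal sketch using the JDK alone: the class {{TessdataCheck}} and its methods are hypothetical helpers, not part of NiFi or Tika, and it assumes that TESSDATA_PREFIX names the "tessdata" directory itself (as the error message suggests), falling back to the /usr/share/tessdata path seen in the log above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of NiFi or Tika.
class TessdataCheck {

    // Returns true if <tessdataDir>/<language>.traineddata exists as a regular file.
    static boolean isLanguageAvailable(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    // Assumption: TESSDATA_PREFIX points at the tessdata directory itself.
    // Falls back to the path reported in the error message of this issue.
    static Path defaultTessdataDir() {
        String prefix = System.getenv("TESSDATA_PREFIX");
        return Paths.get(prefix == null || prefix.isEmpty() ? "/usr/share/tessdata" : prefix);
    }

    public static void main(String[] args) {
        Path dir = defaultTessdataDir();
        System.out.println("eng data in " + dir + ": " + isLanguageAvailable(dir, "eng"));
    }
}
```

A test could use such a check to skip or explain itself when the English data is absent. It would not remove the dependency on the local Tesseract installation, which is why disabling OCR remains the cleaner option.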

h2. Disabling OCR

This is mentioned in the error message as well: 
[https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr]

 

That page has this snippet:
{code:java}
        TesseractOCRConfig config = new TesseractOCRConfig();
        config.setSkipOcr(true);
        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, config);
        
        Parser parser = new AutoDetectParser();
        parser.parse(inputStream, handler, metadata, context); {code}
I tried this snippet, and the tests pass even without the Tesseract data files 
installed.

As the tests actually check for the extracted metadata, OCR does not seem to be 
needed to get this metadata. I assume this will also give a nice speed boost, 
so this is my preferred solution. If you agree, I can put up a PR.

 

  was:
Running the {color:#000000}TestExtractMediaMetadata on a system which does have 
Tesseract installed but NOT the english tesseract data files fails:{color}
{noformat}
[pool-3-thread-1] INFO org.apache.tika.parser.ocr.TesseractOCRParser - Tesseract is installed and is being invoked. This can add greatly to processing time.  If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
[pool-3-thread-1] ERROR org.apache.nifi.processors.media.ExtractMediaMetadata - ExtractMediaMetadata[id=f5217b8a-4ac0-4876-83e0-179346ad855c] Failed to extract media metadata from FlowFile[0,16color-10x10.bmp,198B]: org.apache.nifi.processor.exception.ProcessException: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1 err msg: Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
org.apache.nifi.processor.exception.ProcessException: java.io.IOException: org.apache.tika.exception.TikaException: TesseractOCRParser bad exit value 1 err msg: Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

[snip]

    at org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:176)
    at org.apache.nifi.util.MockProcessSession.read(MockProcessSession.java:633)

[snip]

    at org.apache.tika.parser.ocr.TesseractOCRParser.runOCRProcess(TesseractOCRParser.java:493)
    at org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:447)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:334)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:276)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at org.apache.tika.parser.image.AbstractImageParser.parse(AbstractImageParser.java:106)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:284)
    at org.apache.nifi.processors.media.ExtractMediaMetadata.tika_parse(ExtractMediaMetadata.java:200)
    at org.apache.nifi.processors.media.ExtractMediaMetadata.lambda$onTrigger$0(ExtractMediaMetadata.java:173)
    ... 93 more
org.opentest4j.AssertionFailedError: Expected all Transferred FlowFiles to go to success but 1 were routed to failure

[snip] {noformat}
 

I see multiple options to handle this:
 * Ignore
 * At least document the behavior for the tests in question (testBmp and 
testJpg)
 * Assuming that OCR is not even intended for this to extract metadata we can 
disable OCR entirely (my recommendation)

h2. Disabling OCR

This is mentioned in the error message as well: 
[https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr]

 

That has this snippet
{code:java}
        TesseractOCRConfig config = new TesseractOCRConfig();
        config.setSkipOcr(true);
        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, config);
        
        Parser parser = new AutoDetectParser();
        parser.parse(inputStream, handler, metadata, context); {code}
I tried this snippet and it makes the tests green even without Tesseract data 
files installed.

As the tests actually check for the extracted metadata OCR does not seem to be 
needed. As I assume this'll give a nice speed boost as well I believe this'd be 
my favorite solution. If you agree I can put up a PR.

 


> TestExtractMediaMetadata fails when Tesseract ENG data is missing
> -----------------------------------------------------------------
>
>                 Key: NIFI-15098
>                 URL: https://issues.apache.org/jira/browse/NIFI-15098
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Lars Francke
>            Assignee: Lars Francke
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
