[ https://issues.apache.org/jira/browse/TIKA-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428841#comment-17428841 ]

Tim Allison commented on TIKA-2970:
-----------------------------------

Please ask this and similar questions on the u...@tika.apache.org list 
(https://tika.apache.org/mail-lists.html).  

That said, I recently updated our wikis to document the new configurations in 
2.x.

See "Overriding Default Configuration" on the TikaOCR wiki page: 
https://cwiki.apache.org/confluence/display/tika/tikaocr 

See the general statement about parser configuration: 
https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0

I also updated the documentation for the PDFParser.  There's still more to do.  
Please let us know on user@ what else requires improved documentation.  Thank 
you!

> Configuring Tesseract for OCR of PDF via Tika Config is not working
> -------------------------------------------------------------------
>
>                 Key: TIKA-2970
>                 URL: https://issues.apache.org/jira/browse/TIKA-2970
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>    Affects Versions: 1.22
>            Reporter: David Eric Pugh
>            Assignee: Tim Allison
>            Priority: Critical
>             Fix For: 1.23
>
>
> Based on TIKA-2705, I thought I could eliminate the use of the properties 
> files for configuring PDF and OCR processing, and just use a tika-config.xml 
> file.
> I believe I have a unit test demonstrating that if you need to override 
> the Tesseract path for OCR, you always end up with the default Tesseract 
> configuration, which leads to Tika throwing an error: 
> https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java#L328
>    
> In stepping through the code, it seems like every time we consult the context:
> ```
> TesseractOCRConfig tesseractConfig =
>         context.get(TesseractOCRConfig.class, DEFAULT_TESSERACT_CONFIG);
> ```
> We always get back the default.  The context never has our customized 
> TesseractOCRConfig, even though when we load the TikaConfig in the first 
> case, I can see that we do create a TesseractOCRParser object with the 
> various parameters set.
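
To illustrate the behavior described in the report: a lookup of the form 
context.get(SomeConfig.class, defaultValue) can only return a customized value 
if something placed that value into the context beforehand.  A parser that 
holds its own parameters does not, by itself, populate the context.  The 
following is a minimal, self-contained sketch of that lookup pattern.  
SketchContext and SketchOcrConfig are hypothetical stand-ins written for this 
example; they are not Tika's ParseContext or TesseractOCRConfig, and the path 
used is an assumed placeholder.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextLookupSketch {

    /** Hypothetical stand-in for a parse context: values keyed by class. */
    public static class SketchContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        public <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        /** Returns the stored value for key, or defaultValue if none was set. */
        public <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Hypothetical stand-in for an OCR configuration object. */
    public static class SketchOcrConfig {
        public final String tesseractPath;

        public SketchOcrConfig(String tesseractPath) {
            this.tesseractPath = tesseractPath;
        }
    }

    /** Default config with an empty path, used when the context has no entry. */
    public static final SketchOcrConfig DEFAULT_CONFIG = new SketchOcrConfig("");

    /** Mirrors the lookup in the report: consult the context, fall back to the default. */
    public static String resolvePath(SketchContext context) {
        return context.get(SketchOcrConfig.class, DEFAULT_CONFIG).tesseractPath;
    }

    public static void main(String[] args) {
        SketchContext context = new SketchContext();

        // Nothing has been set on the context yet, so the default comes back.
        System.out.println("before set: '" + resolvePath(context) + "'");

        // Only after the custom config is placed on the context is it returned.
        context.set(SketchOcrConfig.class, new SketchOcrConfig("/opt/tesseract/bin"));
        System.out.println("after set:  '" + resolvePath(context) + "'");
    }
}
```

In this sketch the customized path is visible only after the explicit set call, 
which matches the symptom in the report: the lookup keeps returning the default 
as long as nothing has put the custom configuration into the context.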



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
