[ https://issues.apache.org/jira/browse/TIKA-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428841#comment-17428841 ]

Tim Allison commented on TIKA-2970:
-----------------------------------

Please ask this and similar questions on the u...@tika.apache.org list (https://tika.apache.org/mail-lists.html).

That said, I recently updated our wikis to document the new configurations in 2.x.

See the Overriding Default Configuration: https://cwiki.apache.org/confluence/display/tika/tikaocr

See the general statement about parser configuration: https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0

I also updated the documentation for the PDFParser. There's still more to do. Please let us know on user@ what else requires improved documentation.

Thank you!

> Configuring Tesseract for OCR of PDF via Tika Config is not working
> -------------------------------------------------------------------
>
>                 Key: TIKA-2970
>                 URL: https://issues.apache.org/jira/browse/TIKA-2970
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>    Affects Versions: 1.22
>            Reporter: David Eric Pugh
>            Assignee: Tim Allison
>            Priority: Critical
>             Fix For: 1.23
>
>
> Based on TIKA-2705, I thought I could eliminate the use of the properties
> files for configuring PDF and OCR processing, and just use a tika-config.xml
> file.
> I believe I have a unit test that demonstrates that if you need to override
> the tesseract path for OCR, you end up always with the default Tesseract
> configuration, which leads to Tika throwing an error:
> https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java#L328
>
> In stepping through the code, it seems like every time we consult the context:
> ```
> TesseractOCRConfig tesseractConfig =
>         context.get(TesseractOCRConfig.class,
>                 DEFAULT_TESSERACT_CONFIG);
> ```
> We always get back the default. The context never has our customized
> TesseractOCRConfig! Despite the fact that when we load up the TikaConfig in
> the first case, I notice that we do create a TesseractOCRParser object WITH
> the various parameters...

--
This message was sent by Atlassian Jira
(v8.3.4#803005)