[ https://issues.apache.org/jira/browse/TIKA-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956230#comment-16956230 ]
Hudson commented on TIKA-2970:
------------------------------
SUCCESS: Integrated in Jenkins build Tika-trunk #1716 (See [https://builds.apache.org/job/Tika-trunk/1716/])
TIKA-2970 -- ensure that configuration of the tesseract parser is (tallison: [https://github.com/apache/tika/commit/6b237ac3cdb0b43b86884248f14424e217e19ea2])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* (edit) CHANGES.txt
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (add) tika-parsers/src/test/resources/org/apache/tika/parser/pdf/tika-ocr-config.xml
> Configuring Tesseract for OCR of PDF via Tika Config is not working
> -------------------------------------------------------------------
>
> Key: TIKA-2970
> URL: https://issues.apache.org/jira/browse/TIKA-2970
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Affects Versions: 1.22
> Reporter: David Eric Pugh
> Assignee: Tim Allison
> Priority: Critical
> Fix For: 1.23
>
>
> Based on TIKA-2705, I thought I could eliminate the use of the properties
> files for configuring PDF and OCR processing, and just use a tika-config.xml
> file.
> I believe I have a unit test that demonstrates that if you need to override
> the tesseract path for OCR, you always end up with the default Tesseract
> configuration, which leads to Tika throwing an error:
> https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java#L328
>
> In stepping through the code, it seems like every time we consult the context:
> ```
> TesseractOCRConfig tesseractConfig =
>         context.get(TesseractOCRConfig.class, DEFAULT_TESSERACT_CONFIG);
> ```
> We always get back the default. The context never has our customized
> TesseractOCRConfig, even though, when we load up the TikaConfig in the first
> case, I notice that we do create a TesseractOCRParser object WITH the various
> parameters...
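
The symptom described above can be reproduced in isolation with a minimal sketch. It does not use Tika itself: `Context` and `OcrConfig` are hypothetical stand-ins for `ParseContext` and `TesseractOCRConfig`, and the path `/opt/tesseract/bin` is an invented example value. The point it illustrates is only that a lookup with a default falls back to that default whenever nobody has put the customized object into the context, regardless of whether some other object (here, the parser's own config) holds the right settings.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextLookupSketch {

    // Hypothetical stand-in for a parse context: a map keyed by class name.
    static class Context {
        private final Map<String, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key.getName(), value);
        }

        // Returns the stored value, or defaultValue when nothing was set for this key.
        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key.getName());
            return value == null ? defaultValue : key.cast(value);
        }
    }

    // Hypothetical stand-in for an OCR configuration object.
    static class OcrConfig {
        final String tesseractPath;

        OcrConfig(String tesseractPath) {
            this.tesseractPath = tesseractPath;
        }
    }

    // Default configuration with an empty path, used when the context has no entry.
    static final OcrConfig DEFAULT_CONFIG = new OcrConfig("");

    // Mirrors the lookup pattern quoted in the issue description.
    static String resolvePath(Context context) {
        return context.get(OcrConfig.class, DEFAULT_CONFIG).tesseractPath;
    }

    public static void main(String[] args) {
        Context context = new Context();

        // The customized config exists (e.g. held by a parser instance),
        // but it has not been placed in the context.
        OcrConfig parserConfig = new OcrConfig("/opt/tesseract/bin");

        // Lookup falls back to the default: prints "[]".
        System.out.println("[" + resolvePath(context) + "]");

        // Once the config is placed in the context, the lookup sees it:
        // prints "[/opt/tesseract/bin]".
        context.set(OcrConfig.class, parserConfig);
        System.out.println("[" + resolvePath(context) + "]");
    }
}
```

Whether the actual change for this issue propagates the parser's configuration into the context or resolves it some other way is not stated in this message; the sketch only shows why the default is returned while the context is empty.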