[ https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006689#comment-16006689 ]
Eugen Mayer commented on TIKA-2359:
-----------------------------------

Oh holy.. seriously? OCR runs by default simply because a lib is installed, one that gets installed by LibreOffice? This is incredibly odd, seriously.

For the googlers:

cat /etc/tika.xml

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>

export TIKA_CONFIG=/etc/tika.xml

And then just run

java -jar tika.jar test.pdf

> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
>                 Key: TIKA-2359
>                 URL: https://issues.apache.org/jira/browse/TIKA-2359
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Eugen Mayer
>         Attachments: Sample-doc-file-2000kb.doc
>
>
> i have 93s for parsing this document using 1.14 in server or in cli mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> debian-jessie, 8GB ram in a docker container, current xeon 3GHz, so decent (2 cores limited)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)