[ 
https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008628#comment-16008628
 ] 

Chris A. Mattmann commented on TIKA-2359:
-----------------------------------------

This is a tough one. In general I'd be fine with adding a boolean parameter to 
the Tesseract config, org.apache.tika.parser.ocr.tesseract.enable (default 
"false"). That said, doing so would break things for those who, since TIKA-93, 
expect that if they install Tesseract, Tika picks it up and uses it. So it 
would be an extremely non-backwards-compatible change, because we would now 
require users to install a config file, update their Java system properties, 
or set Tika config parameters, which isn't nice at all. Part of the 
convenience of Tika "picking up" Tesseract is that it is zero config, zero 
maintenance. 

Any change to this needs careful thought, documentation updates on the wiki 
and in CHANGES.txt, and convenience scripts, etc., that make it extremely 
painless both for the one-time upgrade and for using OCR with Tika going 
forward. I am in the boat of users who rely on this by default if Tesseract is 
available/installed.

Consider the opposite: would it be so hard to simply add a property to turn it 
on/off, have it on by default, and allow it to be disabled with, e.g., 
java -Dorg.apache.tika.parser.ocr.tesseract=false? To me that's easier, handles 
back compat better, and is less intrusive.
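
A minimal sketch of that "on by default, opt out via system property" check, 
assuming the property name from the example command line above. The class and 
method names (OcrToggle, isOcrEnabled) are hypothetical and are not existing 
Tika code; how the parser would actually consult the flag is left open.

```java
import java.util.Properties;

public class OcrToggle {
    // Property name taken from the example command line in this comment;
    // it is a proposal, not an existing Tika setting.
    public static final String OCR_PROPERTY = "org.apache.tika.parser.ocr.tesseract";

    // OCR stays enabled unless the property is explicitly set to "false",
    // so existing zero-config installs keep their current behaviour.
    public static boolean isOcrEnabled(Properties props) {
        String value = props.getProperty(OCR_PROPERTY);
        return value == null || !"false".equalsIgnoreCase(value.trim());
    }

    public static void main(String[] args) {
        // Run with -Dorg.apache.tika.parser.ocr.tesseract=false to opt out.
        System.out.println("OCR enabled: " + isOcrEnabled(System.getProperties()));
    }
}
```

Treating anything other than an explicit "false" as enabled is one way to 
keep the default backwards compatible: users who never set the property see 
no change.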

My 2c.

> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
>                 Key: TIKA-2359
>                 URL: https://issues.apache.org/jira/browse/TIKA-2359
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Eugen Mayer
>         Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93s using 1.14, in both server and CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> debian-jessie, 8GB ram in a docker container, current xeon 3GHz, so decent (2 
> cores limited)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
