[ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933957#comment-15933957
 ] 

Tim Allison commented on TIKA-2293:
-----------------------------------

[~ThejanWijesinghe], thank you for sharing this and running some comparisons 
with our current Tesseract parser.

I really like:
 1. The notion that users don't have to figure out how to install Tesseract on 
their system.  "Simple" plug and play.
 2. The theoretical simplicity of not having to create the temp files and make 
a system call to python and tesseract etc.
 3. The notion of being able to use some of the lower-level features of 
Tesseract that aren't available from the commandline...but I only have a vague 
notion of these...what features from the underlying Tesseract do we need that 
aren't available from the commandline?
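
For reference on point 2, here is a rough sketch of the out-of-process pattern (temp file plus a system call) that an in-process API would avoid. This is not the actual TesseractOCRParser code. The class and method names are made up for illustration, it uses ProcessBuilder rather than Runtime.exec, and it assumes a {{tesseract}} binary on the PATH that accepts the usual {{tesseract <image> <outputbase> -l <lang>}} arguments and writes {{<outputbase>.txt}}.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the real TesseractOCRParser.
class OutOfProcessOcrSketch {

    // Builds the argument list for: tesseract <image> <outputBase> -l <lang>
    static List<String> buildCommand(Path image, Path outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");           // assumed to be on the PATH
        cmd.add(image.toString());
        cmd.add(outputBase.toString()); // tesseract appends ".txt" itself
        cmd.add("-l");
        cmd.add(lang);
        return cmd;
    }

    // Runs tesseract as a child process and reads the text it wrote to disk.
    static String ocr(Path image, String lang) throws IOException, InterruptedException {
        // Reserve a unique base name; the OCR text lands in <base>.txt.
        Path outputBase = Files.createTempFile("ocr-sketch-", "");
        Path textFile = Paths.get(outputBase.toString() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, lang))
                    .inheritIO() // forward the child's stdout/stderr so it cannot block
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Clean up both the placeholder file and the generated text file.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }
}
```

Every call pays for process startup and two temp files, and it only works if the binary is installed. That is the cost points 1 and 2 are about.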

I'm concerned about:
 1a. The LGPL license on ghost4j means that we can't bundle it with our jars. 
Am I understanding the ghost4j license correctly?  If so, and if we don't 
include ghost4j, what will happen?  Is it only used for PDFs, so that we'd be 
on our own for those?
 1b. There's another LGPL license on leptonica4j's rococoa dependency.  What 
happens if we can't bundle that?
 2.  The general notion of packaging native libs.  I undid that choice with our 
sqlite parser and required that users add that jar to their classpath.
 3.  We'd be adding 38 MB to the tika-app and tika-server jars.  That's just 
for the Windows dlls, right? Do I understand correctly that Linux users would 
be on their own to install {{libtesseract.so}}?
 4. tess4j comes with the English language pack.  Users who want other 
languages would still have to download the additional language packs and 
install them in the tessdata directory, which cuts into the appeal of "runs 
tesseract out of the box".

>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
>                 Key: TIKA-2293
>                 URL: https://issues.apache.org/jira/browse/TIKA-2293
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: Thejan Wijesinghe
>             Fix For: 1.15
>
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the 
> command line. The intention of this new parser, "Tess4jOCRParser", is to use 
> the Tess4J API instead of executing tesseract out of process via Runtime.exec.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
