[ 
https://issues.apache.org/jira/browse/TIKA-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278296#comment-17278296
 ] 

Peter Kronenberg commented on TIKA-3286:
----------------------------------------

Well, it's a marginal improvement.  But I still feel that the Tesseract message 
is misleading and not very user-friendly.  This is what I get:
{code:java}
Please make sure the TESSDATA_PREFIX environment variable is set to your 
"tessdata" directory.
Failed loading language 'abc'
Tesseract couldn't load any languages!
Could not initialize tesseract.{code}
This is a condition that just requires a 1-line error message: "Language 'abc' 
not found".  Lines 1, 3 and 4 are just plain wrong.  It found the tessdata 
directory.  It *can* load languages, just not the non-existent ones.  And it 
*did* initialize tesseract. It just couldn't load the language.

I realize that your philosophy is probably that you're just passing back what 
Tesseract says, but I think Tika is trying to add a layer of user-friendliness 
to it.  So why not avoid passing any bad data to Tesseract in the first place?

Also, it's still not handling the script languages.
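
To make the suggestion concrete, here is a rough sketch of the kind of pre-check I have in mind. It is not Tika's actual code: the class and method names (TessLangCheck, resolve) are made up for illustration, and it assumes the usual layout where each language is a <name>.traineddata file in _tessdata_ and each script is a <Name>.traineddata file in _tessdata/script_. It splits the language string on the plus sign, checks each part against the file system, and rewrites script names to the script/<Name> form that Tesseract seems to require.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika.
public class TessLangCheck {

    /**
     * Splits a language string such as "eng+Latin" on '+' and checks that each
     * part has a .traineddata file in tessdata or tessdata/script.
     * Returns the string to pass to "tesseract -l", with script names
     * rewritten as "script/<Name>". Throws if any part is missing.
     */
    public static String resolve(Path tessdata, String languages) {
        List<String> resolved = new ArrayList<>();
        for (String lang : languages.split("\\+")) {
            if (lang.isEmpty()) {
                throw new IllegalArgumentException(
                        "Empty language in '" + languages + "'");
            }
            String file = lang + ".traineddata";
            if (Files.isRegularFile(tessdata.resolve(file))) {
                // Found directly under tessdata (also covers "script/Latin").
                resolved.add(lang);
            } else if (Files.isRegularFile(tessdata.resolve("script").resolve(file))) {
                // Found under tessdata/script: add the path prefix for the user.
                resolved.add("script/" + lang);
            } else {
                throw new IllegalArgumentException(
                        "Language '" + lang + "' not found");
            }
        }
        return String.join("+", resolved);
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway tessdata directory so the sketch runs anywhere.
        Path tessdata = Files.createTempDirectory("tessdata");
        Files.createDirectories(tessdata.resolve("script"));
        Files.createFile(tessdata.resolve("eng.traineddata"));
        Files.createFile(tessdata.resolve("script").resolve("Latin.traineddata"));

        System.out.println(resolve(tessdata, "eng+Latin"));
        try {
            resolve(tessdata, "abc");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, the 'abc' case fails before Tesseract is ever invoked, with the one-line message instead of the four lines above. A real version would also want to reject names containing path tricks such as "..", which this sketch does not do.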

> Tika does not issue an error when language file doesn't exist; not supporting 
> script files
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3286
>                 URL: https://issues.apache.org/jira/browse/TIKA-3286
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Peter Kronenberg
>            Priority: Major
>         Attachments: list-lang.png, nolang.png, script.png
>
>
> Tika uses a regular expression to validate the language string, assuming it 
> is a set of ISO 639-2 language codes separated by plus signs.  However, script 
> files (in the _script_ directory) can have any arbitrary name, with the only 
> rule being that they start with a capital letter.  The scripts were 
> introduced in 4.0.0, [https://github.com/manisandro/gImageReader/issues/323]
>  
> In addition, if the user specifies an invalid language (i.e., the string 
> matches the regular expression, but there is no corresponding language file 
> in Tessdata), no error message is issued.  Tesseract issues some very ugly 
> and misleading messages which simply assume that you haven't set the 
> _tessdata_ directory correctly, but they are not captured by Tika (and not 
> sure they would be appropriate anyway).  Tika just blindly calls Tesseract 
> but then doesn't get any output back.
>   !nolang.png!
> I suggest parsing the language string by the plus sign and not doing any 
> other validating on the string, but instead, actually checking to see that 
> the file exists in either _tessdata_ or _tessdata/script_.
> If any of them doesn't exist, then throw an exception, similar to what is done 
> now when the language doesn't match the regular expression.
> I've started to prototype this.
>  
> Later: I'm trying to clarify how the scripts are intended to be used.  The 
> page referenced above as well as 
> [https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#LANGUAGES]
> imply that the _-l_ option accepts the name of a language or script.  I assumed it 
> would look in _tessdata_ first and if not found, would look in 
> _tessdata/script_.  But it seems you have to enter the path.
>   !script.png!
> _tesseract --list-lang_ displays them this way
>   !list-lang.png!
> so it clearly knows about the _script_ directory.  But it expects the user to 
> know it as well.  Not sure if we want Tika to be friendlier about this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
