[ 
https://issues.apache.org/jira/browse/TIKA-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Kronenberg updated TIKA-3286:
-----------------------------------
    Attachment: nolang.png

> Tika does not issue an error when language file doesn't exist; not supporting 
> script files
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-3286
>                 URL: https://issues.apache.org/jira/browse/TIKA-3286
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Peter Kronenberg
>            Priority: Major
>         Attachments: image-2021-01-28-13-57-44-888.png, list-lang.png, 
> nolang.png, script.png
>
>
> Tika uses a regular expression to validate the language string, assuming it 
> is a set of ISO 639-2 language codes separated by plus signs.  However, script 
> files (in the _script_ directory) can have any arbitrary name, the only 
> rule being that they start with a capital letter.  The scripts were introduced 
> in Tesseract 4.0.0; see [https://github.com/manisandro/gImageReader/issues/323]
>  
> In addition, if the user specifies an invalid language (i.e., the string 
> matches the regular expression, but there is no corresponding language file 
> in _tessdata_), no error message is issued.  Tesseract issues some very ugly 
> and misleading messages which simply assume that you haven't set the 
> _tessdata_ directory correctly, but they are not captured by Tika (and I'm 
> not sure they would be appropriate anyway).  Tika just blindly calls 
> Tesseract and then doesn't get any output back.
> !image-2021-01-28-13-52-14-978.png!
> I suggest splitting the language string on the plus sign and not doing any 
> other validation of the string, but instead actually checking that a file 
> for each name exists in either _tessdata_ or _tessdata/script_.
> If any of them doesn't exist, then throw an exception, similar to what is 
> done now when the language doesn't match the regular expression.
> I've started to prototype this.
>  
> Later: I'm trying to clarify how the scripts are intended to be used.  The 
> page referenced above as well as 
> [https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#LANGUAGES] 
> imply that the _-l_ option accepts the name of a language or script.  I assumed it 
> would look in _tessdata_ first and if not found, would look in 
> _tessdata/script_.  But it seems you have to enter the path.
>   !image-2021-01-28-13-52-56-294.png!
> _tesseract --list-langs_ displays them this way
>  
> so it clearly knows about the _script_ directory.  But it expects the user to 
> know it as well.  Not sure whether we want to make Tika friendlier here.
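
The check suggested in the description (split the language string on plus
signs, then look for each name in _tessdata_ and _tessdata/script_) might be
sketched roughly as follows. This is only an illustration, not Tika's code:
the class and method names are hypothetical, the exception type is a
placeholder, and it assumes each language or script is stored as a
<name>.traineddata file.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or Tesseract.
public class TessdataLanguageCheck {

    // Returns the components of a plus-separated language string that have
    // no <name>.traineddata file in tessdata or in tessdata/script.
    public static List<String> findMissing(Path tessdata, String languages) {
        Path root = tessdata.toAbsolutePath().normalize();
        List<String> missing = new ArrayList<>();
        for (String name : languages.split("\\+")) {
            if (name.isEmpty()) {
                missing.add(name); // e.g. a leading plus sign or an empty string
                continue;
            }
            String file = name + ".traineddata";
            // An explicit "script/Latin" is found by the first lookup;
            // a bare "Latin" is found by the second one.
            if (!existsUnder(root, file) && !existsUnder(root, "script/" + file)) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Throws if any component is missing. The exception type is a placeholder;
    // a real implementation would use whatever the surrounding code throws.
    public static void validate(Path tessdata, String languages) {
        List<String> missing = findMissing(tessdata, languages);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                    "No traineddata file found in " + tessdata + " for: " + missing);
        }
    }

    // True if root/relative is a regular file that stays inside root
    // (guards against names such as "../../somewhere/else").
    private static boolean existsUnder(Path root, String relative) {
        Path candidate = root.resolve(relative).normalize();
        return candidate.startsWith(root) && Files.isRegularFile(candidate);
    }
}
```

With a check like this, a bare name such as Latin would be accepted when only
tessdata/script/Latin.traineddata exists. Whether Tika should then pass
script/Latin on to Tesseract is the open question raised above.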



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
