[
https://issues.apache.org/jira/browse/TIKA-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Kronenberg updated TIKA-3286:
-----------------------------------
Attachment: script.png
> Tika not issue an error when language file doesn't exist; not supporting
> script files
> -------------------------------------------------------------------------------------
>
> Key: TIKA-3286
> URL: https://issues.apache.org/jira/browse/TIKA-3286
> Project: Tika
> Issue Type: Improvement
> Reporter: Peter Kronenberg
> Priority: Major
> Attachments: list-lang.png, script.png
>
>
> Tika uses a regular expression to validate the language string, assuming it
> is a set of ISO 639-2 language codes separated by plus signs. However, script
> files (in the _script_ directory) can have any arbitrary name, with the only
> rule being that they start with a capital letter. The scripts were introduced
> in 4.0.0, [https://github.com/manisandro/gImageReader/issues/323]
>
> In addition, if the user specifies an invalid language (i.e., the string
> matches the regular expression, but there is no corresponding language file
> in _tessdata_), no error message is issued. Tesseract issues some very ugly
> and misleading messages which simply assume that you haven't set the
> _tessdata_ directory correctly, but they are not captured by Tika (and I'm
> not sure they would be appropriate anyway). Tika just blindly calls Tesseract
> but then doesn't get any output back.
> !image-2021-01-28-13-52-14-978.png!
> I suggest splitting the language string on the plus sign and not doing any
> other validation of the string, but instead actually checking that each
> file exists in either _tessdata_ or _tessdata/script_.
> If any of them doesn't exist, then throw an exception, similar to what is done
> now when the language doesn't match the regular expression.
> I've started to prototype this.
>
> Later: I'm trying to clarify how the scripts are intended to be used. The
> page referenced above, as well as
> [https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#LANGUAGES],
> imply that the _-l_ option accepts the name of a language or script. I assumed
> it would look in _tessdata_ first and, if not found, would look in
> _tessdata/script_. But it seems you have to enter the path.
> !image-2021-01-28-13-52-56-294.png!
> _tesseract --list-langs_ displays them this way
>
> so it clearly knows about the _script_ directory. But it expects the user to
> know it as well. Not sure whether we want to make Tika friendlier here.
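
The check proposed in the description could be sketched roughly as follows. This is only an illustration, not Tika's actual code: the class and method names are hypothetical, and it assumes that each language or script is stored as <name>.traineddata and that Tesseract accepts script/<name> as a value for -l, as the description suggests. It does not guard against path separators or ".." in the supplied names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika's API.
final class TessLanguageResolver {

    private TessLanguageResolver() {
    }

    /**
     * Splits the language string on '+', checks that each entry has a
     * .traineddata file in tessdata or tessdata/script, and returns the
     * string to pass to -l (entries found only under script/ get that prefix).
     */
    static String resolve(Path tessdata, String languages) {
        List<String> resolved = new ArrayList<>();
        // Limit -1 keeps trailing empty tokens, so "eng+" is rejected below.
        for (String token : languages.split("\\+", -1)) {
            String name = token.trim();
            if (name.isEmpty()) {
                throw new IllegalArgumentException(
                        "Empty language entry in: " + languages);
            }
            String fileName = name + ".traineddata";
            if (Files.isRegularFile(tessdata.resolve(fileName))) {
                // Found directly in tessdata (an explicit "script/X" also lands here).
                resolved.add(name);
            } else if (Files.isRegularFile(tessdata.resolve("script").resolve(fileName))) {
                // Found only under tessdata/script: add the path prefix for the user.
                resolved.add("script/" + name);
            } else {
                throw new IllegalArgumentException(
                        "No " + fileName + " in " + tessdata + " or its script directory");
            }
        }
        return String.join("+", resolved);
    }
}
```

With a tessdata directory that contains eng.traineddata and script/Latin.traineddata, resolve(tessdata, "eng+Latin") returns "eng+script/Latin", while an entry with no matching file raises an IllegalArgumentException instead of letting Tesseract fail silently.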
--
This message was sent by Atlassian Jira
(v8.3.4#803005)