[jira] [Updated] (TIKA-3286) Tika not issue an error when language file doesn't exist; not supporting script files

Peter Kronenberg (Jira) Thu, 28 Jan 2021 11:01:00 -0800


     [ 
https://issues.apache.org/jira/browse/TIKA-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Peter Kronenberg updated TIKA-3286:
-----------------------------------
    Description: 
Tika uses a regular expression to validate the language string, assuming it is 
a set of ISO-639-2 language codes separated by plus signs.  However, script files 
(in the _script_ directory) can have any arbitrary name, with the only rule 
being that they start with a capital letter.  The scripts were introduced in 
Tesseract 4.0.0; see [https://github.com/manisandro/gImageReader/issues/323]
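
To illustrate the mismatch, here is a minimal sketch. The pattern below is a hypothetical stand-in for "three-letter codes joined by plus signs", not the expression Tika actually uses, and the class and method names are made up for the example. It shows that a script name such as Latin is rejected, while a well-formed but nonexistent code such as xyz is accepted.

```java
import java.util.regex.Pattern;

public class LanguagePatternDemo {
    // Hypothetical pattern (an assumption, not Tika's real one):
    // lowercase three-letter codes separated by '+'.
    static final Pattern ISO_CODES = Pattern.compile("[a-z]{3}(\\+[a-z]{3})*");

    // True when the whole string matches the pattern; says nothing about
    // whether a matching language file is installed.
    static boolean looksValid(String language) {
        return ISO_CODES.matcher(language).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"eng", "eng+fra", "Latin", "script/Latin", "xyz"}) {
            System.out.println(s + " -> " + looksValid(s));
        }
    }
}
```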

 

In addition, if the user specifies an invalid language (i.e., the string 
matches the regular expression, but there is no corresponding language file in 
_tessdata_), no error message is issued.  Tesseract itself prints some very ugly 
and misleading messages which simply assume that you haven't set the _tessdata_ 
directory correctly, but Tika does not capture them (and I'm not sure they would 
be appropriate anyway).  Tika just blindly calls Tesseract and then gets no 
output back.

  !nolang.png!

I suggest splitting the language string on the plus sign and not doing any other 
validation of the string, but instead actually checking that each file 
exists in either _tessdata_ or _tessdata/script_.

If any of them doesn't exist, then throw an exception, similar to what is done 
now when the language doesn't match the regular expression.
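
A rough sketch of that check, under some assumptions: each name maps to a file called <name>.traineddata, the class and method names are hypothetical, and IllegalArgumentException stands in for whatever exception Tika would really throw. This is not the prototype mentioned below, just one way the idea could look.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class TessdataLanguageCheck {
    // Assumed file suffix for Tesseract model files.
    private static final String SUFFIX = ".traineddata";

    /**
     * Splits the language string on '+' and returns the model file found for
     * each name, looking in tessdata first and tessdata/script second.
     * Throws if a name is empty or has no file in either place.
     */
    static List<Path> resolve(String languages, Path tessdata) {
        List<Path> found = new ArrayList<>();
        // Limit -1 keeps trailing empty strings, so "eng+" is rejected too.
        for (String name : languages.split("\\+", -1)) {
            if (name.isEmpty()) {
                throw new IllegalArgumentException("Empty language name in: " + languages);
            }
            Path direct = tessdata.resolve(name + SUFFIX);
            Path script = tessdata.resolve("script").resolve(name + SUFFIX);
            if (Files.isRegularFile(direct)) {
                found.add(direct);
            } else if (Files.isRegularFile(script)) {
                found.add(script);
            } else {
                throw new IllegalArgumentException(
                        "No language or script file for '" + name + "' under " + tessdata);
            }
        }
        return found;
    }
}
```

With this approach the error message can name the exact entry that is missing, rather than leaving the user with empty output.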

I've started to prototype this.

 

Later: I'm trying to clarify how the scripts are intended to be used.  The page 
referenced above as well as 
[https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#LANGUAGES] 
imply that the _-l_ option accepts the name of a language or script.  I assumed it 
would look in _tessdata_ first and, if not found, would look in 
_tessdata/script_.  But it seems you have to enter the path.

  !script.png!

_tesseract --list-langs_ displays them this way

  !list-lang.png!

so it clearly knows about the _script_ directory, but it expects the user to 
know about it as well.  Not sure whether we want to make Tika friendlier here.
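
If Tika did want to be friendlier, one option would be to rewrite bare script names into the path form before calling Tesseract. The sketch below assumes that the path form is script/<Name> and that model files are named <name>.traineddata; the class and method names are hypothetical. Names that are not found anywhere are passed through unchanged.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.StringJoiner;

public class TesseractLangArg {
    /**
     * Builds a value for the -l option: a name found only under
     * tessdata/script is prefixed with "script/", everything else is kept.
     */
    static String toLangArg(String languages, Path tessdata) {
        StringJoiner arg = new StringJoiner("+");
        for (String name : languages.split("\\+")) {
            boolean inRoot = Files.isRegularFile(tessdata.resolve(name + ".traineddata"));
            boolean inScript = Files.isRegularFile(
                    tessdata.resolve("script").resolve(name + ".traineddata"));
            arg.add(!inRoot && inScript ? "script/" + name : name);
        }
        return arg.toString();
    }
}
```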



> Tika not issue an error when language file doesn't exist; not supporting 
> script files
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-3286
>                 URL: https://issues.apache.org/jira/browse/TIKA-3286
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Peter Kronenberg
>            Priority: Major
>         Attachments: list-lang.png, nolang.png, script.png
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
