[
https://issues.apache.org/jira/browse/TIKA-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826902#comment-15826902
]
ASF GitHub Bot commented on TIKA-2231:
--------------------------------------
GitHub user ham1 opened a pull request:
https://github.com/apache/tika/pull/147
TIKA-2231: Improved param validation of TesseractOCRConfig.setLanguage()
I also improved and added more test cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ham1/tika TIKA-2231
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tika/pull/147.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #147
----
commit 5c51534a5731dba0ed22bc04b7da9d95adfb6f50
Author: Graham Russell
Date: 2017-01-17T21:48:49Z
TIKA-2231: Improved param validation of TesseractOCRConfig.setLanguage()
and added more tests
----
> Invalid language code exception
> -------------------------------
>
> Key: TIKA-2231
> URL: https://issues.apache.org/jira/browse/TIKA-2231
> Project: Tika
> Issue Type: Bug
> Components: ocr
> Affects Versions: 1.14
> Reporter: Peter Weiss
> Priority: Minor
> Labels: beginner, easyfix, easytest, newbie
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> TesseractOCRConfig.setLanguage(String language) uses a regex to validate the
> language being set. Unfortunately, that regex rejects some language codes
> that are valid for Tesseract.
> For example:
> TesseractOCRConfig config = new TesseractOCRConfig();
> config.setLanguage("chi_tra");
> This throws an IllegalArgumentException because of the '_' in the language
> name, even though "chi_tra" is a valid Tesseract language code.
> The regex needs to be updated to allow the '_' character.
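
For illustration only, here is a minimal sketch of a validation that would accept codes such as "chi_tra". It is not the regex from the pull request or from Tika itself. The class name LanguageCodeCheck, its methods, and the exact pattern are hypothetical. The pattern assumes a language code is a run of ASCII letters with optional underscore-separated suffixes, and that several codes may be joined with '+' (as Tesseract accepts on its command line, e.g. "eng+chi_tra").

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the actual Tika implementation.
final class LanguageCodeCheck {

    // Assumed shape of a single code: letters, optionally followed by
    // underscore-separated groups of letters (e.g. "eng", "chi_tra").
    private static final String SINGLE_CODE = "[A-Za-z]+(?:_[A-Za-z]+)*";

    // One or more codes joined by '+', e.g. "eng+chi_tra".
    private static final Pattern LANGUAGE =
            Pattern.compile(SINGLE_CODE + "(?:\\+" + SINGLE_CODE + ")*");

    private LanguageCodeCheck() {
    }

    // Returns true if the whole string matches the assumed pattern.
    static boolean isValid(String language) {
        return language != null && LANGUAGE.matcher(language).matches();
    }

    // Mirrors the behaviour described in the issue: reject bad input
    // with an IllegalArgumentException, otherwise return the value.
    static String validateLanguage(String language) {
        if (!isValid(language)) {
            throw new IllegalArgumentException(
                    "Invalid language code: " + language);
        }
        return language;
    }
}
```

With this sketch, LanguageCodeCheck.validateLanguage("chi_tra") returns the string unchanged, while input containing other characters (for example spaces or ';') is still rejected. Matching the whole string against a letters-and-separators pattern keeps the check strict while allowing the '_' that the original regex refused.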
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)