[ https://issues.apache.org/jira/browse/TIKA-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826902#comment-15826902 ]
ASF GitHub Bot commented on TIKA-2231:
--------------------------------------

GitHub user ham1 opened a pull request:

    https://github.com/apache/tika/pull/147

    TIKA-2231: Improved param validation of TesseractOCRConfig.setLanguage()

    I also improved and added more test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ham1/tika TIKA-2231

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/147.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #147

----
commit 5c51534a5731dba0ed22bc04b7da9d95adfb6f50
Author: Graham Russell <gra...@ham1.co.uk>
Date:   2017-01-17T21:48:49Z

    TIKA-2231: Improved param validation of TesseractOCRConfig.setLanguage() and added more tests

----

> Invalid language code exception
> -------------------------------
>
>                 Key: TIKA-2231
>                 URL: https://issues.apache.org/jira/browse/TIKA-2231
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>    Affects Versions: 1.14
>            Reporter: Peter Weiss
>            Priority: Minor
>              Labels: beginner, easyfix, easytest, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> There is a regex in TesseractOCRConfig.setLanguage(String language) which
> attempts to validate the language being set. Unfortunately it does not allow
> you to set some languages that are valid for tesseract.
> For example:
> TesseractOCRConfig config = new TesseractOCRConfig();
> config.setLanguage("chi_tra");
> This throws an IllegalArgumentException because of the '_' in the language
> name. "chi_tra" is a valid tesseract language code.
> Need to update the regex to allow '_' character.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
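
For illustration only, here is a minimal sketch of the kind of validation the issue asks for: a pattern that accepts '_' inside a language code. It is not the regex from the pull request or from Tika itself. The class name LanguageCodeCheck, its methods, and the exact pattern (three letters, optional '_' segments of three or four letters, codes joined by '+') are assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not Tika's actual implementation.
final class LanguageCodeCheck {

    // Assumed shape of a language string: one or more codes joined by '+'.
    // Each code is three letters, optionally followed by '_' segments of
    // three or four letters (for example "chi_tra").
    private static final Pattern LANGUAGE = Pattern.compile(
            "[A-Za-z]{3}(_[A-Za-z]{3,4})*(\\+[A-Za-z]{3}(_[A-Za-z]{3,4})*)*");

    // Returns true when the whole string matches the assumed shape.
    static boolean isValid(String language) {
        return language != null && LANGUAGE.matcher(language).matches();
    }

    // Mirrors the behaviour described in the issue: reject bad input
    // with an IllegalArgumentException, otherwise return it unchanged.
    static String requireValid(String language) {
        if (!isValid(language)) {
            throw new IllegalArgumentException("Invalid language code: " + language);
        }
        return language;
    }

    public static void main(String[] args) {
        System.out.println(isValid("chi_tra"));
        System.out.println(isValid("eng+fra"));
        System.out.println(isValid("chi-tra"));
    }
}
```

Matching the whole string with matches(), rather than searching with find(), keeps stray characters such as '-' or whitespace from slipping through. The pattern is deliberately narrow and may reject codes that a given tesseract install accepts, so treat it as a starting point.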