Related to: https://answers.launchpad.net/sikuli/+question/658535
This and related problems seem to have been reported before in various 
forums, but not addressed.

Tesseract refuses to read any *.traineddata file when TESSDATA_PREFIX 
contains a national character. 

A normal Windows user cannot be expected to produce any path other than 
what the keyboard can output, so UTF-8 encoding the string by hand is out 
of the question.

Tesseract interprets the national character and outputs another one 
(ä -> õ), which indicates that the application converts from the Windows 
codepage (win1252) to the DOS codepage (ibm850). It accepts the same folder 
and files if one creates a symlink pointing to the same folder, so that the 
path contains no national character. 
I tested the national characters å, ä and ö (Å, Ä and Ö), but I guess more 
characters are affected, as seen in the related issue.
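
The ä -> õ substitution is what one gets when the Windows-1252 byte for ä 
is read back as IBM850. A minimal Python sketch of that arithmetic follows. 
The function name misread_as_oem is made up for illustration, and the sketch 
only shows the codepage mix-up, not what Tesseract does internally.

```python
def misread_as_oem(text: str, ansi: str = "cp1252", oem: str = "cp850") -> str:
    """Encode text with the ANSI codepage, then decode the bytes as OEM.

    This mimics a path stored as Windows-1252 bytes and shown (or
    interpreted) as IBM850. Assumption: these two codepages are the ones
    involved, as the observed ä -> õ substitution suggests.
    """
    return text.encode(ansi).decode(oem)


# Mapping for the characters tested in the report.
TABLE = {ch: misread_as_oem(ch) for ch in "åäöÅÄÖ"}

# The folder name from the report turns into the one in the error message.
garbled = misread_as_oem("bäst")  # "bõst"
```

ASCII characters are identical in both codepages, which would explain why a 
symlink with a plain ASCII name is unaffected.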
 
Steps 3 and 7 can be replaced by a direct call to tesseract.

How to reproduce: 
1) Start Command Line
2) SET TESSDATA_PREFIX=C:\bäst\tessdata\ 
3) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300 -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o- "C:\bäst\test1.pdf" 
4) Output (notice the altered character): 
Error opening data file C:\Users\bõst\tessdata\eng.traineddata 
Please make sure the TESSDATA_PREFIX environment variable is set to your 
"tessdata" directory. 
Failed loading language 'eng' Tesseract couldn't load any languages! 
**** Unable to open the initial device, quitting. 
5) Create a symlink to the same folder, ending up with a path like C:\Symlink\tessdata 
6) SET TESSDATA_PREFIX=C:\Symlink\tessdata\ 
7) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300 -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o- "C:\bäst\test1.pdf" 
8) Output (notice that the PDF path still contains a national character, ä): 
Testdata
This is a test of Tesseract.
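
The workaround in steps 5 to 7 can be sketched in Python as below. All 
function names are hypothetical. The sketch assumes tesseract is on PATH, 
and on Windows os.symlink normally needs elevated rights or Developer Mode. 
The `--list-langs` call stands in for steps 3 and 7 as a quick check that the 
traineddata files are found.

```python
import os
import subprocess


def needs_ascii_alias(path: str) -> bool:
    """Return True if the path contains any non-ASCII character."""
    return not path.isascii()


def make_alias(target_dir: str, link_path: str) -> str:
    """Create a directory symlink at link_path that points to target_dir."""
    os.symlink(target_dir, link_path, target_is_directory=True)
    return link_path


def tesseract_env(tessdata_dir: str) -> dict:
    """Copy the current environment and set TESSDATA_PREFIX in the copy."""
    env = dict(os.environ)
    env["TESSDATA_PREFIX"] = tessdata_dir
    return env


def list_languages(tessdata_dir: str) -> subprocess.CompletedProcess:
    """Run `tesseract --list-langs` with TESSDATA_PREFIX set to tessdata_dir."""
    return subprocess.run(
        ["tesseract", "--list-langs"],
        env=tesseract_env(tessdata_dir),
        capture_output=True,
        text=True,
    )
```

A caller would check needs_ascii_alias on the real folder, create the alias 
once with make_alias, and pass the alias path to list_languages.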

I tried UTF-8, but as the output message indicates, it interprets that as 
well and turns it into... something else: ├ñ in TESSDATA_PREFIX becomes +±. 
I also tried U+00E4, but that was not it. 
Should it be something like \u00e4, or perhaps \\u00e4, or even something 
else?
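
The UTF-8 attempt can be traced the same way. The sketch below (function 
names are made up) shows that the two UTF-8 bytes of ä read as IBM850 give 
├ñ, and that ñ stored as Windows-1252 and read as IBM850 gives ±. That ├ 
ends up as + is assumed to come from Windows substituting a similar-looking 
character when converting to Windows-1252, which Python's codecs do not do, 
so that step is not reproduced here.

```python
def utf8_read_as_oem(text: str, oem: str = "cp850") -> str:
    """Encode text as UTF-8 and decode the resulting bytes as the OEM codepage."""
    return text.encode("utf-8").decode(oem)


def ansi_read_as_oem(text: str, ansi: str = "cp1252", oem: str = "cp850") -> str:
    """Encode text as the ANSI codepage and decode the bytes as OEM."""
    return text.encode(ansi).decode(oem)


# UTF-8 bytes of ä (0xC3 0xA4) shown on an IBM850 console.
typed = utf8_read_as_oem("ä")

# Second half of what was typed, after the ANSI -> OEM mix-up.
second = ansi_read_as_oem("ñ")
```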

I get the same problem when running tesseract directly, just as others have 
reported.
The UTF-8/Unicode support for paths needs some attention to produce the 
expected output. 

It would be most welcome if the UTF-8 path conversion was removed 
altogether.

Note that Ghostscript itself, in the example above, handles the national 
character nicely.

//Jan-Erik
