Re: [tesseract-ocr] TESSDATA_PREFIX doesn't work with national character(s)

Jan-Erik Lärka Mon, 04 Aug 2025 05:59:09 -0700

The problem is that two places use TESSDATA_PREFIX, and they have 
conflicting requirements. 
Tesseract itself checks TESSDATA_PREFIX to see whether the prefix directory 
exists. It does this in C++ using std::filesystem::exists().


On Windows the only way that *both* the verification and loading will work 
is if the prefix exists and is composed solely of 7-bit ASCII characters, 
because that is both a valid UTF-8 encoded string, and a valid OS-specific 
path.
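
To illustrate why only 7-bit ASCII satisfies both readings, here is a small 
Python sketch. It assumes the OS-specific narrow encoding is Windows-1252 
(the real ANSI code page depends on the system locale). The helper name 
same_bytes_in_both is made up for this illustration and is not part of 
Tesseract.

```python
# Assumption: the Windows ANSI code page is Windows-1252; other locales differ.
ANSI_CODE_PAGE = "cp1252"


def same_bytes_in_both(path: str) -> bool:
    """Return True if `path` has identical bytes as UTF-8 and in the ANSI code page."""
    try:
        return path.encode("utf-8") == path.encode(ANSI_CODE_PAGE)
    except UnicodeEncodeError:
        # The path contains a character the ANSI code page cannot represent.
        return False


ascii_path = "C:\\Symlink\\tessdata\\"
national_path = "C:\\bäst\\tessdata\\"

print(same_bytes_in_both(ascii_path))     # True
print(same_bytes_in_both(national_path))  # False

# The two byte strings differ only at the national character.
print(national_path.encode("utf-8"))
print(national_path.encode(ANSI_CODE_PAGE))
```

A path for which this returns False cannot be read correctly by code that 
expects UTF-8 and by code that expects the ANSI code page at the same time.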

The code related to this therefore needs a little TLC and massaging to allow 
national characters.
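
The character substitutions in the quoted messages below (ä shown as õ, as „, 
or as ├ñ) match what happens when bytes written in one code page are read in 
another. The following Python sketch reproduces them, assuming Windows-1252 
as the ANSI code page and code page 850 as the console code page, as stated 
in the thread. It says nothing about where in Tesseract or the console the 
reinterpretation happens. The function name reinterpret is invented for this 
example.

```python
def reinterpret(text: str, written_as: str, read_as: str) -> str:
    """Encode `text` in one code page and decode the same bytes in another."""
    return text.encode(written_as).decode(read_as)


# Windows-1252 bytes read as code page 850: the "ä -> õ" in the error message.
print(reinterpret("bäst", "cp1252", "cp850"))  # bõst

# Code page 850 bytes read as Windows-1252: the "b„st" in the SET command.
print(reinterpret("bäst", "cp850", "cp1252"))  # b„st

# UTF-8 bytes read as code page 850: the "├ñ" seen when trying UTF-8.
print(reinterpret("ä", "utf-8", "cp850"))      # ├ñ
```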

On Monday, 4 August 2025 at 09:08:12 UTC+2, Jan-Erik Lärka wrote:

>
> Note that the character appears as ä in the message, but in the command 
> line window (DOS) everything has to be in codepage 850.
> So the mapping is somewhat off.
> The interesting part is that the original example finds the path, but some 
> other part of Tesseract refuses to use it.
>
> C:\Temp>set tessdata_prefix=C:\b„st\tessdata\
>
> C:\Temp>@"C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr 
> -r300 -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=1 -o- 
> "C:\bäst\1.pdf"
> Warning: TESSDATA_PREFIX C:\bäst\tessdata\ does not exist, ignore it
> Error opening data file ./eng.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to your 
> "tessdata" directory.
> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> **** Unable to open the initial device, quitting.
>
> On Saturday, 2 August 2025 at 02:44:44 UTC+2, [email protected] wrote:
>
>> Out of curiosity, would it work if you try:
>>
>> SET TESSDATA_PREFIX=C:\b„st\tessdata\
>>
>>
>> On Fri, 1 Aug 2025, 13:32 Jan-Erik Lärka, <[email protected]> 
>> wrote:
>>
>>> Related to: https://answers.launchpad.net/sikuli/+question/658535
>>> This and related problems seem to have been reported before in various 
>>> forums, but not addressed.
>>>
>>> Tesseract refuses to read any *.traineddata file when TESSDATA_PREFIX 
>>> contains a national character. 
>>>
>>> A normal Windows user would not be able to produce any other path than 
>>> what the keyboard can output, thus UTF-8 encoding a string is out of the 
>>> question.
>>>
>>> Tesseract interprets the national character and outputs another (ä -> õ), 
>>> which indicates that the application converts the Windows code page 
>>> (win1252) to the DOS one (ibm850). It accepts the same folder and files 
>>> if one creates a symlink pointing to the same folder without any national 
>>> character in the path. 
>>> I tested the national characters å, ä and ö (Å, Ä and Ö), but I guess 
>>> more characters can be affected, as seen in the related issue.
>>>  
>>> Steps 3 and 7 can be replaced by a direct call to tesseract.
>>>
>>> How to reproduce: 
>>> 1) Start Command Line
>>> 2) SET TESSDATA_PREFIX=C:\bäst\tessdata\ 
>>> 3) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300 
>>> -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o- 
>>> "C:\bäst\test1.pdf" 
>>> 4) Output (notice the altered character): 
>>> Error opening data file C:\Users\bõst\tessdata\eng.traineddata 
>>> Please make sure the TESSDATA_PREFIX environment variable is set to your 
>>> "tessdata" directory. 
>>> Failed loading language 'eng' 
>>> Tesseract couldn't load any languages! 
>>> **** Unable to open the initial device, quitting. 
>>> 5) Create symlink to the same folder to end up like C:\Symlink\tessdata 
>>> 6) SET TESSDATA_PREFIX=C:\Symlink\tessdata\ 
>>> 7) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300 
>>> -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o- 
>>> "C:\bäst\test1.pdf" 
>>> 8) Output (notice the same path that contains a national character (ä)): 
>>> Testdata
>>> This is a test of Tesseract.
>>>
>>> I tried UTF-8, but as the output message indicates, it interprets that as 
>>> well, into... something else. ├ñ in tessdata_prefix becomes +±. 
>>> I also tried U+00E4, but that was not it. 
>>> Should it be something like \u00e4 or perhaps \\u00e4 or even something 
>>> else... ? 
>>>
>>> I get the same problem running tesseract directly, just as others have 
>>> reported.
>>> The UTF-8/Unicode support present for paths needs some attention to 
>>> produce the expected output. 
>>>
>>> It would be most welcome if the UTF-8 path conversion was removed 
>>> altogether.
>>>
>>> Note that Ghostscript itself, in the example above, handles the national 
>>> character nicely.
>>>
>>> //Jan-Erik
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/1c398d62-8546-41d4-9ed6-83763a80a037n%40googlegroups.com
>>>
>>

