Note that the character appears as ä in this message, but in the command-line 
window (DOS) everything has to be in codepage 850, so the mapping is 
somewhat off.
The interesting part is that the original example finds the path, but some 
other part of Tesseract refuses to use it.
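
A minimal sketch of that codepage mismatch, for illustration only. Python and 
the helper name reinterpret are assumptions made here, not anything taken from 
Tesseract or Ghostscript, and whether Tesseract does exactly this internally 
is not confirmed. The sketch encodes ä in one codepage and decodes the same 
bytes in another, which would also explain why the ä typed in the SET line 
below shows up as „ when the text is read as Windows-1252.

```python
# Sketch: reproduce the character substitutions seen in this thread by
# encoding a character in one codepage and decoding the bytes in another.
# "reinterpret" is a hypothetical helper name used only for this example.

A_UMLAUT = "\u00e4"  # the national character ä


def reinterpret(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one codec, then decode the same bytes with another."""
    return text.encode(written_as).decode(read_as)


# Windows-1252 byte 0xE4, displayed by a codepage 850 console: õ
ansi_seen_as_oem = reinterpret(A_UMLAUT, "cp1252", "cp850")

# Codepage 850 byte 0x84, displayed as Windows-1252: „
oem_seen_as_ansi = reinterpret(A_UMLAUT, "cp850", "cp1252")

# UTF-8 bytes 0xC3 0xA4, displayed by a codepage 850 console: ├ñ
utf8_seen_as_oem = reinterpret(A_UMLAUT, "utf-8", "cp850")
```

The three results match the substitutions reported in this thread: ä -> õ, 
ä -> „ and ä -> ├ñ.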

C:\Temp>set tessdata_prefix=C:\b„st\tessdata\

C:\Temp>@"C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr 
-r300 -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=1 -o- 
"C:\bäst\1.pdf"
Warning: TESSDATA_PREFIX C:\bäst\tessdata\ does not exist, ignore it
Error opening data file ./eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your 
"tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
**** Unable to open the initial device, quitting.

On Saturday, 2 August 2025 at 02:44:44 UTC+2, [email protected] wrote:

> Out of curiosity, would it work if you try:
>
> SET TESSDATA_PREFIX=C:\b„st\tessdata\
>
>
> On Fri, 1 Aug 2025, 13:32 Jan-Erik Lärka, <[email protected]> 
> wrote:
>
>> Related to: https://answers.launchpad.net/sikuli/+question/658535
>> This and related problems seem to have been reported before in various 
>> forums, but not addressed.
>>
>> Tesseract refuses to read any *.traineddata file when TESSDATA_PREFIX 
>> contains a national character. 
>>
>> A normal Windows user would not be able to produce any other path than 
>> what the keyboard can output, thus UTF-8 encoding a string is out of the 
>> question.
>>
>> Tesseract interprets the national character and outputs another (ä -> õ), 
>> which indicates that the application converts from the Windows codepage 
>> (win1252) to the DOS codepage (ibm850). It accepts the same folder and 
>> files if one creates a symlink pointing to the same folder without any 
>> national character in the path. I tested the national characters å, ä and 
>> ö (Å, Ä and Ö), but I guess more characters can be affected, as seen in 
>> the related issue.
>>  
>> Steps 3 and 7 can be replaced by a call to tesseract instead.
>>
>> How to reproduce: 
>> 1) Start Command Line
>> 2) SET TESSDATA_PREFIX=C:\bäst\tessdata\ 
>> 3) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300 
>> -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o- 
>> "C:\bäst\test1.pdf" 
>> 4) Output (notice the altered character): 
>> Error opening data file C:\Users\bõst\tessdata\eng.traineddata 
>> Please make sure the TESSDATA_PREFIX environment variable is set to your 
>> "tessdata" directory. 
>> Failed loading language 'eng' 
>> Tesseract couldn't load any languages! 
>> **** Unable to open the initial device, quitting. 
>> 5) Create symlink to the same folder to end up like C:\Symlink\tessdata 
>> 6) SET TESSDATA_PREFIX=C:\Symlink\tessdata\ 
>> 7) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300 
>> -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o- 
>> "C:\bäst\test1.pdf" 
>> 8) Output (note that the input path still contains a national character, ä): 
>> Testdata
>> This is a test of Tesseract.
>>
>> I tried UTF-8, but as the output message indicates, it interprets that as 
>> well and turns it into something else: ├ñ in tessdata_prefix becomes +±. 
>> I also tried U+00E4, but that was not it. 
>> Should it be something like \u00e4 or perhaps \\u00e4 or even something 
>> else? 
>>
>> I get the same problem running tesseract directly, just as others have 
>> reported.
>> The UTF-8/Unicode support present for paths needs some attention to 
>> produce the expected output. 
>>
>> It would be most welcome if the UTF-8 path conversion was removed 
>> altogether.
>>
>> Note that Ghostscript itself in the example above handles the national 
>> character nicely.
>>
>> //Jan-Erik
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/1c398d62-8546-41d4-9ed6-83763a80a037n%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/1c398d62-8546-41d4-9ed6-83763a80a037n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>
