The problem is that two places try to use TESSDATA_PREFIX, and they have conflicting requirements. Tesseract itself checks TESSDATA_PREFIX to see whether the prefix directory exists; it does this in C++ using std::filesystem::exists().
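To illustrate the kind of check involved, here is a minimal C++17 sketch. It is not Tesseract's actual code, and the function names are made up for this example. It tests a prefix twice: once letting std::filesystem interpret the bytes in the platform's native narrow encoding, and once treating the same bytes as UTF-8 via std::filesystem::u8path(). That the two places differ along these lines is an assumption; the sketch does not show which part of Tesseract uses which interpretation.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// True when every byte is below 0x80, i.e. the string is plain 7-bit ASCII.
// Such a string has the same meaning as UTF-8 and in a Windows code page.
bool is_seven_bit_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) return false;
    }
    return true;
}

// Existence check where std::filesystem interprets the bytes in the native
// narrow encoding (a code page on Windows, plain bytes on POSIX systems).
bool exists_as_native(const std::string& prefix) {
    try {
        std::error_code ec;
        return fs::exists(fs::path(prefix), ec);
    } catch (...) {
        return false;  // the bytes could not be converted to a native path
    }
}

// Existence check where the same bytes are treated as UTF-8.
// Note: u8path() is deprecated in C++20 but still available.
bool exists_as_utf8(const std::string& prefix) {
    try {
        std::error_code ec;
        return fs::exists(fs::u8path(prefix), ec);
    } catch (...) {
        return false;  // the bytes were not valid UTF-8
    }
}
```

For a pure ASCII prefix such as C:\Symlink\tessdata\ both checks see the same path. For a prefix containing ä, the two can disagree on Windows, depending on which bytes ended up in the environment variable.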
On Windows, the only way that *both* the verification and the loading will work is if the prefix exists and is composed solely of 7-bit ASCII characters, because such a string is both a valid UTF-8 encoded string and a valid OS-specific path. The code related to this therefore needs a little TLC and massaging to allow national characters.

On Monday, 4 August 2025 at 09:08:12 UTC+2, Jan-Erik Lärka wrote:
>
> Note that the character appears as ä in the message, but in the command
> line window (DOS) everything has to be in codepage 850.
> So the mapping is somewhat off.
> The interesting part is that the original example finds the path, but some
> other part of tesseract refuses to use it.
>
> C:\Temp>set tessdata_prefix=C:\b„st\tessdata\
>
> C:\Temp>@"C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr
> -r300 -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=1 -o-
> "C:\bäst\1.pdf"
> Warning: TESSDATA_PREFIX C:\bäst\tessdata\ does not exist, ignore it
> Error opening data file ./eng.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to your
> "tessdata" directory.
> Failed loading language 'eng'
> Tesseract couldn't load any languages!
> **** Unable to open the initial device, quitting.
>
> On Saturday, 2 August 2025 at 02:44:44 UTC+2, [email protected] wrote:
>
>> Out of curiosity, would it work if you try:
>>
>> SET TESSDATA_PREFIX=C:\b„st\tessdata\
>>
>>
>> On Fri, 1 Aug 2025, 13:32 Jan-Erik Lärka, <[email protected]>
>> wrote:
>>
>>> Related to: https://answers.launchpad.net/sikuli/+question/658535
>>> This and related problems seem to have been reported before in various
>>> forums, but not addressed.
>>>
>>> Tesseract refuses to read any *.traineddata file when TESSDATA_PREFIX
>>> contains a national character.
>>>
>>> A normal Windows user would not be able to produce any other path than
>>> what the keyboard can output, thus UTF-8 encoding a string is out of the
>>> question.
>>>
>>> Tesseract interprets the national character and outputs another (ä -> õ),
>>> which indicates that the application converts from the Windows codepage
>>> (win1252) to the DOS codepage (ibm850). It accepts the same folder and
>>> files if one creates a symlink pointing to the same folder without any
>>> national character in the path.
>>> Tested national characters å, ä and ö (Å, Ä and Ö), but I guess more
>>> characters can be affected, as seen in the related issue.
>>>
>>> Steps 3 and 7 can be replaced by a call to tesseract instead.
>>>
>>> How to reproduce:
>>> 1) Start Command Line
>>> 2) SET TESSDATA_PREFIX=C:\bäst\tessdata\
>>> 3) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300
>>> -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o-
>>> "C:\bäst\test1.pdf"
>>> 4) Output (notice the altered character):
>>> Error opening data file C:\Users\bõst\tessdata\eng.traineddata
>>> Please make sure the TESSDATA_PREFIX environment variable is set to your
>>> "tessdata" directory.
>>> Failed loading language 'eng'
>>> Tesseract couldn't load any languages!
>>> **** Unable to open the initial device, quitting.
>>> 5) Create symlink to the same folder to end up like C:\Symlink\tessdata
>>> 6) SET TESSDATA_PREFIX=C:\Symlink\tessdata\
>>> 7) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300
>>> -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o-
>>> "C:\bäst\test1.pdf"
>>> 8) Output (notice the same path, which contains a national character (ä)):
>>> Testdata
>>> This is a test of Tesseract.
>>>
>>> I tried UTF-8, but as the output message indicates, it interprets that as
>>> well, into... something else. ├ñ in tessdata_prefix becomes +±.
>>> I also tried U+00E4, but that was not it.
>>> Should it be something like \u00e4 or perhaps \\u00e4 or even something
>>> else... ?
>>>
>>> I get the same problem running tesseract directly, just as others have
>>> reported.
>>> The UTF-8/Unicode support present for paths needs some attention to
>>> produce the expected output.
>>>
>>> It would be most welcome if the UTF-8 path conversion was removed
>>> altogether.
>>>
>>> Note that Ghostscript itself in the example above handles the national
>>> character nicely.
>>>
>>> //Jan-Erik
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/1c398d62-8546-41d4-9ed6-83763a80a037n%40googlegroups.com
>>>
>>

--
To view this discussion visit
https://groups.google.com/d/msgid/tesseract-ocr/49e40629-c550-42c1-85ef-70da728e0056n%40googlegroups.com.

