Out of curiosity, would it work if you try: SET TESSDATA_PREFIX=C:\b„st\tessdata\
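The reasoning behind that suggestion, as a minimal sketch: it assumes the mismatch really is between Windows-1252 and IBM850, as the quoted report below suspects, and it only models the byte reinterpretation, not anything Tesseract itself does. The helper name `ansi_seen_as_oem` is made up for illustration.

```python
# Sketch only: model what happens when bytes written in one codepage
# are read back as if they were in another. Assumes cp1252 (Windows ANSI)
# and cp850 (DOS/OEM), as suspected in the quoted report.

def ansi_seen_as_oem(text, ansi="cp1252", oem="cp850"):
    """Encode text in the ANSI codepage, then decode the bytes as OEM."""
    return text.encode(ansi).decode(oem)

# The path from the report: ä (0xE4 in cp1252) is read as õ in cp850.
print(ansi_seen_as_oem("bäst"))   # bõst

# The suggestion above: „ is 0x84 in cp1252, and 0x84 is ä in cp850.
print(ansi_seen_as_oem("b„st"))   # bäst

# UTF-8 bytes for ä (0xC3 0xA4) read as cp850 give the two-character form
# mentioned in the report.
print("ä".encode("utf-8").decode("cp850"))   # ├ñ
```

If the first mapping matches the error message, then typing the character whose Windows-1252 byte equals the IBM850 byte for ä might come out right after the conversion, which is why „ is worth a try.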
On Fri, 1 Aug 2025, 13:32 Jan-Erik Lärka, <[email protected]> wrote:

> Related to: https://answers.launchpad.net/sikuli/+question/658535
> This and related problems seem to have been reported before in various
> forums, but not addressed.
>
> Tesseract refuses to read any *.traineddata file when TESSDATA_PREFIX
> contains a national character.
>
> A normal Windows user would not be able to produce any other path than
> what the keyboard can output, thus UTF-8 encoding a string is out of the
> question.
>
> Tesseract interprets the national character and outputs another (ä -> õ),
> which indicates that the application converts the codepage from Windows
> (win1252) to DOS (ibm850). It accepts the same folder and files if one
> creates a symlink pointing to the same folder without any national
> character in the path.
> Tested national characters å, ä and ö (Å, Ä and Ö), but I guess more
> characters can be affected, as seen in the related issue.
>
> Steps 3 and 7 can be replaced by a call to tesseract instead.
>
> How to reproduce:
> 1) Start Command Line
> 2) SET TESSDATA_PREFIX=C:\bäst\tessdata\
> 3) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300
> -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o-
> "C:\bäst\test1.pdf"
> 4) Output (notice the altered character):
> Error opening data file C:\Users\bõst\tessdata\eng.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to your
> "tessdata" directory.
> Failed loading language 'eng' Tesseract couldn't load any languages!
> **** Unable to open the initial device, quitting.
> 5) Create a symlink to the same folder, to end up with C:\Symlink\tessdata
> 6) SET TESSDATA_PREFIX=C:\Symlink\tessdata\
> 7) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300
> -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o-
> "C:\bäst\test1.pdf"
> 8) Output (notice the same PDF path, which contains a national character (ä)):
> Testdata
> This is a test of Tesseract.
>
> I tried UTF-8, but as the output message indicates, it interprets that
> as well, as... something else. ├ñ in TESSDATA_PREFIX becomes +±.
> I also tried U+00E4, but that was not it.
> Should it be something like \u00e4 or perhaps \\u00e4 or even something
> else...?
>
> I get the same problem running tesseract directly, just as others have
> reported.
> The UTF-8/Unicode support present for paths needs some attention to
> produce the expected output.
>
> It would be most welcome if the UTF-8 path conversion was removed
> altogether.
>
> Note that Ghostscript itself, in the example above, handles the national
> character nicely.
>
> //Jan-Erik
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion visit
> https://groups.google.com/d/msgid/tesseract-ocr/1c398d62-8546-41d4-9ed6-83763a80a037n%40googlegroups.com

