Re: [tesseract-ocr] TESSDATA_PREFIX doesn't work with national character(s)

Nikola Smolenski Fri, 01 Aug 2025 17:44:26 -0700

Out of curiosity, would it work if you try:

SET TESSDATA_PREFIX=C:\b„st\tessdata\
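
The reasoning behind that suggestion can be checked with a small Python sketch. It only does the byte arithmetic and makes no claim about what Tesseract does internally. The helper name `reinterpret` is hypothetical, and the code page names are Python's codec names for win1252 and ibm850.

```python
def reinterpret(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one code page, then decode the same bytes in another."""
    return text.encode(written_as).decode(read_as)


# ä stored as Windows-1252 (byte 0xE4) but read as CP850 comes out as õ,
# which matches the altered character in the reported error message.
as_reported = reinterpret("ä", "cp1252", "cp850")

# ä stored as UTF-8 (bytes 0xC3 0xA4) but read as CP850 comes out as ├ñ,
# which matches what the UTF-8 attempt looked like in the console.
utf8_attempt = reinterpret("ä", "utf-8", "cp850")

# The CP850 byte for ä (0x84) is the character „ in Windows-1252, so typing
# „ hands over the byte that a CP850 reader would see as ä.
suggested = reinterpret("ä", "cp850", "cp1252")
```

If the path is really being read as CP850 somewhere, the `„` variant should resolve to the folder with `ä`; if it does not, the conversion is happening in some other way.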



On Fri, 1 Aug 2025, 13:32 Jan-Erik Lärka, <[email protected]>
wrote:

> Related to: https://answers.launchpad.net/sikuli/+question/658535
> This and related problems seem to have been reported before in various
> forums, but not addressed.
>
> Tesseract refuses to read any *.traineddata file when TESSDATA_PREFIX
> contains a national character.
>
> A normal Windows user would not be able to produce any other path than
> what the keyboard can output, thus UTF-8 encoding a string is out of the
> question.
>
> Tesseract misinterprets the national character and outputs another one
> (ä -> õ), which indicates that the application converts from the Windows
> code page (win1252) to the DOS code page (ibm850). It accepts the same
> folder and files if one creates a symlink pointing to the same folder
> without any national character in the path.
> I tested the national characters å, ä and ö (Å, Ä and Ö), but I guess
> more characters can be affected, as seen in the related issue.
>
> Steps 3 and 7 can be replaced by a direct call to tesseract instead.
>
> How to reproduce:
> 1) Start Command Line
> 2) SET TESSDATA_PREFIX=C:\bäst\tessdata\
> 3) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300
> -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o-
> "C:\bäst\test1.pdf"
> 4) Output (notice the altered character):
> Error opening data file C:\Users\bõst\tessdata\eng.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to your
> "tessdata" directory.
> Failed loading language 'eng' Tesseract couldn't load any languages!
> **** Unable to open the initial device, quitting.
> 5) Create a symlink to the same folder, ending up with C:\Symlink\tessdata
> 6) SET TESSDATA_PREFIX=C:\Symlink\tessdata\
> 7) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300
> -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o-
> "C:\bäst\test1.pdf"
> 8) Output (notice that the same input path still contains the national
> character ä):
> Testdata
> This is a test of Tesseract.
>
> I tried UTF-8, but as the output message indicates, that gets converted
> as well, into something else: ├ñ in TESSDATA_PREFIX becomes +±.
> I also tried U+00E4, but that was not it.
> Should it be something like \u00e4 or perhaps \\u00e4 or even something
> else?
>
> I get the same problem when running tesseract directly, just as others
> have reported.
> The UTF-8/Unicode support present for paths needs some attention to
> produce the expected output.
>
> It would be most welcome if the UTF-8 path conversion were removed
> altogether.
>
> Note that Ghostscript itself, in the example above, handles the national
> character nicely.
>
> //Jan-Erik
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion visit
> https://groups.google.com/d/msgid/tesseract-ocr/1c398d62-8546-41d4-9ed6-83763a80a037n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/1c398d62-8546-41d4-9ed6-83763a80a037n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

