Re: [tesseract-ocr] how to see which fonts are used in .traineddata files

Zdenko Podobny Sat, 03 Oct 2020 02:22:04 -0700

1. Try the latest version of tesseract.
2. Try playing with psm, e.g. tesseract 20201002.png - --psm 11 --dpi 300
produces:


8 27 26 10 04 03 01

N29 19 16 14 09 03

131 27 25 18 12 03

N21 18 16 13 07 04

N32 232112 10 07

N 36 34 30 27 21 01

X35 3417 13 10 08

N36 33 29 28 14 09

R 33 32 31 21 06 01

- oe ————

—— — ——— —— a = —

R 37 27 19 09 05 03

-———

Fra anny

156136

-——

3198(19): ‘on iam mn

10:52:25 28.11.19 1 09
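
Since the original question goes through pytesseract, the same options from point 2 can be passed via the wrapper's config argument. A minimal sketch, not tested on this image; the helper names build_config and ocr_sparse are made up for illustration, and the `--psm` spelling assumes a 4.x or newer tesseract (3.x builds use `-psm`):

```python
def build_config(psm: int = 11, dpi: int = 300) -> str:
    """Build the extra command-line options handed to tesseract."""
    return f"--psm {psm} --dpi {dpi}"


def ocr_sparse(image_path: str, psm: int = 11, dpi: int = 300) -> str:
    """Run tesseract on one image with the given page segmentation mode."""
    # Imported lazily so build_config works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        # config is appended to the tesseract command line by pytesseract.
        return pytesseract.image_to_string(img, config=build_config(psm, dpi))
```

Something like print(ocr_sparse(ImgPath)) should then correspond to the command line above.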


... custom image segmentation would help too (and then OCR each "cell"
individually).
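
A rough sketch of that per-cell idea, assuming the image has already been cropped to just the table and that the cells sit on an evenly spaced grid (a real form will probably need line detection instead). The names grid_boxes and ocr_cells are hypothetical, and `--psm 7` (treat the image as a single text line) assumes tesseract 4.x or newer:

```python
from typing import List, Tuple

# (left, upper, right, lower), the box layout PIL's Image.crop() expects
Box = Tuple[int, int, int, int]


def grid_boxes(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height area into rows x cols boxes, row by row."""
    boxes = []
    for r in range(rows):
        upper, lower = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            boxes.append((left, upper, right, lower))
    return boxes


def ocr_cells(image_path: str, rows: int, cols: int) -> List[List[str]]:
    """OCR every grid cell separately and return the text as rows of cells."""
    # Imported lazily so grid_boxes works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        boxes = grid_boxes(img.width, img.height, rows, cols)
        texts = [
            pytesseract.image_to_string(img.crop(box), config="--psm 7").strip()
            for box in boxes
        ]
    # Regroup the flat list of cell texts into table rows.
    return [texts[r * cols:(r + 1) * cols] for r in range(rows)]
```

For example, ocr_cells("table.png", rows=12, cols=7), where table.png and the row/column counts are placeholders for your own crop.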

Zdenko


On Sat, 3 Oct 2020 at 7:06, H Brenner <hyltonbren...@gmail.com> wrote:

> Hi,
>
> I have tesseract 3.02 on a Windows 10 PC.
>
> I am trying to recognise text on a form scanned with a camera that has
> numbers mostly in tabular form with a small amount of Hebrew characters
> plus one English "graphical" word. I processed the photo to remove a pink
> background pattern, and to enhance the text in the image (the original -
> minus the pink pattern - produced the same results).
>
> [image: 3198Rfat.png]
>
> The Hebrew text on the bottom 2 lines is cut off on the right, but this
> does not matter to me.
>
> Only the numbers are of interest to me in the output.
>
> I am running tesseract in Python using the pytesseract wrapper, and I am
> running the following command:
>
>    - Imaj = Image.open(ImgPath)  # ImgPath is the full path to the .png file.
>    - print('\n\n', 'v'*20, '\n', pytesseract.image_to_string(Imaj), '\n', '^'*20, '\n\n')  # use eng default
>
> I believe this corresponds to the command-line:
>
>    - tesseract  ImgPath  out    (I used the actual path)
>
> The output that I get is the following:
>
>    -  7547512723 2
>    -
>    - 1334718913
>    - 0000000000
>    - 3927010465.
>    - 4483273819..
>    - 0.|..1|.|.1ln/_1|.7_n/.01
>    - 0556107919..
>    - 1|11n/Tln/_nJ110._O...|__
>    - 6978344327..
>    - n/..|9._..l9._Q.:1Jn.o3n/___
>    - _/0._1|.|9._n0EunD3./:
>    - n/L232333333““
>    -
>    -  A —:1 qnnwn N
>    -
>    - 156138
>    -
>    - ::§1§§?13:?76ﬁ-ﬁ333ii‘iﬁ1
>    - 10:52:25 29.11.19 :1 ma‘
>
> Most of it is meaningless gibberish to me. Only the highlighted text is
> recognised correctly.
>
> When I ran it with the Hebrew language selected, it produced similar
> results, but with *some* of the Hebrew characters and only the "156138"
> recognised correctly.
>
> Running tesseract manually (English) in a 'CMD' window produced the
> attached file 'out.txt'.
>
> I suspect that the font used in the form is the problem - the form was not
> printed on a normal Windows, Mac or Linux computer.
>
> Which fonts were used to create heb.traineddata? Is there a way for me to
> display them?
>
> Do I have to train tesseract with the font in the form?
>
> Any help will be appreciated!
>
> Thanks!
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a6602b5e-307e-406d-8650-510e8c2225e6n%40googlegroups.com.
>

