1. Try the latest version.
2. Try experimenting with the page segmentation mode (psm). For example,

   tesseract 20201002.png - --psm 11 --dpi 300

   produces:

   8 27 26 10 04 03 01 N29 19 16 14 09 03 131 27 25 18 12 03 N21 18 16 13 07 04 N32 232112 10 07 N 36 34 30 27 21 01 X35 3417 13 10 08 N36 33 29 28 14 09 R 33 32 31 21 06 01 - oe ———— —— — ——— —— a = — R 37 27 19 09 05 03 -——— Fra anny 156136 -—— 3198(19): ‘on iam mn 10:52:25 28.11.19 1 09 .. .

Custom image segmentation would help too (and then OCR each "cell" individually).

Zdenko

On Sat, 3 Oct 2020 at 7:06, H Brenner <hyltonbren...@gmail.com> wrote:

> Hi,
>
> I have tesseract 3.02 on a Windows 10 PC.
>
> I am trying to recognise text on a form scanned with a camera that has
> numbers mostly in tabular form with a small amount of Hebrew characters
> plus one English "graphical" word. I processed the photo to remove a pink
> background pattern, and to enhance the text in the image (the original,
> minus the pink pattern, produced the same results).
>
> [image: 3198Rfat.png]
>
> The Hebrew text on the bottom 2 lines is cut off on the right, but this
> does not matter to me.
>
> Only the numbers are of interest to me in the output.
>
> I am running tesseract in Python using the pytesseract wrapper, and I am
> running the following commands:
>
> - Imaj=Image.open(ImgPath) # ImgPath is the full path to the .png file.
> - print('\n\n','v'*20,'\n', pytesseract.image_to_string(Imaj),'\n','^'*20,'\n\n') # use eng default
>
> I believe this corresponds to the command line:
>
> - tesseract ImgPath out (I used the actual path)
>
> The output that I get is the following:
>
> - 7547512723 2
> -
> - 1334718913
> - 0000000000
> - 3927010465.
> - 4483273819..
> - 0.|..1|.|.1ln/_1|.7_n/.01
> - 0556107919..
> - 1|11n/Tln/_nJ110._O...|__
> - 6978344327..
> - n/..|9._..l9._Q.:1Jn.o3n/___
> - _/0._1|.|9._n0EunD3./:
> - n/L232333333““
> -
> - A —:1 qnnwn N
> -
> - 156138
> -
> - ::§1§§?13:?76fi-fi333ii‘ifi1
> - 10:52:25 29.11.19 :1 ma‘
>
> Most of it is meaningless gibberish to me.
> Only the highlighted text is recognised correctly.
>
> When I ran it with the Hebrew language selected, it produced similar
> results, but with *some* of the Hebrew characters and only the "156138"
> recognised correctly.
>
> Running tesseract manually (English) in a 'CMD' window produced the
> attached file 'out.txt'.
>
> I suspect that the font used in the form is the problem: the form was not
> printed on a normal Windows, Mac or Linux computer.
>
> Which fonts were used to create heb.traineddata? Is there a way for me to
> display them?
>
> Do I have to train tesseract with the font in the form?
>
> Any help will be appreciated!
>
> Thanks!
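
For anyone who wants to try both suggestions from the reply (a different --psm value, and OCR of each "cell" on its own) through pytesseract, here is a minimal sketch. It assumes the numbers sit on a regular grid, which may not hold for this form. The helper names (build_config, grid_boxes, ocr_cells) and the row and column counts are hypothetical and would need adjusting to the actual layout. Using --psm 7 (single text line) for each crop is also just a starting guess.

```python
from typing import List, Tuple

# (left, upper, right, lower), the box format that PIL's Image.crop expects
Box = Tuple[int, int, int, int]


def build_config(psm: int = 11, dpi: int = 300) -> str:
    """Build the extra-argument string handed to tesseract by pytesseract."""
    return f"--psm {psm} --dpi {dpi}"


def grid_boxes(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height area into rows x cols cells, in row-major order.

    Assumption: the table is a regular grid. A real form will probably
    need margins or detected line positions instead of equal division.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            upper = r * height // rows
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


def ocr_cells(image_path: str, rows: int, cols: int) -> List[str]:
    """OCR every grid cell separately.

    Needs Pillow, pytesseract and a tesseract binary to be installed.
    """
    # Imported lazily so the pure helpers above work without these packages.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    config = build_config(psm=7)  # psm 7: treat each crop as a single text line
    return [
        pytesseract.image_to_string(image.crop(box), config=config).strip()
        for box in grid_boxes(image.width, image.height, rows, cols)
    ]
```

The whole-page call with sparse-text mode would be pytesseract.image_to_string(Image.open(path), config=build_config()), which passes the same --psm 11 --dpi 300 arguments as the command line in the reply. The per-cell call would be something like ocr_cells(path, rows, cols), with rows and cols counted from the form.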