Hi, I have tesseract 3.02 on a Windows 10 PC.
I am trying to recognise text on a form photographed with a camera. The form contains mostly numbers in tabular form, a small amount of Hebrew text, and one English "graphical" word. I processed the photo to remove a pink background pattern and to enhance the text (the original, minus the pink pattern, produced the same results).

[image: 3198Rfat.png]

The Hebrew text on the bottom two lines is cut off on the right, but this does not matter to me. Only the numbers in the output are of interest to me.

I am running tesseract from Python through the pytesseract wrapper, with the following commands:

    from PIL import Image
    import pytesseract
    Imaj = Image.open(ImgPath)  # ImgPath is the full path to the .png file
    print('\n\n', 'v'*20, '\n', pytesseract.image_to_string(Imaj), '\n', '^'*20, '\n\n')  # uses the default language (eng)

I believe this corresponds to the command line:

    tesseract ImgPath out

(I used the actual path.)

The output that I get is the following:

    7547512723 2

    1334718913
    0000000000
    3927010465.
    4483273819..
    0.|..1|.|.1ln/_1|.7_n/.01
    0556107919..
    1|11n/Tln/_nJ110._O...|__
    6978344327..
    n/..|9._..l9._Q.:1Jn.o3n/___
    _/0._1|.|9._n0EunD3./:
    n/L232333333““

    A —:1 qnnwn N

    156138

    ::§1§§?13:?76fi-fi333ii‘ifi1
    10:52:25 29.11.19 :1 ma‘

Most of it is meaningless gibberish to me. Only the highlighted text is recognised correctly.

When I ran it with the Hebrew language selected, it produced similar results, but with only some of the Hebrew characters and the "156138" recognised correctly.

Running tesseract manually (English) in a CMD window produced the attached file 'out.txt'.

I suspect that the font used in the form is the problem: the form was not printed on a normal Windows, Mac or Linux computer.

Which fonts were used to create heb.traineddata? Is there a way for me to display them? Do I have to train tesseract with the font in the form?

Any help will be appreciated!

Thanks!
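Since only the numbers matter here, one possible stopgap is to post-filter the OCR text and keep just the lines that look like clean digit runs. The following is a minimal sketch under stated assumptions, not a fix for the recognition itself: the function name `extract_numbers` and the `min_digits` threshold are hypothetical, and the rule "a usable line contains only digits, spaces and dots" is a guess based on the output shown above. The values it returns still need to be checked against the form.

```python
import re

# Assumption: a number of interest is a run of at least `min_digits` digits
# on a line that contains nothing but digits, spaces and stray dots.
DIGIT_RUN = re.compile(r"\d+")
CLEAN_LINE = re.compile(r"[\d .]+")


def extract_numbers(ocr_text, min_digits=6):
    """Return digit runs from lines made up only of digits, spaces and dots."""
    numbers = []
    for line in ocr_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and lines containing any other character.
        if not stripped or not CLEAN_LINE.fullmatch(stripped):
            continue
        # Keep only runs that are long enough; short strays like "2" are dropped.
        numbers.extend(run for run in DIGIT_RUN.findall(stripped)
                       if len(run) >= min_digits)
    return numbers


if __name__ == "__main__":
    # A few lines taken from the output quoted in this post.
    sample = (
        "7547512723 2\n"
        "\n"
        "1334718913\n"
        "3927010465.\n"
        "0.|..1|.|.1ln/_1|.7_n/.01\n"
        "156138\n"
        "10:52:25 29.11.19 :1 ma\n"
    )
    print(extract_numbers(sample))
    # prints ['7547512723', '1334718913', '3927010465', '156138']
```

In practice the argument would be the string returned by `pytesseract.image_to_string(Imaj)`. Rows that were misread entirely are simply dropped, so this only separates the usable lines from the gibberish.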
Attachment (out.txt):

    7547512723 2

    1334718913
    0000000000
    3927010465.
    4483273819..
    0.|..1|.|.1ln/_1|.7_n/.01
    0556107919..
    1|11n/Tln/_nJ110._O...|__
    6978344327..
    n/..|9._..l9._Q.:1Jn.o3n/___
    _/0._1|.|9._n0EunD3./:
    n/L232333333““

    A —:1 qnnwn N

    156138

    ::§1§§?13:?76fi-fi333ii‘ifi1
    10:52:25 29.11.19 :1 ma‘