Hi,

I have tesseract 3.02 on a Windows 10 PC.

I am trying to recognise text on a form photographed with a camera. The form 
contains mostly numbers in tabular form, a small amount of Hebrew text, and 
one English "graphical" word. I processed the photo to remove a pink 
background pattern and to enhance the text (the original, minus the pink 
pattern, produced the same results).

[image: 3198Rfat.png]

The Hebrew text on the bottom 2 lines is cut off on the right, but this 
does not matter to me.

Only the numbers are of interest to me in the output.

I am running tesseract in Python using the pytesseract wrapper, and I am 
running the following command:

   - from PIL import Image
   - import pytesseract
   - Imaj = Image.open(ImgPath)  # ImgPath is the full path to the .png file
   - print('\n\n', 'v'*20, '\n',
   pytesseract.image_to_string(Imaj), '\n', '^'*20, '\n\n')  # default language: eng

I believe this corresponds to the command-line:

   - tesseract  ImgPath  out    (I used the actual path)
   
The output that I get is the following:

   -  7547512723 2
   - 
   - 1334718913
   - 0000000000
   - 3927010465.
   - 4483273819..
   - 0.|..1|.|.1ln/_1|.7_n/.01
   - 0556107919..
   - 1|11n/Tln/_nJ110._O...|__
   - 6978344327..
   - n/..|9._..l9._Q.:1Jn.o3n/___
   - _/0._1|.|9._n0EunD3./:
   - n/L232333333““
   - 
   -  A —:1 qnnwn N
   - 
   - 156138
   - 
   - ::§1§§?13:?76fi-fi333ii‘ifi1
   - 10:52:25 29.11.19 :1 ma‘

Most of it is meaningless gibberish to me. Only the highlighted text is 
recognised correctly.
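
Since the clean numeric rows are the only part I need, one stopgap is to filter the raw OCR text and keep only lines made up of digits. This is a minimal sketch, not a fix for the recognition itself; the function name `extract_numbers` and the `min_digits` threshold are my own assumptions for illustration.

```python
import re

# A line counts as numeric if, after dropping trailing dots,
# it contains only digits and spaces.
_NUMERIC_LINE = re.compile(r"^[\d ]+$")


def extract_numbers(ocr_text, min_digits=6):
    """Return digit runs of at least min_digits taken from purely numeric lines."""
    numbers = []
    for raw in ocr_text.splitlines():
        # Remove surrounding whitespace and the stray trailing dots seen in the output.
        line = raw.strip().rstrip(".").strip()
        if not line or not _NUMERIC_LINE.match(line):
            continue  # skip empty lines and lines containing non-digit noise
        # Keep only tokens long enough to be a real number from the form.
        numbers.extend(tok for tok in line.split() if len(tok) >= min_digits)
    return numbers
```

For example, `extract_numbers(pytesseract.image_to_string(Imaj))` would drop the gibberish rows, but it cannot recover numbers that were misread in the first place.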

When I ran it with the Hebrew language selected, it produced similar 
results, but with *some* of the Hebrew characters and only the "156138" 
recognised correctly.

Running tesseract manually (English) in a 'CMD' window produced the 
attached file 'out.txt'.

I suspect that the font used in the form is the problem: the form was not 
printed on a normal Windows, Mac or Linux computer.

Which fonts were used to create heb.traineddata? Is there a way for me to 
display them?

Do I have to train tesseract with the font in the form?
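
Short of training, one thing that may be worth trying first, since only the digits matter, is restricting recognition to digits and telling Tesseract to treat the image as a single block of text. The sketch below is untested on this form and rests on assumptions: that the installed build accepts `-c tessedit_char_whitelist=...` on the command line, and that page segmentation mode 6 suits the layout. The helper names `build_digits_config` and `ocr_digits` are hypothetical. Tesseract 3.x spells the flag `-psm`, while 4.x and later use `--psm`.

```python
def build_digits_config(psm=6, legacy_flags=True):
    """Build a Tesseract config string that limits output to the digits 0-9."""
    # Tesseract 3.x uses "-psm"; 4.x and later use "--psm".
    psm_flag = "-psm" if legacy_flags else "--psm"
    return "{} {} -c tessedit_char_whitelist=0123456789".format(psm_flag, psm)


def ocr_digits(img_path, psm=6, legacy_flags=True):
    """Run pytesseract on img_path with the digits-only config (assumed to be supported)."""
    # Imported here so build_digits_config can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(img_path) as img:
        return pytesseract.image_to_string(
            img, config=build_digits_config(psm, legacy_flags)
        )
```

Calling `ocr_digits(ImgPath)` would then replace the plain `image_to_string(Imaj)` call. Whether this helps with an unusual font is an open question, which is why the training question still stands.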

Any help will be appreciated!

Thanks!

