I have a similar problem: I am trying to apply user patterns, such as
" \d*>d*+ \d*> ", to minimise errors when I convert the OCR field of payment
slips read on a flatbed scanner. I have a nice gtk script that uses
scanimage, imagemagick and tesseract, but tesseract is making too many
stupid errors (such as converting a plain 6 into an o with an acute accent,
ó). The script can post-process and deal with such simple errors, but there
are too many cases that it cannot deal with; user patterns would be ideal.
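
For illustration only, the kind of post-processing described above might look roughly like the following Python sketch. It is not the actual gtk script: the confusion table, the field layout in FIELD_RE and the function name clean_ocr_field are all assumptions made up for this example.

```python
import re

# Hypothetical table of common OCR misreads -> intended digit.
# "\u00f3" is the o-with-acute-accent that tesseract produced for a 6.
CONFUSIONS = str.maketrans({
    "\u00f3": "6",
    "o": "0",
    "O": "0",
    "l": "1",
    "I": "1",
    "S": "5",
    "B": "8",
})

# Assumed field layout: digits, '>', digits, '+', one space, digits, '>'.
FIELD_RE = re.compile(r"\d+>\d+\+ \d+>")


def clean_ocr_field(raw):
    """Apply the substitutions, then accept the result only if it matches.

    Returns the corrected string, or None when the line still does not
    look like a valid field after the simple substitutions.
    """
    candidate = raw.strip().translate(CONFUSIONS)
    return candidate if FIELD_RE.fullmatch(candidate) else None
```

A fixed table like this only repairs one-to-one character confusions. Dropped or merged characters fall through to None, which is where constraining the recognizer itself with user patterns would help.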

This payment-slip application cannot provide 4 leading concrete
characters. But I only read 1 or 2 lines, so speed (the reason given for
having the 4-character rule) is not an argument.

I saw a remark that dropping that rule would require setting
kSaneNumConcreteChars to 0. Is this parameter configurable, or is it compiled
into tesseract? Can I "patch" this into my tesseract?
