Re: Tesseract Training

Dmitry Silaev Mon, 17 Jan 2011 04:21:23 -0800

Dear Sochenda,

I've checked the Unicode table range you sent and now I see what the
problem is. I agree that in such an "algorithmic" writing system (in contrast
to simpler "positional" systems like, say, Roman or Cyrillic) pre- and
post-processing stages are inevitable.


I'd suggest making special hand-crafted or generated training images. In
these images you would properly space out all the joint character
combinations, as well as the character components that can make up Khmer
characters. Then you would edit the resulting box files to assign codes
according to your coding system. Repeat this process as many times as
needed to reach a sample count of 15-20 for every glyph.
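
To check the sample count, something like the following could be run over
the edited box files. This is only a minimal sketch: it assumes each box
file line starts with the glyph code followed by whitespace-separated
coordinates (left, bottom, right, top, and optionally a page number), and
the glyph codes used in the example ("KHA", "sMO") are made up for
illustration, not part of any real coding system.

```python
from collections import Counter


def count_samples(box_lines):
    """Count how many boxes each glyph code has.

    Assumes each line looks like: <code> <left> <bottom> <right> <top> [<page>]
    """
    counts = Counter()
    for line in box_lines:
        parts = line.split()
        if len(parts) < 5:
            # Blank or malformed line: skip it rather than guess.
            continue
        counts[parts[0]] += 1
    return counts


def undersampled(counts, minimum=15):
    """Return the glyph codes that have fewer than `minimum` samples."""
    return sorted(code for code, n in counts.items() if n < minimum)


# Hypothetical box file content: 15 samples of one code, 3 of another.
lines = ["KHA 10 10 30 40 0"] * 15 + ["sMO 12 2 28 12 0"] * 3
counts = count_samples(lines)
print(undersampled(counts))
```

For real data you would pass in the lines of every box file, for example by
iterating over an opened file, and keep adding training images until the
list of undersampled codes is empty.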

At the recognition stage, if Tess is trained properly, overlapping bounding
boxes are not a problem. In my experience it is very inventive in
character segmentation even when bounding boxes overlap, so I expect you
will have no severe difficulties with partially overhanging or underlying
glyphs.

Your post-processor should be able to "decode" the recognition output using
an algorithmic approach to form valid Unicode characters. You can also use
Khmer bigram or trigram statistics for error correction. You may want to
experiment with Tess's dictionary facility, but I doubt it would be
helpful in your case.
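
As an illustration of the "decode" step, here is a minimal sketch. It
assumes the recognizer emits glyph codes in visual (left-to-right) order,
and that the post-processor must produce Unicode logical order, where a
subscript consonant is stored as COENG (U+17D2) plus the consonant, and a
vowel sign drawn to the left of the base is stored after the whole
consonant cluster. The code names and the tiny table are hypothetical; a
real table would cover the full coding system.

```python
# Hypothetical glyph codes mapped to (kind, Unicode text).
GLYPHS = {
    "KHA": ("base", "\u1781"),        # KHMER LETTER KHA
    "RO": ("base", "\u179A"),         # KHMER LETTER RO
    "sMO": ("sub", "\u17D2\u1798"),   # subscript MO = COENG + MO
    "vAE": ("pre", "\u17C2"),         # VOWEL SIGN AE, drawn left of the base
}


def decode(tokens):
    """Turn visually ordered glyph codes into logically ordered Unicode."""
    out = []
    pending = ""  # left-side vowel seen before its base consonant
    held = ""     # left-side vowel waiting for its cluster to end
    for tok in tokens:
        kind, text = GLYPHS[tok]
        if kind == "pre":
            # A new cluster starts: flush the previous cluster's vowel.
            out.append(held)
            held = ""
            pending = text
        elif kind == "base":
            # A new base also closes the previous cluster.
            out.append(held)
            out.append(text)
            held, pending = pending, ""
        else:
            # Subscripts attach directly after the base consonant.
            out.append(text)
    out.append(held)
    out.append(pending)
    return "".join(out)


# Visual order: AE vowel, KHA, subscript MO, RO.
word = decode(["vAE", "KHA", "sMO", "RO"])
print(word)
```

The sketch does not handle other vowel positions or misrecognized codes;
those, along with any bigram or trigram correction, would be layered on
top of the same table-driven idea.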

Dmitry
