Re: Tesseract OCR produces non-existing spaces in the middle of the words: how to change spacing tolarance?

2009-11-13 Thread Ray Smith

Yes, the spacing algorithm needs a total rewrite. The problem is that trying to be general makes it more difficult to get the typical case right. When text is justified in a narrow column, e.g. a newspaper, the space between letters and between words can vary from line to line, so it is difficult to t

Re: Tesseract OCR produces non-existing spaces in the middle of the words: how to change spacing tolarance?

2009-11-13 Thread patrickq

I have had the same experience getting spaces in many spots where none should exist. Since I have no idea how to navigate the many Tess variables, my approach has been to test and remove such spaces myself post-scan, based on the width & spacing of characters in the current word. Indeed italic or
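For anyone wanting to try the same kind of post-scan cleanup, here is a minimal sketch of the idea, not patrickq's actual code. It assumes you already have one horizontal bounding box `(left, right)` per recognized character, grouped by the words the OCR returned. The function name `merge_spurious_spaces` and the `gap_ratio` threshold are hypothetical and would need tuning per font.

```python
from statistics import median


def merge_spurious_spaces(words, gap_ratio=0.3):
    """Join adjacent OCR words whose gap is small relative to character width.

    words: list of (text, boxes) in reading order on one line, where boxes
           is a list of (left, right) pixel extents, one per character.
    gap_ratio: assumed tuning constant; a gap narrower than
               gap_ratio * median character width is treated as spurious.
    """
    widths = [right - left for _, boxes in words for left, right in boxes]
    if not widths:
        return [text for text, _ in words]
    threshold = gap_ratio * median(widths)

    merged = []
    prev_right = None  # right edge of the last character seen so far
    for text, boxes in words:
        if not boxes:
            # Nothing to measure, so keep this word separate.
            merged.append(text)
            prev_right = None
            continue
        gap = None if prev_right is None else boxes[0][0] - prev_right
        if merged and gap is not None and gap < threshold:
            merged[-1] += text  # gap too small to be a real word space
        else:
            merged.append(text)
        prev_right = boxes[-1][1]
    return merged
```

With evenly spaced characters 10 px wide and 2 px apart, a word split such as "HRISTO" + "VICH" would be rejoined, while a following word that starts 12 px later would stay separate. Overlapping italic characters give negative gaps, which also fall under the threshold.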

Re: Tesseract OCR produces non-existing spaces in the middle of the words: how to change spacing tolarance?

2009-11-13 Thread Svetlin Nakov

In fact, Tesseract constantly and consistently fails on italic uppercase fonts. In such fonts characters have low spacing (measured in vertical spacing) and in many cases even overlap. I tried to fix the source code with no success. It is not a matter of adjusting a few constants. It is a desi

Tesseract OCR produces non-existing spaces in the middle of the words: how to change spacing tolarance?

2009-11-12 Thread Svetlin Nakov

Hello colleagues, I have the following problem: after a successful training, during the OCR process Tesseract inserts spaces that do not exist in the text into the middle of some words, e.g. it splits the word "HRISTOVICH" into "HRISTO" + [space] + "VICH". In this particular example the word is