Dictionary issues

Jonathan Thu, 03 Mar 2011 09:21:35 -0800

Hi all,

I'm working on a project that involves detecting text in street-level
images. I have already written code that extracts text areas from my
images.


I work with Tesseract 3.0. First of all, I tried running Tesseract on
full images (1080x1920), just to see what results I could get.
Obviously, because of trees, fences, walls, etc., Tesseract produces
many false recognitions, but some text is also recognized well. So, to
improve the recognition, I give Tesseract only the text areas
segmented by my code, hoping that the recognition will be good even
though the scenes are very difficult.
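
For reference, this is roughly how each segmented area could be passed
to Tesseract from a script. It is only a minimal sketch: the helper
names build_tesseract_command and recognize_crop and the file names are
hypothetical, and it assumes the tesseract executable is on the PATH
and can read the image format of the crop.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="fra"):
    """Return the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def recognize_crop(image_path, output_base, lang="fra"):
    """Run Tesseract on one cropped text area and return the recognized text."""
    subprocess.check_call(build_tesseract_command(image_path, output_base, lang))
    # Tesseract writes its result to <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read().strip()


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(build_tesseract_command("box_0006.png", "box_0006"))
```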

I know that when the image is too complicated (not enough contrast
between text and background, many shadows...) recognition is really
difficult and may not give good results. However, even in some
supposedly very simple cases like this one (black text on a white
background with a slight blur):
http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-00001_0001312_box_0006.png?gda=4bbSHWQAAAAq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobhowICgXY9oBdZxkhoGyvOFXq71KIRN2DRDZ98DIdT53NzgFmQudIVZfn2evkHEao
Tesseract recognizes:
http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-00001_0001312_box_0006_boxes.png?gda=KSimu2oAAAAq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobOWEMBOZDXT0mTiVSy6rk8qwfOToRrNOWJtPPKSAn4D797daDQaep90o7AOpSKHW0

I do not understand this result. I run Tesseract with the option
"-l fra" for the French language. The word "Cloison" should be in the
French dictionary, so I do not understand why Tesseract recognizes a
"0" instead of an "o".
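
To make the expectation concrete, here is the kind of dictionary check
I have in mind, written as a post-processing step on the OCR output.
This is a swapped-in technique for illustration, not a description of
what Tesseract does internally. The confusion table CONFUSABLE, the
function name dictionary_candidates and the tiny word set are all made
up for the example.

```python
from itertools import product

# Hypothetical confusion table: OCR character -> plausible alternatives.
CONFUSABLE = {"0": ["0", "o"], "1": ["1", "l"]}


def dictionary_candidates(word, dictionary):
    """Return spellings of `word` that exist in `dictionary` once
    confusable characters are swapped for their alternatives."""
    options = [CONFUSABLE.get(ch, [ch]) for ch in word]
    found = []
    for combo in product(*options):
        candidate = "".join(combo)
        # The dictionary is assumed to hold lowercase words.
        if candidate.lower() in dictionary and candidate not in found:
            found.append(candidate)
    return found


words = {"cloison", "faux", "plafonds"}
print(dictionary_candidates("Cl0ison", words))  # prints ['Cloison']
```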

Does the dictionary actually play a role in the recognition? It seems
clear that "0" and "o" have the same shape-based confidence value, but
the dictionary should then favor "o" over "0", or am I wrong?

In addition, Tesseract does not seem to take into account the relative
scale of two adjacent boxes. It recognizes "ll" for the segmented
quotation mark (see images above), while it correctly recognizes the
'i' just before the "ll".

I also tried adding lines to the file "fra.unicharambigs" to correct
the false recognition of 'n' as "l'I" (line in unicharambigs: 3 l'I 1
n 0) and of 'm' as "ITI" (line in unicharambigs: 3 ITI 1 m 0). I ran
combine_tessdata to make a new "fra.traineddata", but nothing
changed.
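
One thing worth double-checking is the field layout of those lines. As
far as I understand the training documentation, the fields in
unicharambigs are separated by tabs and the characters inside a field
by spaces. This is an assumption that should be verified against the
documentation for the installed version. The hypothetical helper
ambig_line below only formats a line under that assumption, treating
each character of the string as one entry of the unicharset.

```python
def ambig_line(wrong, right, mandatory=False):
    """Format one ambiguity entry, ASSUMING tab-separated fields with
    space-separated characters inside each field."""
    wrong_chars = list(wrong)
    right_chars = list(right)
    fields = [
        str(len(wrong_chars)),
        " ".join(wrong_chars),
        str(len(right_chars)),
        " ".join(right_chars),
        "1" if mandatory else "0",  # last field: the type indicator
    ]
    return "\t".join(fields)


# repr() makes the tab characters visible as \t
print(repr(ambig_line("l'I", "n")))
print(repr(ambig_line("ITI", "m")))
```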

So I tried to "help" Tesseract by giving it our own segmented text
image. In this case the blur is removed and the recognition gives
better results, as you can see in this image:
http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-00001_0001312_box_0006_boxes+(2).png?gda=_sgwA3IAAAAq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobOWEMBOZDXT0mTiVSy6rk8juef4gIssVZMUVd4ovTnHRV4u3aa4iAIyYQIqbG9naPgh6o8ccLBvP6Chud5KMzIQ
or this one too:
http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-00001_0001312_box_0005_boxes.png?gda=pgvY8moAAAAq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobIvfRTlYBT-BD2NUWBDUNMqwfOToRrNOWJtPPKSAn4D797daDQaep90o7AOpSKHW0

I guess that "Faux" and "plafonds" (and maybe even "Faux-plafonds")
are present in the basic Tesseract dictionary, since the recognition
is good with the original Tesseract.
However, if I use a new dictionary that I created from a list of
about 350k French words, using wordlist2dawg to create "fra.word-dawg"
and rebuilding "fra.traineddata", and then run Tesseract on the same
image, the recognition is "Foux-plufonds". Neither "Foux" nor
"plufonds" is in my list, whereas "Faux-plafonds", "Faux" and
"plafonds" are.
If you have any idea that could help me with this too, I would be very
grateful.
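
As a sanity check before running wordlist2dawg, the word list itself
can be inspected for the expected entries, including case differences.
I have not verified whether the dictionary lookup is case-sensitive, so
this is only a sketch of a check on the input file. The function name
check_wordlist and the sample data are hypothetical.

```python
def check_wordlist(lines, expected):
    """Report, for each expected word, whether it appears in the word
    list exactly, only with a different case, or not at all."""
    exact = set()
    lowered = set()
    for line in lines:
        word = line.strip()
        if word:
            exact.add(word)
            lowered.add(word.lower())
    report = {}
    for word in expected:
        if word in exact:
            report[word] = "present"
        elif word.lower() in lowered:
            report[word] = "present with different case"
        else:
            report[word] = "missing"
    return report


# In-memory sample standing in for the real 350k-word file.
sample = ["faux\n", "plafonds\n", "Faux-plafonds\n"]
print(check_wordlist(sample, ["Faux", "plafonds", "Faux-plafonds", "Foux"]))
```

With the real list, the same function can be given an open file object
(opened with the encoding the list was saved in) instead of the sample.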

Next, I will try feeding Tesseract one character image at a time to
simplify the recognition further, but if you have any other ideas to
improve it, I am definitely interested.

Thank you in advance for any help you can provide,
Jonathan.
