Hi all, I'm working on a project that involves detecting text in street-level images. I have already written code that extracts text areas from my images.
I work with Tesseract 3.0. First of all, I tried running Tesseract on full images (1080x1920), just to see what results I could get. Obviously, because of trees, fences, walls, etc., Tesseract produces a lot of false recognitions, but some text is also recognized well. So, to improve the recognition, I give Tesseract only the text areas segmented by my code, hoping that recognition will be good even though the scenes are very difficult.

I know that when the image is too complicated (not enough contrast between text and background, many shadows...) detection is really difficult and may not give good results. However, even in some supposedly very simple cases like this one (black text on a white background with a slight blur):

http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-00001_0001312_box_0006.png?gda=4bbSHWQAAAAq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobhowICgXY9oBdZxkhoGyvOFXq71KIRN2DRDZ98DIdT53NzgFmQudIVZfn2evkHEao

Tesseract recognizes:

http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-00001_0001312_box_0006_boxes.png?gda=KSimu2oAAAAq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobOWEMBOZDXT0mTiVSy6rk8qwfOToRrNOWJtPPKSAn4D797daDQaep90o7AOpSKHW0

I do not understand this result. I run Tesseract with the option "-l fra" for the French language. The word "Cloison" should exist in the French dictionary, so I do not understand why Tesseract recognizes a "0" instead of an "o". Does the dictionary actually play a role in the recognition? It is clear that "0" and "o" have the same shape-based confidence value, but the dictionary should then favor "o" over "0", or am I wrong?

In addition, Tesseract does not seem to take into account the relative scale of two adjacent boxes. It recognizes "ll" for the segmented quotation mark (see the images above), while it correctly recognizes the 'i' just before the "ll".
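To make the dictionary question concrete, here is a minimal Python sketch of the kind of dictionary tie-break described above, applied as a post-processing step on the OCR output. It is not Tesseract's internal algorithm. The function names, the confusion table, the sample word set, and the input string "Cl0ison" are all hypothetical and chosen only for illustration; only the "0" versus "o" confusion comes from the example above.

```python
from itertools import product

# Illustrative confusion table (an assumption): only "0" vs "o" comes from
# the example in this post; the "1" entry is added for illustration.
CONFUSABLE = {"0": "oO", "1": "lI"}


def candidates(word, confusable=CONFUSABLE):
    """Yield every spelling obtained by swapping confusable characters."""
    # For each character, the options are the character itself followed by
    # its look-alikes, so the unchanged word is always the first candidate.
    options = [ch + confusable.get(ch, "") for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def correct(word, dictionary, confusable=CONFUSABLE):
    """Return a dictionary spelling of `word` if one exists, else `word`."""
    lowered = {w.lower() for w in dictionary}
    for cand in candidates(word, confusable):
        if cand.lower() in lowered:
            return cand
    return word


# Hypothetical stand-in for the French word list.
words = {"cloison", "faux", "plafonds"}
print(correct("Cl0ison", words))  # prints "Cloison"
```

A check like this only helps when the shape classifier's mistake is one of the listed confusions; it would not repair an error such as "ll" for a quotation mark.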
I also tried adding lines to the file "fra.unicharambigs" to correct the false recognition of 'n' as "l'I" (line in unicharambigs: 3 l'I 1 n 0) and of 'm' as "ITI" (line in unicharambigs: 3 ITI 1 m 0). I ran combine_tessdata to make a new "fra.traineddata", but nothing changed.

So I tried to "help" Tesseract by giving it our own segmented text image. In this case the blur is removed, and the recognition gives better results, as you can see in this image:

http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-00001_0001312_box_0006_boxes+(2).png?gda=_sgwA3IAAAAq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobOWEMBOZDXT0mTiVSy6rk8juef4gIssVZMUVd4ovTnHRV4u3aa4iAIyYQIqbG9naPgh6o8ccLBvP6Chud5KMzIQ

or in this one:

http://tesseract-ocr.googlegroups.com/web/Paris_12-080422_0687-34-00001_0001312_box_0005_boxes.png?gda=pgvY8moAAAAq5Pp34OGAuWVwGRkvOnHabRkL_yLtSqEDTbGzFn1v-X8wOvnLz5Tja6xhmVIF9MLYgjPKDhWo7fDwKv7hdSobIvfRTlYBT-BD2NUWBDUNMqwfOToRrNOWJtPPKSAn4D797daDQaep90o7AOpSKHW0

I guess that "Faux" and "plafonds" (and maybe even "Faux-plafonds") are present in the default Tesseract dictionary, since the recognition is good with the original Tesseract. However, if I use a new dictionary that I created from a list of about 350k French words, using wordlist2dawg to create "fra.word-dawg" and rebuilding "fra.traineddata", and then run Tesseract on the same image, the recognition is "Foux-plufonds". "Foux-plufonds" is not in my list, and neither are "Foux" or "plufonds", whereas "Faux-plafonds", "Faux" and "plafonds" all are. If you have any ideas to help me with this too, I would be very grateful.

Next, I will try feeding Tesseract one character image at a time to simplify the recognition further, but if you have any other ideas to improve it, I am definitely interested.

Thank you in advance for any help you can provide,
Jonathan