Hi All,

This is Sai. I want to develop an Android app that can detect the text contained in a screenshot. I first pre-process the image to remove noise, such as edges and any other unwanted elements I can remove. I also adjust the DPI of the image to meet Tesseract's requirements by enlarging it with an interpolation method such as cubic or B-spline. I then want to pass this cleaned image to Tesseract-OCR, which I found to be a good open-source OCR engine.

However, I'm stuck at the OCR step, because Tesseract does not segment the page the way I want. From what I have learned after a fair amount of research, once the image has been converted to grayscale and thresholded, Tesseract assumes it is a block of text on a page and applies its internal line-finding algorithm to fit the text between a baseline and a mean line. I don't think this helps in my situation, because the text may be aligned like this (http://tsndiffopera.in/problem.jpg), where the baseline and mean line differ between the two blocks of text. Tesseract nevertheless fits both blocks to the same line, and as a result 'i' is often detected as 'l'. I have many such failure cases.

So my question is: is there any way to overcome this, for example by changing the segmentation algorithm Tesseract uses? Can I implement my own segmentation algorithm that divides the page into blocks of text and identifies each word, assuming that any two words are separated by some minimum distance (let's say one space), either horizontally or vertically? Does anyone have any resources (related research papers, for example) for achieving this? If someone has overcome this situation before, please tell me how.
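To make the kind of segmentation I have in mind more concrete, here is a minimal sketch in plain Python (not Android code, and not what Tesseract does internally). It uses a recursive X-Y cut style split over projection profiles, which is just one technique I picked for illustration. It assumes the image is already thresholded into rows of 0/1 values, with 1 meaning ink. The names find_runs, xy_cut and min_gap are hypothetical.

```python
def find_runs(profile, min_gap):
    """Return (start, end) ranges (end exclusive) that contain ink.

    `profile` is a list of booleans, one per row or column.
    Neighbouring ink is kept in one run unless it is separated by
    at least `min_gap` consecutive empty entries.
    """
    runs = []
    start = None  # index where the current run began
    gap = 0       # length of the empty stretch seen since the last ink
    for i, has_ink in enumerate(profile):
        if has_ink:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The gap is wide enough: close the run at the last ink.
                runs.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close a run that reaches the end, dropping any trailing gap.
        runs.append((start, len(profile) - gap))
    return runs


def xy_cut(image, min_gap, box=None):
    """Split a binary image into blocks as (top, bottom, left, right).

    Bottom and right are exclusive. Blocks are separated wherever an
    empty band of at least `min_gap` pixels crosses the current box,
    either horizontally or vertically.
    """
    if box is None:
        box = (0, len(image), 0, len(image[0]) if image else 0)
    top, bottom, left, right = box

    # Projection profiles: does each row / column of the box hold ink?
    row_profile = [any(image[r][left:right]) for r in range(top, bottom)]
    col_profile = [any(image[r][c] for r in range(top, bottom))
                   for c in range(left, right)]

    row_runs = [(top + a, top + b) for a, b in find_runs(row_profile, min_gap)]
    col_runs = [(left + a, left + b) for a, b in find_runs(col_profile, min_gap)]

    if not row_runs or not col_runs:
        return []  # no ink in this box
    if len(row_runs) == 1 and len(col_runs) == 1:
        # No wide gap left in either direction: emit the tight box.
        (r0, r1), (c0, c1) = row_runs[0], col_runs[0]
        return [(r0, r1, c0, c1)]

    blocks = []
    for r0, r1 in row_runs:
        for c0, c1 in col_runs:
            blocks.extend(xy_cut(image, min_gap, (r0, r1, c0, c1)))
    return blocks


def make_demo_image():
    """Toy 8x12 image: a small block on the left, a lower one on the right."""
    image = [[0] * 12 for _ in range(8)]
    for r in (1, 2):
        for c in range(0, 4):
            image[r][c] = 1
    for r in (4, 5, 6):
        for c in range(8, 12):
            image[r][c] = 1
    return image


if __name__ == "__main__":
    print(xy_cut(make_demo_image(), min_gap=2))
```

On the toy image, which has two blocks sitting at different heights, a min_gap of 2 pixels gives one box per block, while a min_gap larger than the horizontal distance between them gives a single merged box. I assume each box could then be cropped and passed to Tesseract on its own, so that each block gets its own baseline fit, but I have not verified that this fixes the 'i'/'l' confusion.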
Thanks,
Sai (www.tsndiffopera.in)

