Here's a patch to Tesseract that someone else posted; it implements hOCR file output (which includes coordinate information). I'm not sure which version it patches against:
http://code.google.com/p/tesseract-ocr/issues/detail?id=263

--Sven

On Tue, May 4, 2010 at 9:40 PM, Sven Pedersen <[email protected]> wrote:
> OCRopus, which can use Tesseract as its engine, has support for some
> position information being output -- segmentation and some other
> things:
>
> check out their docs on "file formats"
> https://docs.google.com/View?id=dfxcv4vc_92c8xxp7
>
> --Sven
>
> On Tue, May 4, 2010 at 12:56 PM, lux <[email protected]> wrote:
>> No, it must be something given by tesseract, because there could be
>> more red than black (the font color in this example) and so it would
>> all screw up!
>> Anyway, I can just get the text from tesseract beforehand with the box
>> positions... but the problem is that I also need the exact color of
>> the word tesseract picked up.
>>
>> Tesseract surely stores the positions of the text when it processes the
>> image, but the point is... is there a way to get these?
>>
>> On 3 May, 21:01, Sven Pedersen <[email protected]> wrote:
>>> Using filters to cancel out colors other than the target color, it
>>> should be possible to iteratively extract text of a certain color (say
>>> red, green, blue, black, etc.) But that would be hard. Generally
>>> people just want to get the text and fix the colors later.
>>> --Sven
>>>
>>> On Sun, May 2, 2010 at 1:41 PM, Sandro Zahra <[email protected]> wrote:
>>> > I think that OCR is not about colours.....
>>>
>>> > On 2 May 2010 17:35, lux <[email protected]> wrote:
>>>
>>> >> I need the RIGHT position of the text or the RIGHT color, not an
>>> >> average color :/.
>>>
>>> >> On 11 Apr, 20:48, MARTIN Pierre <[email protected]> wrote:
>>> >> > > So how can I get the position of text?
>>> >> > > I've tried with makebox but it's not really right: it gives me the
>>> >> > > coordinates of the whole "letter box", so it's impossible for me to
>>> >> > > get the right pixel of the letter
>>> >> > > (e.g. it would work for an 'I', but for an 'A' it gives me the box's
>>> >> > > top-left and bottom-right positions, so I don't know how to get the
>>> >> > > letter color, because the 'A' is not at the start nor at the end of
>>> >> > > the box).
>>>
>>> >> > That's the right method. If you want to know where the "pixels" are,
>>> >> > do a histogram equalization of your picture, then contrast it with a
>>> >> > fairly aggressive threshold (if it's not already in 1bpp); this will
>>> >> > give you a copy of your picture with only black and white pixels.
>>> >> > It's on this picture (basically a 1bpp-depth picture) that you run
>>> >> > tesseract.
>>> >> > Then, given the boxes, you look in your black & white picture for
>>> >> > where the black pixels are in the boxes, and with the same
>>> >> > coordinates you can find them in your original picture. After that,
>>> >> > do a color average over those pixels in each box in your original
>>> >> > picture and you're good.
>>>
>>> >> > Pierre.

--
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
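For anyone following the thread, here is a minimal Python sketch of the approach Pierre describes: use the binarized copy to find which pixels inside a box are ink, then read the colors at the same coordinates in the original image. Everything in it is an assumption for illustration, not anything Tesseract provides: images are plain nested lists (`original[y][x]` is an RGB tuple, `binary[y][x]` is 1 for ink), boxes are `(left, top, right, bottom)` with a top-left origin, and the function names are made up. Since box files, as far as I know, measure y from the bottom of the image, a small conversion helper is included. Besides the average Pierre suggests, it also returns the most frequent ink color, which may be closer to the "exact" color lux is after.

```python
from collections import Counter


def to_top_left(left, bottom, right, top, image_height):
    """Convert a bottom-left-origin box into (left, top, right, bottom)
    measured from the top-left corner of the image."""
    return left, image_height - top, right, image_height - bottom


def ink_colors(original, binary, box):
    """Return the original-image colors of every pixel inside `box`
    that is marked as ink (1) in the binarized copy."""
    left, top, right, bottom = box
    return [
        original[y][x]
        for y in range(top, bottom)
        for x in range(left, right)
        if binary[y][x] == 1
    ]


def average_color(colors):
    """Per-channel integer mean of the colors, or None if there is no ink."""
    if not colors:
        return None
    n = len(colors)
    return tuple(sum(c[i] for c in colors) // n for i in range(3))


def dominant_color(colors):
    """Most frequent color among the ink pixels, or None if there is no ink."""
    if not colors:
        return None
    return Counter(colors).most_common(1)[0][0]
```

In practice the nested lists would come from whatever image library is in use, and the boxes from the box file or the hOCR output; only the ink pixels contribute, so the background inside the box no longer skews the result.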

