Here's a patch to Tesseract that someone else posted. It implements hOCR
file output, which includes coordinate information. I'm not sure which
version it patches against:

http://code.google.com/p/tesseract-ocr/issues/detail?id=263
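
Once you have an hOCR file, pulling the word boxes out of it could look roughly like the sketch below. It assumes the words are written as span elements with class "ocr_word" or "ocrx_word" and a title attribute of the form "bbox x0 y0 x1 y1" (top-left origin), as the hOCR format describes. I have not checked what the patch actually emits, so treat the class names as an assumption. The parse_hocr_words helper is a made-up name for illustration.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each word-level hOCR span."""

    # Assumption: word elements carry one of these class names.
    WORD_CLASSES = {"ocr_word", "ocrx_word"}

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        match = BBOX_RE.search(attrs.get("title") or "")
        if classes & self.WORD_CLASSES and match:
            self._bbox = tuple(int(v) for v in match.groups())
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes no span is nested inside a word span.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (word, bbox) pairs found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Each returned box can then be used to look up pixels in the original color image.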

--Sven

On Tue, May 4, 2010 at 9:40 PM, Sven Pedersen <[email protected]> wrote:
> OCRopus, which can use Tesseract as its engine, has support for some
> position information being output -- segmentation and some other
> things:
>
> check out their docs on "file formats"
> https://docs.google.com/View?id=dfxcv4vc_92c8xxp7
>
> --Sven
>
>
> On Tue, May 4, 2010 at 12:56 PM, lux <[email protected]> wrote:
>> No, it has to be something Tesseract provides, because there could be
>> more red than black (the font color in this example) and then it would
>> all go wrong!
>> Anyway, I can already get the text from Tesseract along with the box
>> positions... but the problem is that I also need the exact color of
>> the word Tesseract picked up.
>>
>> Tesseract surely stores the positions of the text when it processes the
>> image, but the point is... is there a way to get them?
>>
>> On 3 May, 21:01, Sven Pedersen <[email protected]> wrote:
>>> Using filters to cancel out colors other than the target color, it
>>> should be possible to iteratively extract text of a certain color (say
>>> red, green, blue, black, etc.) But that would be hard. Generally
>>> people just want to get the text and fix the colors later.
>>> --Sven
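
The color-filter idea above can be sketched as follows. This is only an illustration under stated assumptions: the image is a plain list of rows of (r, g, b) tuples, "close to the target color" means Euclidean distance in RGB, and isolate_color and its tolerance default are made-up names and values, not anything Tesseract provides.

```python
def isolate_color(pixels, target, tolerance=60):
    """Keep only pixels close to `target`.

    pixels: list of rows, each row a list of (r, g, b) tuples.
    Returns a grayscale image of the same shape where pixels near the
    target color become 0 (ink) and everything else becomes 255 (paper).
    """
    def near(rgb):
        # Squared Euclidean distance in RGB space against the tolerance.
        return sum((a - b) ** 2 for a, b in zip(rgb, target)) <= tolerance ** 2

    return [[0 if near(rgb) else 255 for rgb in row] for row in pixels]
```

Running OCR once per target color on the filtered image would then yield only the text of that color, at the cost of one pass per color.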
>>>
>>> On Sun, May 2, 2010 at 1:41 PM, Sandro Zahra <[email protected]> wrote:
>>> > I think that OCR is not about colours.....
>>>
>>> > On 2 May 2010 17:35, lux <[email protected]> wrote:
>>>
>>> >> I need the RIGHT position of the text or the RIGHT color, not an
>>> >> average color :/.
>>>
>>> >> On 11 Apr, 20:48, MARTIN Pierre <[email protected]> wrote:
>>> >> > > So how can I get the position of text?
>>> >> > > I've tried makebox, but it's not quite right: it gives me the
>>> >> > > coordinates of the whole "letter box", so I can't find the exact
>>> >> > > pixels of the letter (e.g. it would work for an 'I', but for an 'A'
>>> >> > > it gives me the top-left and bottom-right corners of the box, so I
>>> >> > > don't know how to get the letter color, because the 'A' is at
>>> >> > > neither the start nor the end of the box).
>>>
>>> >> > That's the right method. If you want to know where the "pixels" are,
>>> >> > do a histogram equalization of your picture, then apply a fairly
>>> >> > aggressive threshold (if it's not already 1bpp). This gives you a
>>> >> > copy of your picture with only black and white pixels. That 1bpp
>>> >> > picture is the one you run tesseract on.
>>> >> > Then, given the boxes, you look in your black & white picture for the
>>> >> > black pixels inside each box, and use the same coordinates to find
>>> >> > them in your original picture. After that, average the color of
>>> >> > those pixels for each box in your original picture and you're good.
>>>
>>> >> > Pierre.
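
The box-plus-mask lookup described above could be sketched like this. Assumptions, all mine: the image is a list of rows of (r, g, b) tuples, the box is (left, top, right, bottom) with a top-left origin and exclusive right/bottom edges, and "ink" is any pixel whose luma falls below a fixed threshold (so dark text on a light background). Instead of averaging, this version returns the most frequent ink color, which is a swapped-in choice meant to give an exact color rather than a blend. Note that Tesseract box files count y from the bottom of the image, so those coordinates would need flipping first. The function names are hypothetical.

```python
from collections import Counter


def ink_colors_in_box(pixels, box, threshold=128):
    """Count the original colors of dark ("ink") pixels inside a box.

    pixels: list of rows, each row a list of (r, g, b) tuples.
    box: (left, top, right, bottom), top-left origin, right/bottom exclusive.
    """
    left, top, right, bottom = box
    counts = Counter()
    for y in range(top, bottom):
        for x in range(left, right):
            r, g, b = pixels[y][x]
            # Rec. 601 luma as a simple stand-in for the thresholded image.
            luma = 0.299 * r + 0.587 * g + 0.114 * b
            if luma < threshold:
                counts[(r, g, b)] += 1
    return counts


def dominant_ink_color(pixels, box, threshold=128):
    """Return the most common ink color in the box, or None if no ink."""
    counts = ink_colors_in_box(pixels, box, threshold)
    return counts.most_common(1)[0][0] if counts else None
```

Because only pixels that pass the threshold are counted, the empty corners of an 'A' box do not affect the result.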
>>>
>>> >> --
>>> >> You received this message because you are subscribed to the Google Groups
>>> >> "tesseract-ocr" group.
>>> >> To post to this group, send email to [email protected].
>>> >> To unsubscribe from this group, send email to
>>> >> [email protected].
>>> >> For more options, visit this group at
>>> >>http://groups.google.com/group/tesseract-ocr?hl=en.
>>>
>>> --
>>> ``All that is gold does not glitter,
>>>   not all those who wander are lost;
>>> the old that is strong does not wither,
>>>   deep roots are not reached by the frost.
>>> From the ashes a fire shall be woken,
>>>   a light from the shadows shall spring;
>>> renewed shall be blade that was broken,
>>>   the crownless again shall be king.”
>>




