Re: [tesseract-ocr] noob - no output black text on white background surrounded by color backgrounds borders and images

Allistair Sat, 07 Feb 2015 13:32:05 -0800

The page segmentation mode (PSM) could help you. Mode 6 is fairly good at
finding areas of text with images and other noise around them, but it
will sometimes treat the surrounding noise as text, so cropping is really
the only reliable solution here. Your problem is no different from automatic
number plate recognition, which has to locate a number plate in a natural
scene and process it in real time, frame by frame. If you cannot rely on
fixed coordinates, you will need to develop a routine that finds the likely
text areas algorithmically. For example, I noticed that your text areas have
quite light background colours, so you could start by building "blobs" of
light pixels and inferring crop rectangles from them (just as an example).
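
To make the light-blob idea a bit more concrete, here is a minimal sketch in
plain Python. It assumes the scan has already been loaded as rows of 0-255
greyscale values; the function name find_light_regions and the threshold and
min_area parameters are made up for illustration and would need tuning on
real scans. It is one possible way to do it, not a tested recipe for these
cards.

```python
from collections import deque


def find_light_regions(gray, threshold=200, min_area=4):
    """Return bounding boxes of connected regions of light pixels.

    gray is a list of rows of 0-255 brightness values (an assumed input
    format). Each box is (left, top, right, bottom) with right and bottom
    exclusive. Regions smaller than min_area pixels are dropped as specks.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []

    for y in range(height):
        for x in range(width):
            if seen[y][x] or gray[y][x] < threshold:
                continue

            # Breadth-first flood fill over 4-connected light pixels.
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            area = 0

            while queue:
                cx, cy = queue.popleft()
                area += 1
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                neighbours = ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1))
                for nx, ny in neighbours:
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and not seen[ny][nx] and gray[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        queue.append((nx, ny))

            if area >= min_area:
                boxes.append((left, top, right + 1, bottom + 1))

    return boxes
```

Dark text inside a light panel only punches holes in the blob, so the
bounding box still covers the whole panel. Each box could then be turned into
a crop geometry (width x height + x offset + y offset) for ImageMagick and
the resulting sub-image passed to Tesseract on its own.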


On 7 February 2015 at 19:57, Josh Wolcott <jswolc...@gmail.com> wrote:

> My issue with cropping is that, because of the variance in where the images
> sit in each scan, I end up with a large variance in the cropped results.
> I'll attach two examples from one scan of 9 images. I can't just crop these
> in an automated fashion. I need a solution other than cropping, since
> cropping has gotten me this far and won't get me any farther.
>
> I suppose your confirmation that the text itself is clear enough lets me
> know I need to do some imagemagicking... I was thinking about scanning in
> color, pulling out every color except black (somehow), and letting it run
> on that. In that case I will have random splatterings of black, though, and
> I don't want those to get translated into text.
>
> On Saturday, February 7, 2015 at 2:44:05 PM UTC-5, Dmitri Silaev wrote:
>
>> Go ahead, your idea is what you need. You say you're ready to use
>> ImageMagick to "preprocess" images but are not willing to use it to crop a
>> few regions with text. What's the point? Place all the cards in the scanner
>> the same way, figure out the coordinates of the text regions, extract
>> sub-images with ImageMagick, feed them to Tesseract one by one, et voilà!
>>
>> The text is clear enough to be processed by Tesseract without any further
>> preprocessing.
>>
>> OneNote just has a better text detection routine, so that it gets less
>> confused by graphics.
>>
>> Best regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>>
>>
>> On Sat, Feb 7, 2015 at 10:00 PM, Josh Wolcott <jswo...@gmail.com> wrote:
>>
>>> I'm having pretty poor luck getting this done. I've attached a couple
>>> images to give an idea of what I am trying to do.
>>>
>>> I need to extract the text paragraph at the bottom of the card. I also
>>> need to pull out the title from the top of the card separately. The cards
>>> can be scanned at a maximum of 600 dpi in color; greyscale or B/W are also
>>> options. I had the idea of cutting off the top of the card and running
>>> Tesseract on that separately. However, if I were able to pull the entire
>>> content of the card, I could simply parse out the first line.
>>>
>>> So far I am getting 50% nothing, 20% gobbledygook and 30% sensible text.
>>> I know that OCR can be run on this image OK because OneNote does it
>>> perfectly. But clearly I am doing something wrong or missing something
>>> entirely. Are there pre-processing steps I can take to fix this? I do have
>>> ImageMagick, which I could use to feed Tesseract the image however it likes.
>>>
>>> thanks!
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/66db65a6-55ad-4c17-aaab-f6a705376051%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

