Hi Walter,

I think it's worth taking a look at the OCRopus project
(http://code.google.com/p/ocropus/). As far as I know, it offers good
segmentation for this type of image. At the time I looked into it, its
layout analysis was based on Thomas Breuel's work on the whitespace
cover approach, particularly his "Two Geometric Algorithms for Layout
Analysis" (2002), and maybe also his "Layout Analysis based on Text
Line Segment Hypotheses" (2003). So you could even implement these
approaches yourself from the papers.
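
If you want something to experiment with before reading the papers, here
is a minimal sketch of a much simpler technique: a horizontal projection
profile. To be clear, this is not Breuel's whitespace cover algorithm and
not what OCRopus does. It only looks for rows with no ink and cuts the
image there, so no text line is chopped in half. It assumes dark text on
a light background, a grayscale image already loaded as a NumPy array,
and little or no skew. The function names and the threshold values are
made up for illustration.

```python
import numpy as np


def blank_row_runs(gray, ink_threshold=128, max_ink_pixels=0):
    """Return (start, end) pairs of consecutive rows with (almost) no ink.

    gray is a 2-D array; pixels darker than ink_threshold count as ink.
    end is exclusive, as in Python slices.
    """
    ink_per_row = (gray < ink_threshold).sum(axis=1)
    blank = ink_per_row <= max_ink_pixels
    runs = []
    start = None
    for y, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = y
        elif not is_blank and start is not None:
            runs.append((start, y))
            start = None
    if start is not None:
        runs.append((start, len(blank)))
    return runs


def choose_cuts(gray, max_tile_height, **kwargs):
    """Pick cut rows in the middle of blank runs.

    Greedy: from the previous cut, jump to the furthest blank-run centre
    that keeps the tile no taller than max_tile_height.
    """
    height = gray.shape[0]
    candidates = [(s + e) // 2 for s, e in blank_row_runs(gray, **kwargs)]
    cuts = []
    last = 0
    while height - last > max_tile_height:
        reachable = [c for c in candidates if last < c <= last + max_tile_height]
        if not reachable:
            raise ValueError("no blank row within %d rows of row %d"
                             % (max_tile_height, last))
        last = reachable[-1]
        cuts.append(last)
    return cuts


def split_rows(gray, cuts):
    """Split the image into horizontal strips at the given cut rows."""
    bounds = [0] + list(cuts) + [gray.shape[0]]
    return [gray[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
```

The same functions can be applied to columns by passing the transposed
array (gray.T), which gives a crude way to separate columns of text.
Each resulting strip can then be fed to Tesseract on its own. On skewed
or noisy scans there may be no fully blank rows, in which case
max_ink_pixels would need to be raised or the image deskewed first.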

HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com




On Fri, Nov 18, 2011 at 9:26 PM, walter23 <walte...@gmail.com> wrote:
> Hi Dmitri,
>
> Thanks for your response.  I figured some kind of custom segmentation
> was going to be required.   Any suggestions you can make to help would
> be appreciated - I was thinking perhaps I would use some tools from
> OpenCV or something but I'm not really sure where to read up on
> segmentation approaches.
>
> Here's a sample image:
>
> http://i.imgur.com/6he8V.jpg
>
> This is not actually an image I have worked with.  It's just a
> representative sample pulled at random from a web image search, since
> my sample image contains proprietary information that I can't share.
>
> Actual resolution is in the 14,000 x 10,000 range.
>
>
> -Walter
>
>
> On Nov 17, 10:04 pm, Dmitri Silaev <daemons2...@gmail.com> wrote:
>> There's no other way to achieve this than to help Tesseract with
>> segmentation and feed it chopped image pieces. Many segmentation
>> approaches exist, but which one you should choose depends on the
>> specifics of your image: how long the text lines are, whether it is a
>> multicolumn layout or not, possible skew and flatness of the whole
>> image, and many more.
>>
>> Send your sample images to get more practical advice.
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>> On Fri, Nov 18, 2011 at 12:59 AM, walter23 <walte...@gmail.com> wrote:
>> > I'm trying to come up with a method to OCR very large images (poster
>> > sized) with lots of regular sized text... for example 40" wide with 12
>> > point font.  One big limitation I have is that memory is easily
>> > exhausted with images that take up half a gigabyte or more of RAM
>> > (40x30" @ 300DPI is pretty big).
>>
>> > I am trying to find out a smart method of automatically reducing the
>> > image to continuous regions of text so that I do not chop text lines
>> > in half (either horizontally or vertically).
>>
>> > One idea was to maybe use page segmentation on a lower resolution
>> > image and use this page layout to split the image up, but looking at
>> > the layout results I see some problems with this.
>>
>> > Has anybody tackled this kind of problem before?  Suggestions for
>> > approaches to take?
>>
>> > Many thanks
>>
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups "tesseract-ocr" group.
>> > To post to this group, send email to tesseract-ocr@googlegroups.com
>> > To unsubscribe from this group, send email to
>> > tesseract-ocr+unsubscr...@googlegroups.com
>> > For more options, visit this group at
>> > http://groups.google.com/group/tesseract-ocr?hl=en
