Hello Thomas,

Thank you very much for your detailed answers. I will start with
ISegmentPage and the Lua scripts, and post here when I run into
problems.

Just to be clear, I am not looking for production-quality code, as I
am doing research myself. I would just like to build on others' work
where possible instead of attempting to (poorly) reinvent the wheel.

Your "Structural Mixtures for Statistical Layout Analysis" paper looks
very interesting. I would love to play with the code for that if and
when it becomes available.

Cheers,
Ryan

On Wed, Dec 10, 2008 at 1:05 AM, Thomas Breuel <[EMAIL PROTECTED]> wrote:
> Hi,
>
>> I am trying to do some analysis of the journal content and wish to
>> assemble sentences from the OCR text. I need to identify things like
>> page headers and footers, footnotes, pullquotes, etc. so that I can
>> ignore them and just assemble sentences from the main body text.
>> Though the OCR XML I have contains region information, it isn't very
>> reliable for identifying these kinds of layout elements.
>>
>> So, my questions are:
>>
>> 1. Is OCRopus suitable for this task?
>
> OCRopus has functions that will help you, but it's not a turnkey solution.
> In particular, right now, OCRopus has good physical layout analysis (finding
> "text blocks").
>
>> 2. If so, where do I start looking to see how to use OCRopus to do this?
>
> The primary layout analysis methods are the ones implemented by
> ISegmentPage.  In addition, OCRopus contains some Lua scripts that implement
> a limited form of logical labeling.
>
>>
>> 3. Of the layout analysis algorithms included with OCRopus, which do
>> you recommend for this task?
>
> The RAST algorithm will perform physical layout analysis for you.  OCRopus
> as a whole does some limited logical layout analysis.
>
>> 4. Should I be analyzing the page images directly, or could I use the
>> region, line, and word bounding boxes from the OCR XML?
>
> Which one is better depends on how good the preprocessing is.
>
> There is a lot of research-quality code that we and others have for more
> sophisticated kinds of layout analysis and that isn't part of OCRopus yet.
> The bottleneck is funding and people to clean it up and wrap it up (see
> below); there is a big difference between code that one can run by hand over
> a bunch of databases, and code that's actually useful for production.
>
> Upcoming releases of OCRopus will incorporate additional forms of logical
> layout analysis, including trainable logical layout analysis.
>
> Cheers,
> Thomas.
>
> PS: here are two pointers to the kinds of approaches we are pursuing.  Our
> focus is on methods that can be made robust even in the presence of noise
> and other image quality problems.
>
> Structural Mixtures for Statistical Layout Analysis
> Faisal Shafait, Joost van Beusekom, Daniel Keysers, Thomas M. Breuel
> Proc. 8th Int. Workshop on Document Analysis Systems (DAS). Accepted for
> publication.
>
> Layout Analysis by Exploring the Space of Segmentation Parameters
> T.M. Breuel
> Proceedings of the International Association for Pattern Recognition
> Workshop (Document Analysis Systems). Also selected for inclusion in the
> post-conference book.
