Hello Thomas,

Thank you very much for your detailed answers. I will start with ISegmentPage and the Lua scripts, and post here when I run into problems.

Just to be clear, I am not looking for production-quality code, as I am doing research myself. I would just like to build on others' work where possible instead of attempting to (poorly) reinvent the wheel. Your "Structural Mixtures for Statistical Layout Analysis" paper looks very interesting--I would love to play with the code for that when and if it becomes available.

Cheers,
Ryan

On Wed, Dec 10, 2008 at 1:05 AM, Thomas Breuel <[EMAIL PROTECTED]> wrote:
> Hi,
>
>> I am trying to do some analysis of the journal content and wish to
>> assemble sentences from the OCR text. I need to identify things like
>> page headers and footers, footnotes, pullquotes, etc. so that I can
>> ignore them and just assemble sentences from the main body text.
>> Though the OCR XML I have contains region information, it isn't very
>> reliable for identifying these kinds of layout elements.
>>
>> So, my questions are:
>>
>> 1. Is Ocropus suitable for this task?
>
> OCRopus has functions that will help you, but it's not a turnkey solution.
> In particular, right now, OCRopus has good physical layout analysis (finding
> "text blocks").
>
>> 2. If so, where do I start looking to see how to use Ocropus to do this?
>
> The primary layout analysis methods are the ones implemented by
> ISegmentPage. In addition, OCRopus contains some Lua scripts that implement
> a limited form of logical labeling.
>
>>
>> 3. Of the layout analysis algorithms included with Ocropus, which do
>> you recommend for this task?
>
> The RAST algorithm will perform physical layout analysis for you. OCRopus
> as a whole does some limited logical layout analysis.
>
>> 4. Should I be analyzing the page images directly, or could I use the
>> region, line, and word bounding boxes from the OCR XML?
>
> Which one is better depends on how good the preprocessing is.
>
> There is a lot of research-quality code that we and others have for more
> sophisticated kinds of layout analysis and that isn't part of OCRopus yet.
> The bottleneck is funding and people to clean it up and wrap it up (see
> below); there is a big difference between code that one can run by hand over
> a bunch of databases, and code that's actually useful for production.
>
> Upcoming releases of OCRopus will incorporate additional forms of logical
> layout analysis, including trainable logical layout analysis.
>
> Cheers,
> Thomas.
>
> PS: here are two pointers to the kinds of approaches we are pursuing. Our
> focus is on methods that can be made robust even in the presence of noise
> and other image quality problems.
>
> Structural Mixtures for Statistical Layout Analysis
> Faisal Shafait, Joost van Beusekom, Daniel Keysers, Thomas M. Breuel
> Proc. 8th Int. Workshop on Document Analysis Systems (DAS) Accepted for
> publication
>
> Layout Analysis by Exploring the Space of Segmentation Parameters
> T.M. Breuel
> Proceedings of the International Association for Pattern Recognition
> Workshop (Document Analysis Systems) Also selected for inclusion in the
> post-conference book.
