You can definitely get just the layout analysis, before text recognition:
look at the FindLinesCreateBlockList() API and the BLOCK_LIST data
structure. You can then iterate through that structure to look at the
blocks and the rows within each block. Keep in mind that on anything
more complex than a simple page, a sentence in the image can be split
across separate boxes altogether, so you'll have to stitch together
rows from entirely different boxes yourself, based on their
coordinates. There are even cases where you might get "Patrick"
returned as one row containing "Ptrik" and one row containing "ic".
It's rare, but it does happen, especially when the text line has a
slope (even a very moderate one).
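
For the stitching part, here is a rough sketch of one way to do it. It is
not Tesseract code: RowBox, VerticalOverlap and StitchRows are names I made
up for illustration, and it assumes you have already copied each row's
bounding box out of the BLOCK_LIST into plain image coordinates with y
growing downward. Rows whose boxes overlap enough vertically are treated as
one visual line and then ordered left to right. The 0.5 threshold is a
guess you would need to tune.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for one row's bounding box (not a Tesseract type).
// Coordinates are assumed to be image pixels with y growing downward.
struct RowBox {
    int left, top, right, bottom;
};

// Fraction of the shorter box's height that the two boxes share vertically.
double VerticalOverlap(const RowBox& a, const RowBox& b) {
    int shared = std::min(a.bottom, b.bottom) - std::max(a.top, b.top);
    int shorter = std::min(a.bottom - a.top, b.bottom - b.top);
    if (shared <= 0 || shorter <= 0) return 0.0;
    return static_cast<double>(shared) / shorter;
}

// Group rows that sit on the same visual line, regardless of which block
// they came from, then order each group left to right.
std::vector<std::vector<RowBox>> StitchRows(std::vector<RowBox> rows,
                                            double min_overlap = 0.5) {
    // Process rows top to bottom so lines come out in reading order.
    std::sort(rows.begin(), rows.end(),
              [](const RowBox& a, const RowBox& b) { return a.top < b.top; });

    std::vector<std::vector<RowBox>> lines;
    for (const RowBox& row : rows) {
        bool placed = false;
        for (auto& line : lines) {
            // Compare against the most recently added row of that line.
            if (VerticalOverlap(line.back(), row) >= min_overlap) {
                line.push_back(row);
                placed = true;
                break;
            }
        }
        if (!placed) lines.push_back(std::vector<RowBox>{row});
    }

    for (auto& line : lines) {
        std::sort(line.begin(), line.end(),
                  [](const RowBox& a, const RowBox& b) { return a.left < b.left; });
    }
    return lines;
}
```

A plain overlap test like this copes with small vertical offsets between
neighbouring boxes, but for a long sloped line you would probably have to
compare each row only with its nearest horizontal neighbour, or estimate
the slope first.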

Patrick

On Jun 19, 4:07 pm, Prodoc <agebo...@gmail.com> wrote:
> Hi,
>
> In version 3 of tesseract-ocr there's a new page layout analysis
> module. I'm interested to learn in what way it is used and how it can
> be used.
>
> Does it provide additional user functionality or is it only used
> internally? I.e. can I query it somehow to output all recognized text
> areas (position and dimensions) without its actual text content?
> Does it have any influence on the mark-up of the text output? I.e.
> e.g. additional line breaks between text in case of a new paragraph.
> I've played with the different pagesegmode values (0-3) but it gives
> me the exact same output for each of them. Do these settings have
> anything to do with the layout analysis?
>
> If recognizing text areas is what it does but you can't output just
> the position and dimensions of them, it would be great to see this as
> a new feature. In a program like gImageReader you have to do this
> manually, OCRFeeder tries to do it automatically. If tesseract-ocr's
> analysis is more accurate, one could use that as an input for
> OCRFeeder again.
>
> Yours,
>
> Age Bosma
