Thanks for the idea. I will try to have a look at it.
If anybody has a patch ready, I will welcome it warmly.

Zdenko


On Tue, Apr 2, 2013 at 7:16 AM, "Janusz S. Bień" <[email protected]> wrote:

> The hOCR specification states that ocr_carea is "content area" which "used
> to be called ocr_column".
>
> I've checked with ScrollView that in my example
>
> http://fleksem.klf.uw.edu.pl/~jsbien/Linde_pol+deu-frak/
>
> the columns are correctly recognized, but information about them is not
> stored in hOCR. On the other hand, for this single page I get 48
> ocr_carea elements, which actually seem to be equivalent to Tesseract's
> blocks (as their identifiers, "block...", also suggest).
>
> My suggestions are:
>
> 1. store the block information as ocrx_block
> 2. somehow store the information about column segmentation.
>
> The simplest way is to use just ocr_carea for columns (this will require no
> change in ocrodjvu and related tools), but if the actual tesseract columns
> differ too much from the intended use of this element (in my example
> columns include the running head), a special ocrx element can be created,
> as stated by the author of the specification:
>
> https://groups.google.com/forum/?fromgroups=#!topic/ocropus/eXl27_75Fm8
>
> "If there is something engine-specific you need, pick an ocrx_... tag
> that doesn't conflict with an existing one."
>
> Best regards
>
> Janusz
>
> --
> Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra
> Lingwistyki Formalnej)
> Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
> [email protected], [email protected],
> http://fleksem.klf.uw.edu.pl/~jsbien/
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>
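
To make the observation in the quoted message easier to check on other pages, here is a minimal sketch that counts hOCR elements per class and lists their identifiers. It uses only the Python standard library. The names HocrClassCounter and summarize, and the SAMPLE fragment, are hypothetical and written by hand for illustration; real Tesseract output will differ in detail.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrClassCounter(HTMLParser):
    """Count hOCR elements per class and remember the ids seen for each class."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.ids = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # hOCR marks the element type in the class attribute (ocr_* or ocrx_*).
        for cls in (attr.get("class") or "").split():
            if cls.startswith(("ocr_", "ocrx_")):
                self.counts[cls] += 1
                self.ids.setdefault(cls, []).append(attr.get("id"))


# Hypothetical hand-written fragment, not actual output for the page discussed.
SAMPLE = """
<div class='ocr_page' id='page_1' title='bbox 0 0 1000 1400'>
 <div class='ocr_carea' id='block_1_1' title='bbox 50 40 950 90'>
  <span class='ocr_line' id='line_1_1' title='bbox 50 40 950 90'>running head</span>
 </div>
 <div class='ocr_carea' id='block_1_2' title='bbox 50 120 480 1350'>
  <span class='ocr_line' id='line_1_2' title='bbox 50 120 480 160'>left column</span>
 </div>
</div>
"""


def summarize(hocr_text):
    """Return (counts per class, ids per class) for an hOCR document string."""
    parser = HocrClassCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.counts, parser.ids


counts, ids = summarize(SAMPLE)
print(counts["ocr_carea"], ids["ocr_carea"])
```

Run on a real hOCR file (read into a string and passed to summarize), this should show how many ocr_carea elements a page has and whether their ids follow the "block..." pattern.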


