The hOCR specification states that ocr_carea is a "content area" which
"used to be called ocr_column".

I've checked with ScrollView that in my example

http://fleksem.klf.uw.edu.pl/~jsbien/Linde_pol+deu-frak/

the columns are correctly recognized, but information about them is not
stored in the hOCR output. On the other hand, for this single page I got
48 ocr_carea elements, which actually seem to be equivalent to tesseract's
blocks (as their identifiers, "block...", also suggest).
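
For anyone who wants to repeat this count on their own output, here is a
minimal Python sketch using only the standard library. It lists the id and
title of every element whose class is ocr_carea (the class name used in the
specification). The names CareaCollector, list_careas and SAMPLE are made up
for illustration, and the sample markup is invented, not taken from the
page above.

```python
from html.parser import HTMLParser


class CareaCollector(HTMLParser):
    """Collect (id, title) for every element whose class includes ocr_carea."""

    def __init__(self):
        super().__init__()
        self.areas = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attr.get("class") or "").split()
        if "ocr_carea" in classes:
            self.areas.append((attr.get("id"), attr.get("title")))


def list_careas(hocr_text):
    """Return the (id, title) pairs of all ocr_carea elements in hocr_text."""
    collector = CareaCollector()
    collector.feed(hocr_text)
    return collector.areas


# Invented two-block sample (assumption); real input would be the hOCR
# file produced for the page in question.
SAMPLE = """
<div class='ocr_page' id='page_1' title='bbox 0 0 1000 1500'>
  <div class='ocr_carea' id='block_1_1' title='bbox 50 40 950 90'></div>
  <div class='ocr_carea' id='block_1_2' title='bbox 50 120 480 1400'></div>
</div>
"""

if __name__ == "__main__":
    for area_id, title in list_careas(SAMPLE):
        print(area_id, title)
```

Running len(list_careas(...)) on the text of a real hOCR file gives the
number of such elements on the page, and the printed ids show whether they
follow the "block..." naming.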

My suggestions are:

1. store the block information as ocrx_block
2. store the information about column segmentation in some way.

The simplest way is to use just ocr_carea for columns (this would require
no change in ocrodjvu and related tools), but if the actual tesseract
columns differ too much from the intended use of this element (in my
example the columns include the running head), a special ocrx element can
be created, as stated by the author of the specification:

https://groups.google.com/forum/?fromgroups=#!topic/ocropus/eXl27_75Fm8

"If there is something engine-specific you need, pick an ocrx_... tag
that doesn't conflict with an existing one."

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
[email protected], [email protected], http://fleksem.klf.uw.edu.pl/~jsbien/

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
