The hOCR specification states that ocr_carea is a "content area" which "used to be called ocr_column".
Using ScrollView, I have checked that in my example, http://fleksem.klf.uw.edu.pl/~jsbien/Linde_pol+deu-frak/, the columns are recognized correctly, but the information about them is not stored in the hOCR output. On the other hand, for this single page I get 48 ocr_carea elements, which appear to be equivalent to tesseract's blocks (as their identifiers, "block...", also suggest).

My suggestions are:

1. Store the block information as ocrx_block.
2. Store the column segmentation information in some way.

The simplest way to do the latter is to use ocr_carea for columns, which would require no change in ocrodjvu and related tools. However, if the actual tesseract columns differ too much from the intended use of this element (in my example the columns include the running head), a special ocrx element can be created, as stated by the author of the specification in https://groups.google.com/forum/?fromgroups=#!topic/ocropus/eXl27_75Fm8:

"If there is something engine-specific you need, pick an ocrx_... tag that doesn't conflict with an existing one."

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
[email protected], [email protected], http://fleksem.klf.uw.edu.pl/~jsbien/

