Thanks for the idea. I will try to have a look at it. If anybody has a patch ready, I will warmly welcome it.
Zdenko

On Tue, Apr 2, 2013 at 7:16 AM, "Janusz S. Bień" <[email protected]> wrote:

> The hOCR specification states that ocr_carea is "content area" which "used
> to be called ocr_column".
>
> I've checked with ScrollView that in my example
>
> http://fleksem.klf.uw.edu.pl/~jsbien/Linde_pol+deu-frak/
>
> the columns are correctly recognized, but information about them is not
> stored in hOCR. On the other hand I've got for this single page 48
> ocr_area elements, which seem to be actually equivalent to tesseract's
> blocks (as suggested also by their identifiers: "block...").
>
> My suggestions are:
>
> 1. store the block information as ocrx_block
> 2. store somehow the information about column segmentation.
>
> The simplest way is to use just ocr_area for columns (this will require no
> change in ocrodjvu and related tools), but if the actual tesseract columns
> differ too much from the intended use of this element (in my example
> columns include the running head), a special ocrx element can be created,
> as stated by the author of the specification:
>
> https://groups.google.com/forum/?fromgroups=#!topic/ocropus/eXl27_75Fm8
>
> "If there is something engine-specific you need, pick an ocrx_... tag
> that doesn't conflict with an existing one."
>
> Best regards
>
> Janusz
>
> --
> Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra
> Lingwistyki Formalnej)
> Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
> [email protected], [email protected],
> http://fleksem.klf.uw.edu.pl/~jsbien/
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
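For anyone who wants to check how many area elements a given hOCR file contains (such as the 48 reported for the single page above), here is a minimal Python sketch. It assumes the hOCR output can be read by the standard-library HTML parser. The function name `count_hocr_classes` and the sample markup are illustrative only and are not taken from tesseract or ocrodjvu.

```python
from collections import Counter
from html.parser import HTMLParser


class _ClassCounter(HTMLParser):
    """Counts hOCR class names (ocr_* and ocrx_*) seen on start tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # An element may carry several space-separated classes.
                for cls in value.split():
                    if cls.startswith(("ocr_", "ocrx_")):
                        self.counts[cls] += 1


def count_hocr_classes(hocr_text):
    """Return a Counter mapping each hOCR class name to its number of elements."""
    parser = _ClassCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.counts


# Hypothetical sample, shaped like typical hOCR output; not real tesseract data.
SAMPLE = """
<div class='ocr_page' id='page_1'>
  <div class='ocr_carea' id='block_1_1'>
    <p class='ocr_par'><span class='ocr_line'>first</span></p>
  </div>
  <div class='ocr_carea' id='block_1_2'>
    <p class='ocr_par'><span class='ocr_line'>second</span></p>
  </div>
</div>
"""

if __name__ == "__main__":
    for cls, n in sorted(count_hocr_classes(SAMPLE).items()):
        print(cls, n)
```

Running the same function on a real hOCR file (read as text) would show whether column-level elements are present alongside the block-level ones.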

