We're currently developing a new open source OCR system, with a focus on digital library applications (www.ocropus.org). As part of this, we needed formats for representing both OCR output and bibliographic metadata, and we have defined two new microformats for this purpose: hOCR and hBIB.
hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and the OCR-related information coexist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.

The hBIB format is a microformat that makes it easy to indicate both where a document has been published and which references it contains (e.g., in reference lists). It is a straightforward embedding of BibTeX into HTML, and it should also be useful for publishing reference lists and for embedding citation information into the output of tools like latex2html.

We're starting to make tools and samples for both formats available at:

http://code.google.com/p/hocr-tools
http://code.google.com/p/hbib-tools

Cheers,
Thomas.
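As a rough illustration of the hOCR idea described above (OCR information carried invisibly inside ordinary HTML), here is a minimal Python sketch that reads line bounding boxes back out of a small document. The sample markup is invented for this example, and the class name "ocr_line" and the "bbox x0 y0 x1 y1" convention in the title attribute are assumptions not stated in this message; please check the samples in hocr-tools for the actual conventions. The helper names parse_bbox, LineExtractor, and extract_lines are hypothetical as well.

```python
from html.parser import HTMLParser

# Invented sample. The class names and the "bbox" title convention are
# assumptions made for this sketch, not taken from the message above.
SAMPLE = """<html><body>
<div class="ocr_page" title="bbox 0 0 1000 1400">
<span class="ocr_line" title="bbox 100 120 900 160">Hello OCR world</span>
<span class="ocr_line" title="bbox 100 180 700 220; baseline 0 -4">second line</span>
</div>
</body></html>"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title attribute, or None if absent."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class LineExtractor(HTMLParser):
    """Collects (bbox, text) pairs for every span tagged as an OCR line.

    Simplification: spans nested inside a line are not handled; the first
    closing </span> ends the current line.
    """

    def __init__(self):
        super().__init__()
        self.lines = []
        self._in_line = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "span" and attributes.get("class") == "ocr_line":
            self._in_line = True
            self._bbox = parse_bbox(attributes.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_line:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_line:
            self.lines.append((self._bbox, "".join(self._text).strip()))
            self._in_line = False


def extract_lines(html):
    """Parse an HTML string and return its OCR lines as (bbox, text) pairs."""
    parser = LineExtractor()
    parser.feed(html)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE):
        print(bbox, text)
```

Since the coordinates sit in attributes rather than in the visible text, a browser renders the sample as plain text, and an editor that leaves the attributes alone keeps the layout information intact.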
_______________________________________________
microformats-new mailing list
http://microformats.org/mailman/listinfo/microformats-new
