Hi all,

as I wrote earlier, we have been working on creating searchable PDFs from Cuneiform (or other) OCR results.
ExactImage 0.6(.0) now comes with a revamped PDF writer and an hocr2pdf front-end. Together with a patch to Cuneiform that annotates each recognized glyph with an hOCR-like bounding box, this allows the creation of quite exactly positioned, searchable PDF files.

ExactImage:

  http://www.exactcode.de/site/open_source/exactimage/

Cuneiform annotated HTML patch (includes the already committed <>& fix), which is not yet conditional. For merging, it should probably only output the additional formatting when asked to by some additional command line switch, e.g. --hocr instead of --html or so, but that probably requires changing some 20+ files to pass the information down to the point where the HTML is written:

  http://t2-project.org/packages/cuneiform.html
  http://svn.exactcode.de/t2/trunk/package/graphic/cuneiform/html-hocr.patch

ExactImage hocr2pdf page with some basic information:

  http://www.exactcode.de/site/open_source/exactimage/hocr2pdf/

Basically, hocr2pdf accepts the annotated HTML input from STDIN (we could also add a -h/--html option to read it from a file) and the image from the filename passed to -i/--input. The resulting PDF filename is specified with -o/--output. Additionally, -s/--sloppy-text groups the words on a line, which sometimes improves search and cut'n'paste results with older PDF viewers, and -n/--no-image skips the image shadowing the text, either to save storage space or to take a look at how exactly the glyphs are positioned.

Have fun, patches and inspiration welcome,

  René
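
For anyone who wants to script this, the hocr2pdf invocation described above could be wrapped roughly as follows. This is only a sketch under a few assumptions: hocr2pdf is installed and on the PATH, only the short options mentioned above (-i, -o, -s, -n) are used, and the function names and file names are made up for illustration and are not part of ExactImage.

```python
import subprocess


def build_hocr2pdf_command(image, output_pdf, sloppy_text=False, no_image=False):
    """Build the argument list for hocr2pdf.

    Hypothetical helper: only the options named in the announcement are used.
    """
    cmd = ["hocr2pdf", "-i", str(image), "-o", str(output_pdf)]
    if sloppy_text:
        cmd.append("-s")  # group the words on a line
    if no_image:
        cmd.append("-n")  # leave out the image shadowing the text
    return cmd


def run_hocr2pdf(hocr_file, image, output_pdf, **options):
    """Feed the annotated HTML to hocr2pdf on STDIN (hypothetical helper)."""
    cmd = build_hocr2pdf_command(image, output_pdf, **options)
    with open(hocr_file, "rb") as hocr:
        # check=True raises CalledProcessError if hocr2pdf exits with an error
        subprocess.run(cmd, stdin=hocr, check=True)
```

With a scanned page.tif and the annotated HTML in page.html (both names are placeholders), run_hocr2pdf("page.html", "page.tif", "page.pdf", sloppy_text=True) would then produce page.pdf.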

