Just a remark: in 2008 Mihail Radu Solcan posted two articles [1], [2] about adding text to DjVu files. I am not sure whether similar possibilities/tools exist for PDF. Anyway, he used a box file for this task (hOCR was not available).
You did not specify a language, but in the case of Python have a look at OCRFeeder: it should be able to produce such PDFs [3] with reportlab...

Zdenko

[1] http://www.ub-filosofie.ro/~solcan/wt/gnu/d/hdjv.html
[2] http://www.ub-filosofie.ro/~solcan/wt/gnu/d/odjv.html
[3] http://www.joaquimrocha.com/2011/08/05/ocrfeeder-0-7-6-and-desktopsummit-2011/

On Tue, Nov 29, 2011 at 10:42 PM, Carlos <[email protected]> wrote:
> Tesseract 3.01
> hocr2pdf 0.8.5
>
> My project has been using Tesseract to OCR documents for some time and
> we are really happy with the results.
>
> We have been recently asked to offer the documents in our system as
> searchable PDFs.
>
> My initial attempt has been to create a searchable PDF using the hocr
> output generated by tesseract with hocr2pdf
> (http://www.exactcode.de/site/open_source/exactimage/hocr2pdf/).
>
> The placement of the text in the resulting PDF has some strange
> quirks: words overlaying one another, words with oversized fonts,
> strange line breaks etc. The problems are so stark that our current
> results are not sufficient for a viable solution.
>
> I don't know very much about the hocr format, however "overlaying"
> words doesn't seem to be caused by Tesseract's hocr output. I have
> verified a number of times that over-laid words in the searchable PDF
> have bbox coordinates in the hocr file that do not overlap at all.
>
> - Does anyone have experience generating searchable PDFs using
> tesseract output?
> - Does anyone know of a simple way to visually inspect the placement
> of the words specified by the hocr output - for instance, creating a
> tiff from the hocr output. I would like to confirm that the tesseract
> hocr output is correctly positioning the words.
>
> Sorry if this issue doesn't relate exclusively to tesseract ... at
> this point I am not certain what the cause of the problem is.
>
> Carlos
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
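
On the question of visually inspecting the word placement given by the hocr output: below is a minimal sketch in Python, standard library only. It pulls the word boxes out of an hocr file, lists any pairs that overlap, and writes an SVG overlay that can be opened in a browser. It assumes that words are `span` elements with class `ocr_word` or `ocrx_word` and a `title` containing `bbox x0 y0 x1 y1`; the exact markup may differ between Tesseract versions. The names `HocrWordParser`, `parse_hocr_words`, `find_overlaps` and `words_to_svg` are made up for this illustration and are not part of Tesseract or hocr2pdf.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Assumption: word boxes appear in the title attribute as "bbox x0 y0 x1 y1"
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) for every word-level span that carries a bbox."""

    def __init__(self):
        super().__init__()
        self.words = []   # finished (text, (x0, y0, x1, y1)) tuples
        self._stack = []  # one entry per open <span>: None or [bbox, text_parts]

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        is_word = "ocr_word" in classes or "ocrx_word" in classes
        if is_word and match:
            self._stack.append([tuple(int(g) for g in match.groups()), []])
        else:
            self._stack.append(None)

    def handle_data(self, data):
        # Attach text to the innermost open word span, if there is one
        for entry in reversed(self._stack):
            if entry is not None:
                entry[1].append(data)
                break

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        entry = self._stack.pop()
        if entry is not None:
            text = "".join(entry[1]).strip()
            if text:
                self.words.append((text, entry[0]))


def parse_hocr_words(hocr_text):
    """Return a list of (text, (x0, y0, x1, y1)) in document order."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def boxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) boxes share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def find_overlaps(words):
    """Return (text1, text2) for every pair of words whose boxes overlap."""
    return [
        (t1, t2)
        for i, (t1, b1) in enumerate(words)
        for t2, b2 in words[i + 1:]
        if boxes_overlap(b1, b2)
    ]


def words_to_svg(words, width, height):
    """Draw each box in red with its text inside; hocr and SVG both use a top-left origin."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for text, (x0, y0, x1, y1) in words:
        parts.append(
            f'<rect x="{x0}" y="{y0}" width="{x1 - x0}" height="{y1 - y0}" '
            f'fill="none" stroke="red"/>'
        )
        parts.append(f'<text x="{x0}" y="{y1}" font-size="{y1 - y0}">{escape(text)}</text>')
    parts.append("</svg>")
    return "\n".join(parts)
```

To use it, read the hocr file into a string, call `parse_hocr_words` on it, and write the result of `words_to_svg(words, page_width, page_height)` to a `.svg` file, where the width and height are the pixel size of the scanned page. If `find_overlaps` returns an empty list and the boxes look right in the SVG, the problem is more likely in the PDF generation step than in the hocr output.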

