Just a remark: in 2008 Mihail Radu Solcan posted two articles [1],
[2] about adding text to DjVu files. I am not sure whether similar
possibilities/tools exist for PDF. In any case, he used a box file for
this task (hOCR was not available).

You did not specify a language, but in case of Python have a look
at OCRFeeder: it should be able to produce PDFs [3], using reportlab...
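
If you want to check the hOCR positions yourself before blaming either
tool, here is a minimal sketch in Python (standard library only). It
assumes the word spans carry "bbox x0 y0 x1 y1" in their title attribute,
with class ocr_word or ocrx_word, and that you know the scan resolution.
All names (HocrWords, parse_hocr, overlapping_pairs, to_pdf_box) are made
up for illustration; this is not how hocr2pdf or OCRFeeder do it.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WORD_CLASSES = {"ocr_word", "ocrx_word"}  # assumed word-level hOCR classes


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every word span that has a bbox."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span we are currently inside
        self._depth = 0     # nesting depth of <span> inside that word
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            self._depth += 1          # nested span inside a word
            return
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        match = BBOX_RE.search(attr.get("title") or "")
        if match and classes & WORD_CLASSES:
            self._bbox = tuple(int(g) for g in match.groups())
            self._depth = 1
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(), self._bbox))
                self._bbox = None


def parse_hocr(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words


def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_pairs(words):
    """Return the text of every pair of words whose boxes overlap."""
    return [(w1[0], w2[0])
            for i, w1 in enumerate(words)
            for w2 in words[i + 1:]
            if overlaps(w1[1], w2[1])]


def to_pdf_box(bbox, page_height_px, dpi):
    """Convert a pixel bbox (origin top-left) to PDF points (origin
    bottom-left). Returns (x, y, width, height), y being the bottom edge."""
    scale = 72.0 / dpi
    x0, y0, x1, y1 = bbox
    return (x0 * scale, (page_height_px - y1) * scale,
            (x1 - x0) * scale, (y1 - y0) * scale)
```

If overlapping_pairs() comes back empty for a page, the overlaps in the
PDF are introduced later, when the text is placed. The output of
to_pdf_box() could then be fed to a PDF writer such as reportlab (for
example its canvas drawString), but how to choose the font size from the
box height is left open here.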

Zdenko

[1] http://www.ub-filosofie.ro/~solcan/wt/gnu/d/hdjv.html
[2] http://www.ub-filosofie.ro/~solcan/wt/gnu/d/odjv.html
[3] http://www.joaquimrocha.com/2011/08/05/ocrfeeder-0-7-6-and-desktopsummit-2011/



On Tue, Nov 29, 2011 at 10:42 PM, Carlos <[email protected]> wrote:

> Tesseract 3.01
> hocr2pdf 0.8.5
>
> My project has been using Tesseract to OCR documents for some time and
> we are really happy with the results.
>
> We have been recently asked to offer the documents in our system as
> searchable PDFs.
>
> My initial attempt has been to create a searchable PDF using the hocr
> output generated by tesseract with hocr2pdf
> (http://www.exactcode.de/site/open_source/exactimage/hocr2pdf/).
>
> The placement of the text in the resulting PDF has some strange
> quirks: words overlaying one another, words with oversized fonts,
> strange line breaks etc.  The problems are so stark that our current
> results are not sufficient for a viable solution.
>
> I don't know very much about the hocr format, however "overlaying"
> words don't seem to be caused by Tesseract's hocr output.  I have
> verified a number of times that over-laid words in the searchable PDF
> have bbox coordinates in the hocr file that do not overlap at all.
>
> - Does anyone have experience generating searchable PDFs using
> tesseract output?
> - Does anyone know of a simple way to visually inspect the placement
> of the words specified by the hocr output - for instance, creating a
> tiff from the hocr output?  I would like to confirm that the tesseract
> hocr output is correctly positioning the words.
>
> Sorry if this issue doesn't relate exclusively to tesseract ... at
> this point I am not certain what the cause of the problem is.
>
> Carlos
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
