Re: using tesseract hocr output to create a searchable PDF

Guido Thu, 27 Sep 2012 20:15:52 -0700

Did you find any other solution than using tesseract and pdfbeads? What are 
your experiences so far?


I am currently looking for a solution to a similar problem: when I scan 
documents at university, the only option is to save them as PDF. Afterwards, 
at home, I'd like to convert those files into searchable PDF files. I've 
successfully installed tesseract and pdfbeads on my computer and would like 
to use them for this. Are tesseract and pdfbeads a good choice for that job?

Best Regards,
Guido


On Friday, 2 December 2011 20:52:11 UTC+1, Carlos wrote:
>
> zdenko,
>
> Thanks for the reply.
>
> > You did not specify a language, but in case of python
>
> I am pretty agnostic about language as long as it can run via the CLI
> on linux - the OCR process is on the backend.
>
> In case anyone else runs across this:
>
> I am an OCR noob so the past few days have been pretty enlightening.
> I have run across a number of other options to marry hOCR w/ an image
> to generate searchable PDFs.  Unfortunately, hocr2pdf is one of the
> most prominent ones.  It shows up pretty high on a lot of searches and
> is mentioned in various forums/blogs etc.  I have found that hocr2pdf
> generates fairly unusable searchable PDFs - the searchable text is
> interleaved and really out of position.
>
> Luckily, there are a number of other options in various languages.
> The first OSS tool that I found to generate very usable searchable
> PDFs from tesseract hOCR files has been pdfbeads - a ruby
> gem.  It has worked well with a diverse sample of documents.
>
> At this time my primary concern with pdfbeads is that it is a pretty
> niche library and it encapsulates all of the logic to generate the PDF
> file.  pdfbeads doesn't rely on other more heavily used/vetted/current
> PDF generation libs to generate the PDF.  It would have been a little
> more comforting if pdfbeads concentrated on parsing the hOCR files and
> adding the text layer via another lib ... assuming that is possible.
>
> If this holds up I suspect that we are going to slot this into our OCR
> process.
>
> Carlos
>
>
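
For anyone curious what "concentrating on parsing the hOCR files" might
look like in practice, here is a minimal sketch in Python using only the
standard library. It is not how pdfbeads works internally. It assumes the
hOCR marks each word as a span with class "ocrx_word" and carries the box
in the title attribute as "bbox x0 y0 x1 y1", which is the usual hOCR
convention. The names HocrWordParser and parse_hocr_words are made up for
this example, and the step of writing the text layer into a PDF is left
out on purpose, since that would be the job of whatever PDF library you
choose.

```python
import re
from html.parser import HTMLParser

# hOCR stores the box in the title attribute, e.g. "bbox 36 92 96 116; x_wconf 95"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Hypothetical helper: collect (text, (x0, y0, x1, y1)) per ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # box of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (word, bbox) tuples in document order."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hand-written sample, not real tesseract output.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 640 480'>
<span class='ocr_line' title='bbox 36 92 300 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 91'>world</span>
</span></div>"""

if __name__ == "__main__":
    for word, bbox in parse_hocr_words(SAMPLE):
        print(word, bbox)
```

The boxes come out in image pixel coordinates with the origin at the top
left. PDF coordinates normally start at the bottom left and are measured
in points, so a real text layer would still need a flip and a scale based
on the page height and scan resolution.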

