Re: [PLUG] PDF-1.5 docs not searchable

Jason Barnett Sun, 25 Jul 2021 00:00:54 -0700

Rich,
I believe you mentioned Master PDF Editor. I think it has OCR built in, or
supports it through a plugin. If needed, Tesseract is a good OCR tool and is
likely in your distro's repository.
https://en.wikipedia.org/wiki/Tesseract_(software)
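
A minimal sketch of how the check and the OCR step might be scripted, assuming
Poppler's pdftotext and the tesseract binary are installed. The function names
and file names below are hypothetical, chosen only for illustration. The idea:
if pdftotext extracts nothing but whitespace, the PDF probably holds only
images; Tesseract's "pdf" output config then writes a PDF containing the page
image plus an invisible text layer.

```python
import subprocess
from typing import List


def has_text_layer(extracted_text: str) -> bool:
    """Return True if the extracted text has any non-whitespace character.

    pdftotext separates pages with form feeds, which count as whitespace
    here, so an image-only PDF yields False.
    """
    return any(not ch.isspace() for ch in extracted_text)


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path; the '-' argument sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def tesseract_pdf_command(image_path: str, output_base: str,
                          lang: str = "eng") -> List[str]:
    """Build a Tesseract command that writes <output_base>.pdf.

    The trailing 'pdf' config asks Tesseract for a searchable PDF
    (page image plus invisible text) instead of a plain text file.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]
```

For example, has_text_layer(extract_text("report.pdf")) returning False would
suggest a scanned document. The page images would have to be extracted from
the PDF first; the list returned by tesseract_pdf_command("page.png", "page")
can then be passed to subprocess.run.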


Jason

On Sat, Jul 24, 2021 at 11:08 AM Tomas Kuchta <[email protected]>
wrote:

> They are not searchable because they do not contain text to search.
> Typically, they contain only images.
>
> The way I deal with it: I OCR the image, generate a text document, and place
> that text into a layer under the image in the output PDF.
>
> Having the text under the image layer preserves the original look of the
> PDF while allowing for search and select.
>
> I have seen PDFs with the text over the image, obscuring it, as well as
> various attempts at making the text over the image invisible.
>
> Of course, OCR is not perfect, and neither is keeping the text in the exact
> position under the image. It mostly works for text, not so much for data
> extraction from tables, etc.
>
> I do not believe that there is an OK-ish free SW solution to this. I use
> commercial SW to do that. It works, but I cannot publicly recommend it due
> to the vendor's nasty commercial behavior: no respect for privacy, no sale,
> just licensing with built-in obsolescence.
>
> Tomas
>
> On Fri, Jul 23, 2021, 16:18 Rich Shepard <[email protected]> wrote:
>
> > I've encountered a few PDF-1.5 docs that are not searchable using xpdf,
> > mupdf, okular, or MasterPDFEditor. Perhaps they're scanned and I don't
> > know how to determine if they are.
> >
> > My web searches found nothing relevant; my search terms might be
> > inefficient.
> >
> > Has anyone else experienced this?
> >
> > Rich
> >
> >
> >
>
