Re: [CODE4LIB] Extracting words out of a .pdf

Joe Hourclé Sun, 28 May 2023 14:55:52 -0700

> On May 28, 2023, at 5:03 PM, Magnus Berg <[email protected]> wrote:
> 
> Hi Charles,
> 
> Is the PDF you're trying to extract text from a scanned document? If so,
> you likely can't highlight the text because it's technically an image. You
> can apply Optical Character Recognition (OCR) to rectify this. GIMP doesn't
> have OCR capabilities, though there are a few plugins floating around on
> Github. If you don't have the paid version of Acrobat, you can look into
> other OCR software options. Here is a list of projects
> <https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html>
> that use the Tesseract engine, many of which are simple drag and drop
> solutions.


And as there are many different ways to create PDFs, you can end up with some 
really weird results even when you *can* copy and paste the text.

I don’t remember which files I was dealing with, but each word in them was 
stored as a separate text box… so when you copied and pasted, you got the 
text, but minus any spaces between words.

I think that I ended up printing the documents, scanning them back in, and 
OCRing it all.

(These days I’d probably go through some of the various PDF libraries and try 
doing it in software.)
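
For what it's worth, a minimal sketch of the software route might look like the following. It assumes you have already pulled the word boxes and their page coordinates out of the PDF with whatever library you prefer; the WordBox type, the rebuild_lines function, and the line_tolerance value are all made up for illustration, not taken from any particular library.

```python
from typing import NamedTuple


class WordBox(NamedTuple):
    """One extracted text box (hypothetical shape, not a real library type)."""
    text: str
    x0: float   # left edge of the box
    top: float  # distance from the top of the page


def rebuild_lines(boxes, line_tolerance=3.0):
    """Group word boxes into lines by vertical position, then join with spaces.

    line_tolerance is an assumed fudge factor: boxes whose `top` values are
    within this distance of the first box on a line count as the same line.
    """
    lines = []  # list of (top_of_first_box, [boxes on that line])
    for box in sorted(boxes, key=lambda b: (b.top, b.x0)):
        if lines and abs(box.top - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(box)
        else:
            lines.append((box.top, [box]))
    # Within each line, order the words left to right and put the spaces back.
    return "\n".join(
        " ".join(b.text for b in sorted(row, key=lambda b: b.x0))
        for _, row in lines
    )


boxes = [
    WordBox("world", 50, 100.5),
    WordBox("Hello", 10, 100),
    WordBox("Second", 10, 120),
    WordBox("line", 55, 119.5),
]
text = rebuild_lines(boxes)
```

This only handles simple single-column pages; multi-column layouts or rotated text would need more than sorting by position.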

Oh… and when you’re doing batch OCR conversion of all of your scans, make sure 
that you don’t tell it to overwrite the files.  I accidentally missed turning 
that setting off once, and I didn’t realize it was also set to not save the 
image.  The OCR was absolute crap because the images weren’t high enough 
contrast, so I had to spend many, many hours re-scanning everything.

Or maybe back up all of your scans before OCR.  Storage is cheap these days.
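
A rough sketch of that backup step, using only the Python standard library. The backup_scans name and the directory arguments are placeholders, and it assumes the backup directory sits outside the scan directory.

```python
import shutil
from pathlib import Path


def backup_scans(scan_dir, backup_dir):
    """Copy every file under scan_dir into backup_dir before running OCR.

    Existing backups are never overwritten, since the earlier copy may be
    the only good one. Assumes backup_dir is not inside scan_dir.
    """
    scan_dir, backup_dir = Path(scan_dir), Path(backup_dir)
    copied = []
    for src in sorted(scan_dir.rglob("*")):
        if not src.is_file():
            continue
        dest = backup_dir / src.relative_to(scan_dir)
        if dest.exists():
            continue  # keep the earlier backup untouched
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return copied
```

Run it once before kicking off the batch job; running it again afterwards is harmless, because files that already have a backup are skipped.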

-Joe
