Re: How does PDFBox extract text from a PDF?

Craig Ringer Tue, 10 Jul 2012 16:34:58 -0700

On 07/10/2012 10:10 PM, Jeremias Maerki wrote:

On 10.07.2012 15:36:02 Jochen Hebbrecht wrote:

My first question is: how is text stored in a PDF? I think there are 2 ways
to store text in a PDF:
a) vector PDF: the PDF contains a line telling it to print a word in a
specific font on a specific location

There are actually two cases here:

(1) PDF text operators (BT, ET, Tj), used to convert (strings) etc. to text using a font; or

(2) Vector line drawing using bezier curves, etc to represent glyphs.

The former can be extracted by fop. The latter, which is common in desktop publishing, needs OCR or special vector-to-font matching analysis and AFAIK cannot be processed by fop.
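To make the difference between the two cases concrete, here is a minimal sketch in plain Java (no PDFBox involved) that pulls the string operands of Tj out of an already-decompressed content stream. The class and method names are hypothetical, and it is heavily simplified: it assumes literal strings without escapes or nested parentheses, ignores TJ arrays and hex strings, and does not map character codes through the font's encoding, which a real extractor has to do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not how PDFBox or fop implement extraction.
class TjSketch {
    // A BT ... ET text object.
    private static final Pattern TEXT_OBJECT =
            Pattern.compile("\\bBT\\b(.*?)\\bET\\b", Pattern.DOTALL);

    // A literal string operand followed by Tj, e.g. "(Hello) Tj".
    // Assumption: no escapes or nested parentheses inside the string.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj\\b");

    /** Returns the raw string operands shown by Tj inside text objects. */
    static List<String> shownStrings(String contentStream) {
        List<String> result = new ArrayList<>();
        Matcher block = TEXT_OBJECT.matcher(contentStream);
        while (block.find()) {
            Matcher show = SHOW_TEXT.matcher(block.group(1));
            while (show.find()) {
                result.add(show.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Case (1): text operators. Case (2): a Bezier path (m, c, S) with no text in it.
        String stream = "BT /F1 12 Tf 72 712 Td (Hello) Tj 0 -14 Td (world) Tj ET\n"
                + "0 0 m 10 10 20 10 30 0 c S";
        System.out.println(shownStrings(stream)); // prints [Hello, world]
    }
}
```

The path on the second line of the sample stream yields nothing, which is why glyphs drawn as outlines are invisible to this kind of extraction.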

There is another location where a PDF can carry text but that's not supported by PDFBox, AFAIK: the "ActualText" entries of tagged PDFs can contain text of artifacts on a page (e.g. an image). That's used for enabling visually impaired people to read certain documents.

It's also generally an unmangled, linebreak-free, column-free version of the text, which can be a real bonus. When it's there, that is, and when it's correct, because of course there are tools out there that generate ActualText entries full of invalid garbage, or empty ActualText entries.
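As a rough illustration of what such entries can look like when they sit in a marked-content property list, here is a small plain-Java sketch that collects ActualText values from a decompressed content stream and skips the empty ones. The names are hypothetical and the assumptions are strong: literal strings only, no escapes, no hex or UTF-16 strings, and no handling of ActualText stored on structure elements in the structure tree rather than inline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; a real tool would parse the PDF objects properly.
class ActualTextSketch {
    // Matches "/ActualText (some text)" inside an inline property dictionary.
    // Assumption: literal string without escapes or nested parentheses.
    private static final Pattern ACTUAL_TEXT =
            Pattern.compile("/ActualText\\s*\\(([^()\\\\]*)\\)");

    /** Returns non-empty ActualText values in the order they appear. */
    static List<String> actualTexts(String contentStream) {
        List<String> result = new ArrayList<>();
        Matcher m = ACTUAL_TEXT.matcher(contentStream);
        while (m.find()) {
            String text = m.group(1).trim();
            if (!text.isEmpty()) { // drop the empty entries some tools emit
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String stream = "/Span <</ActualText (Company logo)>> BDC /Im1 Do EMC\n"
                + "/Span <</ActualText ()>> BDC /Im2 Do EMC";
        System.out.println(actualTexts(stream)); // prints [Company logo]
    }
}
```

Filtering out empty values is easy; detecting entries that are present but wrong is not, so treat whatever comes back as a hint rather than ground truth.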


--
Craig Ringer
