Re: How does PDFBox extract text from a PDF?

Jeremias Maerki Tue, 10 Jul 2012 07:11:24 -0700

On 10.07.2012 15:36:02 Jochen Hebbrecht wrote:
> My first question is: how is text stored in a PDF? I think there are 2 ways
> to store text in a PDF:
> a) vector PDF: the PDF contains a line telling it to print a word in a
> specific font on a specific location


That's the usual case, yes.

> b) OCR text has been added to the image as an extra layer (I think this is
> called, the XMP metadata)

No, actually OCR software usually just adds white-on-white text
behind the bitmap. Technically, that is the same as your a).

XMP Metadata is really just for metadata, not actual text content.

> Is this information correct?
> 
> So, if PDFBox wants to extract text from a PDF, how does it extract the
> data? Is it looking at the XMP metadata? Or the vector details?
> Any developer wanting to help me on this issue?

PDFBox interprets the text painting operators (as if it were painting
the PDF), looks up the actual character for a code point (character "a"
might be at code point 7 (or whatever) when a subset CID font is used,
for example) and emits that as Unicode text. Well, that's simplified.
There are some additional heuristics for things like placement and order
of text, but those don't really affect the actual process of extracting
text.
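
To make the lookup step a bit more concrete, here is a toy sketch in Java.
It is not PDFBox code and does not use the PDFBox API: the class name
ToyTextDecoder, the decode method and the code values 3, 7 and 12 are all
made up for illustration. It only shows the idea that the codes found in a
text-showing operator are meaningless until they are mapped through the
font's own code-to-Unicode table.

```java
import java.util.HashMap;
import java.util.Map;

public class ToyTextDecoder {

    // Maps each character code to its Unicode string using a per-font table.
    // Codes missing from the table become U+FFFD (the replacement character).
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical table for a subset font: code 7 stands for "a", etc.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(3, "c");
        toUnicode.put(7, "a");
        toUnicode.put(12, "t");

        // Codes as they might appear in a text-showing operator.
        int[] shown = {3, 7, 12};
        System.out.println(decode(shown, toUnicode)); // prints "cat"
    }
}
```

A real extractor additionally has to track the current font, the text
position and the reading order, which is where the heuristics mentioned
above come in.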

There is another location where a PDF can carry text but that's not
supported by PDFBox, AFAIK: the "ActualText" entries of tagged PDFs can
contain text of artifacts on a page (e.g. an image). That's used for
enabling visually impaired people to read certain documents.

I guess the question is: what are you trying to do? Do you have a
problem you're trying to solve?

If you want to learn about how text is put into a PDF, run PDFBox's
PDFDebugger and open a random PDF. That allows you to explore all the
details of a PDF. Quite enlightening if you don't know the PDF
specification by heart.

Jeremias Maerki
