RE: How does PDFBox extract text from a PDF?

Jochen Hebbrecht Tue, 10 Jul 2012 10:41:51 -0700

Hi Jeremias,

No, I'm not having any trouble at all :-). I'm just curious about how PDFBox
works, and about how Adobe designed its PDF format.
This page
(http://en.wikipedia.org/wiki/Portable_Document_Format#Adobe.27s_versions)
lists all previous (and current) versions of the PDF format. Can any
of these versions support the text layer? What does Adobe call this "extra text
layer"? Wikipedia gives no technical details about this "text layer".

Can we detect with PDFBox whether an image has already been OCR'd? Or do we
just try to get the contents and, if the contents are null, run the file
through some kind of OCR engine?
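
The "try to get the contents, otherwise OCR" fallback asked about above could be sketched roughly as follows. This is plain Java and only an illustration: the class name `TextOrOcrSketch` and the two hooks `extractText` and `runOcr` are hypothetical stand-ins, not PDFBox or OCR-engine APIs. It also assumes that an extractor may hand back an empty or whitespace-only string instead of null for an image-only file.

```java
import java.util.function.Function;

// Illustrative only: both hooks are hypothetical stand-ins,
// not PDFBox or OCR-engine APIs.
class TextOrOcrSketch {

    // Returns the embedded text if there is any, otherwise falls back to OCR.
    static String textFor(String pdfPath,
                          Function<String, String> extractText,
                          Function<String, String> runOcr) {
        String text = extractText.apply(pdfPath);
        // Assumption: "no text" may show up as null, empty, or whitespace only.
        if (text == null || text.trim().isEmpty()) {
            return runOcr.apply(pdfPath);
        }
        return text;
    }

    public static void main(String[] args) {
        // A fake extractor that finds only whitespace, and a fake OCR engine.
        String result = textFor("scan.pdf", p -> "  ", p -> "text from OCR");
        System.out.println(result); // prints "text from OCR"
    }
}
```

Note that this only checks whether any text comes back; it says nothing about whether that text came from an earlier OCR pass.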

And what happens if we OCR a PDF which was already OCR'd? Do we get
an extra "text layer"? So 1 image, 1 layer from the first OCR and 1 layer from
the second OCR?

Jochen

-----Original Message-----
From: Jeremias Maerki [mailto:[email protected]] 
Sent: Tuesday, 10 July 2012 16:11
To: [email protected]
Subject: Re: How does PDFBox extract text from a PDF?

On 10.07.2012 15:36:02 Jochen Hebbrecht wrote:
> My first question is: how is text stored in a PDF? I think there are 2 
> ways to store text in a PDF:
> a) vector PDF: the PDF contains a line telling it to print a word in a 
> specific font on a specific location

That's the usual case, yes.

> b) OCR text has been added to the image as an extra layer (I think 
> this is called, the XMP metadata)

No, actually OCR software usually just adds white-on-white text behind
the bitmap. Technically, this is the same as your a).

XMP Metadata is really just for metadata, not actual text content.

> Is this information correct?
> 
> So, if PDFBox wants to extract text from a PDF, how does it extract 
> the data? Is it looking at the XMP metadata? Or the vector details?
> Any developer wanting to help me on this issue?

PDFBox interprets the text painting operators (as if it were painting the
PDF), looks up the actual character for each code point (the character "a"
might be at code point 7, or whatever, when a subset CID font is used, for
example) and emits that as Unicode text. Well, that's simplified.
There are some additional heuristics for things like placement and order of
text, but those don't really affect the actual process of extracting text.
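
As a rough illustration of the lookup step described above, here is a minimal sketch in plain Java. It does not use PDFBox's API: the class name `GlyphDecoderSketch`, the table contents and the code values are all made up, standing in for the kind of code-to-Unicode table a subset font might carry.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: not PDFBox code. Models a subset font whose character
// codes have no relation to Unicode and must be mapped through a table.
class GlyphDecoderSketch {

    // Hypothetical code-to-Unicode table for one subset font.
    private final Map<Integer, String> toUnicode = new HashMap<>();

    // Registers one mapping and returns this object so calls can be chained.
    GlyphDecoderSketch map(int code, String unicode) {
        toUnicode.put(code, unicode);
        return this;
    }

    // Decodes the character codes of one text-showing operation.
    // Codes without a mapping become U+FFFD (the replacement character).
    String decode(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        GlyphDecoderSketch font = new GlyphDecoderSketch()
                .map(7, "a").map(3, "b").map(12, "c");
        System.out.println(font.decode(new int[] {3, 7, 12})); // prints "bac"
    }
}
```

The point of the sketch is only that the bytes in the content stream are font-specific codes, so extraction depends on the font providing a usable mapping.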

There is another location where a PDF can carry text, but that's not
supported by PDFBox, AFAIK: the "ActualText" entries of tagged PDFs can
contain the text of artifacts on a page (e.g. an image). That's used to enable
visually impaired people to read certain documents.

I guess the question is: what are you trying to do? Do you have a problem
you're trying to solve?

If you want to learn about how text is put into a PDF, run PDFBox's
PDFDebugger and open a random PDF. That allows you to explore all the
details of a PDF. Quite enlightening if you don't know the PDF specification
by heart.

Jeremias Maerki
