Fwd: How does PDFBox extract text from a PDF?

Jochen Hebbrecht Tue, 10 Jul 2012 06:36:43 -0700

My first question is: how is text stored in a PDF? I think there are two ways
to store text in a PDF:
a) vector PDF: the PDF contains an instruction to draw a word in a
specific font at a specific location
b) scanned PDF: OCR text has been added to the image as an extra layer (I
think this is called the XMP metadata)
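
To make option (a) concrete, here is a minimal sketch of what I mean. As far as I understand it, a page's content stream uses operators such as BT/ET (begin/end text), Tf (select font and size), Td (move the text position) and Tj (show a string). The stream fragment and the `extract_runs` helper below are made up for illustration; this is not how PDFBox itself works, and real streams are usually compressed and use font-specific string encodings that this sketch ignores.

```python
import re

# Hypothetical, simplified fragment of an uncompressed page content stream.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 712 Td
(Hello) Tj
0 -14 Td
(World) Tj
ET
"""

# Matches only the three operators used above: Tj, Tf and Td.
TOKEN = re.compile(
    rb"\((?P<text>[^()\\]*)\)\s*Tj"
    rb"|(?P<font>/\w+)\s+(?P<size>[\d.]+)\s+Tf"
    rb"|(?P<dx>-?[\d.]+)\s+(?P<dy>-?[\d.]+)\s+Td"
)


def extract_runs(stream):
    """Return (text, font, size, x, y) for every Tj operator in the stream."""
    font, size = None, None
    x = y = 0.0
    runs = []
    for m in TOKEN.finditer(stream):
        if m.group("font") is not None:
            # Tf: remember the current font resource name and size.
            font = m.group("font").decode("ascii")
            size = float(m.group("size"))
        elif m.group("dx") is not None:
            # Td: offsets are relative to the start of the current line.
            x += float(m.group("dx"))
            y += float(m.group("dy"))
        else:
            # Tj: record the string with the current font and position.
            runs.append((m.group("text").decode("latin-1"), font, size, x, y))
    return runs


for run in extract_runs(CONTENT_STREAM):
    print(run)
```

So in this case the word, the font and the position are all spelled out in the page itself, which is what I would expect a text extractor to read.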


Is this information correct?

So, if PDFBox wants to extract text from a PDF, how does it extract the
data? Does it look at the XMP metadata, or at the vector details?
Is any developer willing to help me with this?
