RE: PDFBox - Does the PDF file version matter

Michael Kuß Tue, 04 Feb 2014 05:21:50 -0800

Hi Karen,

First, the PDF format is not designed for getting text back out. It is not an
editable format like plain text or Word; it is focused on displaying content.
Text in a PDF file is like a cloud of points scattered over white space. You
have to put the characters (if available) in the correct order and insert
spaces where needed. PDFBox does this to some extent.
But what you see as text in, e.g., Acrobat Reader is not necessarily "text"; it
can also be a graphic.
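
To make the "cloud of points" idea concrete, here is a minimal sketch in plain
Java of the kind of reordering an extractor has to do. It is not PDFBox code:
the Glyph class, the toText method and both thresholds are hypothetical names
made up for this illustration, and real extractors use more elaborate
heuristics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphOrdering {

    /** One positioned character (hypothetical type, arbitrary page units). */
    static final class Glyph {
        final char ch;
        final double x;      // left edge of the character
        final double y;      // baseline; grows downwards in this sketch
        final double width;  // advance width of the character

        Glyph(char ch, double x, double y, double width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Rebuilds text from unordered glyphs. Glyphs whose baseline lies within
     * lineTolerance of the first glyph of a line belong to that line; a
     * horizontal gap wider than spaceGap becomes a space.
     */
    static String toText(List<Glyph> glyphs, double lineTolerance, double spaceGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        StringBuilder out = new StringBuilder();
        List<Glyph> line = new ArrayList<>();
        for (Glyph g : sorted) {
            // A jump in the baseline closes the current line.
            if (!line.isEmpty() && g.y - line.get(0).y > lineTolerance) {
                appendLine(out, line, spaceGap);
                out.append('\n');
                line.clear();
            }
            line.add(g);
        }
        appendLine(out, line, spaceGap);
        return out.toString();
    }

    /** Orders one line from left to right and inserts spaces at wide gaps. */
    private static void appendLine(StringBuilder out, List<Glyph> line, double spaceGap) {
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null && g.x - (prev.x + prev.width) > spaceGap) {
                out.append(' ');
            }
            out.append(g.ch);
            prev = g;
        }
    }
}
```

Note that the result depends on the chosen spaceGap: with a threshold that is
too large the words run together, with one that is too small spaces appear
inside words. That is one reason why extracted text can come out with or
without spaces.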


So, to your problem. Different PDF converters handle the positioning of text
during conversion in different ways.
Some produce just a graphic that represents the printed result of, e.g., a
Word document as a PDF file.
Some produce a PDF with text included. This text may come with or without
spaces, and it may or may not be correctly positioned.
Converters mostly try to produce an accurate representation from a layout point
of view. Their focus is not on getting content back out of the PDF file; PDF is
not designed for that.
If you use two different PDF converters, the text extracted with PDFBox may
differ.
So if you must extract text from a PDF file with specific positioning, you
need to take more intelligent steps:
parse for known words, or extend the framework to parse just a specific
position.
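
As a rough sketch of the "parse for known words" route: once you have the
extracted text as a String, you can search for a known label while tolerating
missing or extra spaces, since different converters may produce either. The
class and method names and the "Invoice No" label are hypothetical, chosen only
for this example. For the position-based route, PDFBox ships a
PDFTextStripperByArea class that may be a better starting point than the code
below.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KnownWordLookup {

    /**
     * Builds a pattern that matches the label even if the extractor dropped
     * or inserted whitespace between its characters, then captures the rest
     * of the line after an optional colon.
     */
    static Pattern spaceTolerant(String label) {
        StringBuilder regex = new StringBuilder();
        for (char c : label.toCharArray()) {
            if (Character.isWhitespace(c)) {
                continue; // whitespace in the label itself is optional
            }
            if (regex.length() > 0) {
                regex.append("\\s*");
            }
            regex.append(Pattern.quote(String.valueOf(c)));
        }
        // The value must start with something other than whitespace or a colon.
        regex.append("\\s*:?\\s*([^\\s:].*)");
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** Returns the text following the first occurrence of the label, if any. */
    static Optional<String> valueAfter(String extractedText, String label) {
        Matcher m = spaceTolerant(label).matcher(extractedText);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }
}
```

With this, valueAfter("InvoiceNo: 4711\nTotal: 12", "Invoice No") yields
"4711" even though the space in the label is missing from the extracted text.
The match is deliberately loose, so a longer word that merely starts with the
label would also match; tighten the pattern if that matters for your documents.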

For some background on how the PDF format came about, have a look here:
http://en.wikipedia.org/wiki/Pdf

Hope this helps somehow.

Kind regards,
  Micha
