Hi,

On 30.11.2012 00:09, Peter Murray-Rust wrote:
I am analysing running text by trapping the output of PDFBox through
org.apache.pdfbox.util.TextPosition through a subclass of
org.apache.pdfbox.pdfviewer.PageDrawer. I notice that there are explicit
characters for spaces (char 32). Sometimes there can be repeated spaces and
even a "paragraph" consisting only of a space. I was unaware that PDF
supported spaces - are these coming from the original document or are they
generated in PDFBox from calculations of character spacing and width?

PDF itself can represent any kind of character, including the space, depending on the font and encoding in use. However, tools that create PDFs (word processors etc.) most likely don't rely on spaces to position text or to separate it into pieces (words, paragraphs etc.). Instead, those pieces, quite often every single character, are positioned directly using explicit coordinates.

AFAIK PageDrawer doesn't add spaces to the text output. Summing that up, I guess those spaces are part of your PDF.

To verify that, you could use the PDFDebugger that comes with PDFBox to inspect the content stream yourself, or you could send us the PDF in question so that we can check it.
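
If you want to automate that check, here is a rough sketch of what to look for. In a content stream, text is shown with operators such as Tj and TJ, whose operands are strings like (Hello World), so an explicit space appears as a literal char 32 inside the parentheses. The code below is plain Java and does not use PDFBox. The class and method names are made up for illustration. It assumes you already have the decoded content stream as text (e.g. copied from PDFDebugger), it only looks at literal strings (not hex strings or octal escapes like \040), and it assumes the font's encoding maps code 32 to a space, which is not guaranteed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class LiteralSpaceScan {

    // Returns the literal strings "( ... )" of a decoded content stream
    // that contain at least one unescaped space character (char 32).
    static List<String> literalsWithSpaces(String contentStream) {
        List<String> hits = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;            // nesting level of unescaped parentheses
        boolean hasSpace = false; // did the current literal contain char 32?
        for (int i = 0; i < contentStream.length(); i++) {
            char c = contentStream.charAt(i);
            if (depth > 0 && c == '\\' && i + 1 < contentStream.length()) {
                // Keep escape sequences such as \( or \) verbatim.
                current.append(c).append(contentStream.charAt(++i));
                continue;
            }
            if (c == '(') {
                // Balanced inner parentheses belong to the string itself.
                if (depth++ > 0) current.append(c);
                continue;
            }
            if (c == ')' && depth > 0) {
                if (--depth == 0) {
                    if (hasSpace) hits.add(current.toString());
                    current.setLength(0);
                    hasSpace = false;
                } else {
                    current.append(c);
                }
                continue;
            }
            if (depth > 0) {
                current.append(c);
                if (c == ' ') hasSpace = true;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample stream: one string with an embedded space, two
        // single characters placed via TJ, and a string that is only a space.
        String sample = "BT /F1 12 Tf 72 700 Td (Hello World) Tj "
                      + "[(A) -250 (B)] TJ ( ) Tj ET";
        System.out.println(literalsWithSpaces(sample));
    }
}
```

If this reports strings for your page, the spaces were written by the producing tool. If it reports nothing although you see char 32 in the output, they must come from somewhere else (hex strings, a different encoding, or later processing).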

TIA for help.

P.


BR
Andreas Lehmkühler
