Hi,
On 30.11.2012 00:09, Peter Murray-Rust wrote:
I am analysing running text by trapping the output of PDFBox through
org.apache.pdfbox.util.TextPosition through a subclass of
org.apache.pdfbox.pdfviewer.PageDrawer. I notice that there are explicit
characters for spaces (char 32). Sometimes there can be repeated spaces and
even a "paragraph" consisting only of a space. I was unaware that PDF
supported spaces - are these coming from the original document or are they
generated in PDFBox from calculations of character spacing and width?
PDF itself supports every kind of character, depending on the font and
encoding used. But tools that create PDFs (word processors etc.) most likely
don't use spaces to position text or to separate it into pieces (words,
paragraphs etc.). Those pieces, quite possibly every single character, are
positioned directly using specific coordinates.
AFAIK PageDrawer doesn't add spaces to the text output. Summing that up, I guess
those spaces are part of your PDF.
To verify that, you could use the PDFDebugger that comes with PDFBox to check
the content stream yourself, or you could send us the PDF in question so that
we can check it.
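
If you'd rather check programmatically, here is a rough sketch of what
"looking for spaces in the content stream" could mean. It is plain Java and
does not use PDFBox at all: it assumes you already have the decompressed
content stream as a String (e.g. copied out of the debugger). The class and
method names are made up for illustration. It only looks at literal strings
in parentheses, ignores hex strings and comments, does not decode octal
escapes such as \040, and assumes the font's encoding maps code 32 to a
space, which is not guaranteed for every font.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class ContentStreamSpaces {

    // Collects the literal strings "(...)" found in a decompressed content stream.
    // Simplifications: an escaped character is kept as-is (so "\(" becomes "("),
    // octal escapes are not decoded, hex strings "<...>" and comments are ignored.
    static List<String> literalStrings(String stream) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < stream.length()) {
            if (stream.charAt(i) != '(') {
                i++;
                continue;
            }
            StringBuilder sb = new StringBuilder();
            int depth = 1; // PDF allows balanced nested parentheses inside a string
            i++;
            while (i < stream.length() && depth > 0) {
                char c = stream.charAt(i);
                if (c == '\\' && i + 1 < stream.length()) {
                    sb.append(stream.charAt(i + 1));
                    i += 2;
                    continue;
                }
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                    if (depth == 0) {
                        i++;
                        break;
                    }
                }
                sb.append(c);
                i++;
            }
            out.add(sb.toString());
        }
        return out;
    }

    // Counts explicit space characters (char 32) in one string operand.
    static int countSpaces(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ' ') {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Made-up sample stream: one Tj string and one TJ array.
        String sample = "BT /F1 12 Tf 72 700 Td (Hello World) Tj "
                + "[(A) -250 ( ) 120 (B)] TJ ET";
        for (String s : literalStrings(sample)) {
            System.out.println("\"" + s + "\" contains " + countSpaces(s) + " space(s)");
        }
    }
}
```

If strings like "( )" show up in the stream of your document, the spaces were
written by whatever tool created the PDF.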
TIA for help.
P.
BR
Andreas Lehmkühler