On 07/10/2012 10:10 PM, Jeremias Maerki wrote:
On 10.07.2012 15:36:02 Jochen Hebbrecht wrote:
My first question is: how is text stored in a PDF? I think there are 2 ways
to store text in a PDF:
a) vector PDF: the PDF contains a line telling it to print a word in a
specific font on a specific location
There are actually two cases here:
(1) PDF text operators (BT, ET, Tj), which render (strings) etc. as
text using a font; or
(2) Vector line drawing using Bezier curves etc. to represent glyphs.
The former can be extracted by fop. The latter, which is common in
desktop publishing, needs OCR or special vector-to-font matching
analysis and AFAIK cannot be processed by fop.
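
To make case (1) concrete, here is a minimal Python sketch, not how any particular library does it. It pulls the literal strings shown by Tj out of an uncompressed content-stream fragment. The fragment and the names CONTENT, TJ_RE and shown_strings are made up for illustration. Real streams are usually compressed, may use TJ arrays or hex strings, and the bytes are character codes in the font's encoding, so decoding them as Latin-1 is an assumption that only holds for simple fonts.

```python
import re

# Hypothetical, uncompressed content-stream fragment (illustration only).
CONTENT = b"""BT
/F1 12 Tf
72 720 Td
(Hello, world) Tj
0 -14 Td
(Second \\(escaped\\) line) Tj
ET
"""

# A literal string immediately followed by the Tj operator.
# Handles backslash escapes, but not nested unescaped parentheses.
TJ_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def shown_strings(stream):
    """Return the literal strings passed to Tj, in stream order."""
    out = []
    for match in TJ_RE.finditer(stream):
        raw = match.group(1)
        # Undo only the simple escapes \( \) and \\ (octal escapes are ignored).
        unescaped = re.sub(rb"\\([()\\])", rb"\1", raw)
        # Assumption: character codes map directly to Latin-1.
        out.append(unescaped.decode("latin-1"))
    return out


if __name__ == "__main__":
    for line in shown_strings(CONTENT):
        print(line)
```

Nothing like this works for case (2), because the stream then contains only path operators (m, l, c, f and so on) and no strings at all.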
There is another location where a PDF can carry text, but that's not
supported by PDFBox, AFAIK: the "ActualText" entries of tagged PDFs
can contain text of artifacts on a page (e.g. an image). That's used
for enabling visually impaired people to read certain documents.
It's also generally an unmangled, linebreak-free, column-free version of
the text, which can be a real bonus. When it's there - and when it's
correct, because of course there are tools out there that generate
ActualText entries full of invalid garbage or empty ActualText entries.
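
As a rough illustration, one place ActualText can appear is the property list of a marked-content sequence (BDC ... EMC) in the content stream. The Python sketch below collects such entries from a made-up fragment and skips the empty ones mentioned above. FRAGMENT, ACTUALTEXT_RE and actual_texts are hypothetical names. It assumes literal Latin-1 strings written inline, whereas real files may use hex or UTF-16BE strings, or keep ActualText in the structure tree instead of the content stream.

```python
import re

# Hypothetical content-stream fragment: an image tagged with ActualText,
# followed by a span whose ActualText entry is empty.
FRAGMENT = b"""/Span << /ActualText (Company logo) >> BDC
q 100 0 0 50 72 700 cm /Im1 Do Q
EMC
/Span << /ActualText () >> BDC
BT /F1 12 Tf 72 650 Td (x) Tj ET
EMC
"""

# /ActualText followed by a literal string (escapes allowed, no nesting).
ACTUALTEXT_RE = re.compile(rb"/ActualText\s*\(((?:\\.|[^\\()])*)\)")


def actual_texts(stream):
    """Return the non-empty ActualText values found inline in the stream."""
    values = []
    for match in ACTUALTEXT_RE.finditer(stream):
        raw = match.group(1)
        if not raw.strip():
            continue  # empty entry: nothing useful to extract
        # Assumption: the string is plain Latin-1, not UTF-16BE.
        values.append(raw.decode("latin-1"))
    return values


if __name__ == "__main__":
    print(actual_texts(FRAGMENT))
```

Filtering out empty entries is the easy part. Detecting entries that are present but wrong still needs a comparison against the visible text.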
--
Craig Ringer