On 10.07.2012 15:36:02 Jochen Hebbrecht wrote:
> My first question is: how is text stored in a PDF? I think there are 2 ways
> to store text in a PDF:
> a) vector PDF: the PDF contains a line telling it to print a word in a
> specific font on a specific location

That's the usual case, yes.

> b) OCR text has been added to the image as an extra layer (I think this is
> called, the XMP metadata)

No, actually OCR software usually just adds white-on-white text behind the
bitmap. Technically, this is the same as your a). XMP metadata is really
just for metadata, not actual text content.

> Is this information correct?
>
> So, if PDFBox wants to extract text from a PDF, how does it extract the
> data? Is it looking at the XMP metadata? Or the vector details?
> Any developer wanting to help me on this issue?

PDFBox interprets the text painting operators (as if it were painting the
PDF), looks up the actual character for each code point (character "a"
might be at code point 7, or whatever, when a subset CID font is used, for
example) and emits that as Unicode text. Well, that's simplified. There are
some additional heuristics for things like placement and order of text, but
they don't really affect the actual process of extracting text.

There is another location where a PDF can carry text, but that's not
supported by PDFBox, AFAIK: the "ActualText" entries of tagged PDFs can
contain the text of artifacts on a page (e.g. an image). That's used to
enable visually impaired people to read certain documents.

I guess the question is: what are you trying to do? Do you have a problem
you're trying to solve?

If you want to learn how text is put into a PDF, run PDFBox's PDFDebugger
and open a random PDF. That allows you to explore all the details of a PDF.
Quite enlightening if you don't know the PDF specification by heart.

Jeremias Maerki
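
PS: To make the code point lookup mentioned above a bit more concrete, here
is a minimal sketch in plain Java. It is not PDFBox code and does not use the
PDFBox API. The class name, the method names and the tiny code-to-Unicode
table are all made up for illustration. The table stands in for what a real
extractor would read from the font's encoding or its ToUnicode CMap.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: this is NOT how PDFBox is implemented.
class GlyphCodeLookupSketch {

    // Code-to-Unicode table of one imaginary subset font.
    // In a real PDF this mapping comes from the font's encoding
    // or its ToUnicode CMap; the values here are invented.
    static Map<Integer, String> exampleSubsetFont() {
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(7, "a");
        toUnicode.put(8, "P");
        toUnicode.put(9, "D");
        toUnicode.put(10, "F");
        return toUnicode;
    }

    // Turns the character codes found in a text painting operator
    // into Unicode text by looking each code up in the font's table.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            // Codes without a mapping become U+FFFD (replacement character).
            text.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Codes as they might appear in a content stream for this font.
        int[] shown = {8, 9, 10};
        System.out.println(decode(shown, exampleSubsetFont())); // prints "PDF"
    }
}
```

The point is only that the bytes in the content stream are font-specific
codes, not Unicode, so extraction depends on the mapping information the
font provides.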

