Hi Jeremias,

No, I'm not having any trouble at all :-). I'm just curious about how PDFBox works internally, and about how Adobe designed its PDF format. On this page (http://en.wikipedia.org/wiki/Portable_Document_Format#Adobe.27s_versions) you can see all previous (and current) versions of the PDF format. Can any of these versions support the text layer? What does Adobe call this "extra text layer"? Wikipedia gives no technical details about this "text layer".
Can we detect with PDFBox whether an image has been OCR'd? Or do we just try to get the contents and, if the contents are null, run the page through some kind of OCR engine? And what happens if we OCR a PDF that was already OCR'd? Do we get an extra "text layer"? So 1 image, 1 layer with the first OCR and 1 layer with the second OCR?

Jochen

-----Original Message-----
From: Jeremias Maerki [mailto:[email protected]]
Sent: Tuesday, 10 July 2012 16:11
To: [email protected]
Subject: Re: How does PDFBox extract text from a PDF?

On 10.07.2012 15:36:02 Jochen Hebbrecht wrote:
> My first question is: how is text stored in a PDF? I think there are 2
> ways to store text in a PDF:
> a) vector PDF: the PDF contains a line telling it to print a word in a
> specific font on a specific location

That's the usual case, yes.

> b) OCR text has been added to the image as an extra layer (I think
> this is called, the XMP metadata)

No, actually OCR software usually just adds white-on-white text behind the bitmap. This would technically be like your a). XMP Metadata is really just for metadata, not actual text content.

> Is this information correct?
>
> So, if PDFBox wants to extract text from a PDF, how does it extract
> the data? Is it looking at the XMP metadata? Or the vector details?
> Any developer wanting to help me on this issue?

PDFBox interprets the text painting operators (as if it were painting the PDF), looks up the actual character for a code point (character "a" might be at code point 7 (or whatever) when a subset CID font is used, for example) and emits that as Unicode text. Well, that's simplified. There are some additional heuristics for things like placement and order of text, but those don't really affect the actual process of extracting text.

There is another location where a PDF can carry text, but that's not supported by PDFBox, AFAIK: the "ActualText" entries of tagged PDFs can contain text of artifacts on a page (e.g. an image). That's used for enabling visually impaired people to read certain documents.

I guess the question is: what are you trying to do? Do you have a problem you're trying to solve? If you want to learn about how text is put into a PDF, run PDFBox's PDFDebugger and open a random PDF. That allows you to explore all the details of a PDF. Quite enlightening if you don't know the PDF specification by heart.

Jeremias Maerki
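
To make the code-point lookup described in this thread a bit more concrete, here is a minimal sketch in Python. It is not PDFBox code and does not use the PDFBox API. It assumes a heavily simplified content stream in which text is only ever shown with the `Tj` operator on a hex string, every character code is a single byte, and the font's code-to-Unicode table is already available as a plain dictionary. The names `TO_UNICODE`, `SHOW_TEXT` and `extract_text` are made up for this illustration.

```python
import re

# Hypothetical lookup table standing in for a font's code-to-Unicode mapping.
# In a subset font the codes are arbitrary, so "a" may well sit at code 7.
TO_UNICODE = {3: "c", 7: "a", 12: "t"}

# A hex string operand followed by the Tj ("show text") operator, e.g. "<03070C> Tj".
SHOW_TEXT = re.compile(r"<([0-9A-Fa-f\s]*)>\s*Tj")


def extract_text(content_stream, to_unicode):
    """Decode every Tj operand code by code and return the Unicode text."""
    pieces = []
    for match in SHOW_TEXT.finditer(content_stream):
        hex_digits = re.sub(r"\s+", "", match.group(1))
        if len(hex_digits) % 2:
            # An odd number of hex digits is padded with a trailing zero.
            hex_digits += "0"
        codes = bytes.fromhex(hex_digits)  # assumption: one byte per character code
        # Codes without a mapping become U+FFFD instead of raising an error.
        pieces.append("".join(to_unicode.get(code, "\ufffd") for code in codes))
    return "".join(pieces)


# A page that paints three glyphs with codes 3, 7 and 12.
text_page = "BT /F1 12 Tf 72 720 Td <03070C> Tj ET"
print(extract_text(text_page, TO_UNICODE))  # prints "cat"

# A page that only draws a bitmap and has no text operators at all.
image_page = "q 612 0 0 792 0 0 cm /Im0 Do Q"
print(repr(extract_text(image_page, TO_UNICODE)))  # prints ''
```

In this toy model a page that only draws a bitmap yields an empty string, which is one plausible signal that the page has not been OCR'd yet. A real extractor also has to handle other text operators, multi-byte codes and text positioning, none of which this sketch attempts.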

