Hi,

> On 5 May 2014 at 12:50, Qingchao Kong <[email protected]> wrote:
>
>
> Hi, I am using PDFBox to extract text from PDF files.
> I noticed that, for some PDF files (usually old PDFs), when you select
> some text with the mouse in a PDF reader (I use Evince on Ubuntu),
> different text appears than what is displayed when nothing is
> selected.
>
> I find that PDFBox sometimes extracts this selected text rather than
> the text displayed when nothing is selected. Could anybody tell me why
> this happens? Have I made myself clear?
Sounds like a scanned document. Some scanners combine the scanned picture and
the recognized text (produced by more or less accurate OCR software) in one PDF.
The picture is visible and the text is invisible but can still be extracted, so
the displayed content differs from the extracted content.
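
To check whether a page really carries such a hidden text layer, one rough
approach is to look at the page's decoded content stream for the text rendering
mode operator. In PDF, "3 Tr" selects rendering mode 3 (neither fill nor
stroke), which is the mode OCR layers commonly use. The sketch below is plain
Java, not PDFBox API: the class and method names are hypothetical, it assumes
you already have the content stream as a decoded string, and the regex is naive
(it would also match "3 Tr" inside a string literal).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class InvisibleTextCheck {
    // Matches the text rendering mode operator with a one-digit operand, e.g. "3 Tr".
    private static final Pattern TR =
            Pattern.compile("(?<![\\w.])(\\d)\\s+Tr(?![\\w])");

    // Returns true if the stream sets rendering mode 3 (invisible text) anywhere.
    static boolean usesInvisibleText(String contentStream) {
        Matcher m = TR.matcher(contentStream);
        while (m.find()) {
            if (m.group(1).equals("3")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up content streams for illustration only.
        String ocrLayer = "BT 3 Tr /F1 12 Tf 72 700 Td (hidden OCR text) Tj ET";
        String normal = "BT /F1 12 Tf 72 700 Td (visible text) Tj ET";
        System.out.println(usesInvisibleText(ocrLayer));
        System.out.println(usesInvisibleText(normal));
    }
}
```

If the check is positive for a page that also contains a full-page image, the
extracted text most likely comes from the OCR layer rather than from what is
displayed.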

BR
Andreas Lehmkühler
