Well, at least for now, instead of extracting the text of the whole PDF
at once, you can loop over the pages, extract the text of each one and
check whether it is empty. As soon as a page has text, exit the loop.
Best regards,
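A rough sketch of that page-by-page check follows. It is library-agnostic on purpose: `TextProbe` and the `textOfPage` callback are hypothetical names, not PDFBox API. With PDFBox, the callback could be backed by a `PDFTextStripper` limited to one page via `setStartPage(page)` and `setEndPage(page)` before calling `getText(document)`, with the page count taken from `document.getNumberOfPages()`. Since `getText` throws `IOException`, such a callback would have to wrap it, for example in an `UncheckedIOException`. Trimming whitespace is an assumption, in case the extractor emits only line separators for an image-only page.

```java
import java.util.function.IntFunction;

/** Hypothetical helper, not part of PDFBox. */
final class TextProbe {
    private TextProbe() {}

    /**
     * Returns true as soon as one page yields non-whitespace text,
     * so the remaining pages are never extracted.
     *
     * @param textOfPage callback returning the text of a 1-based page number
     * @param pageCount  total number of pages in the document
     */
    static boolean containsText(IntFunction<String> textOfPage, int pageCount) {
        for (int page = 1; page <= pageCount; page++) {
            String text = textOfPage.apply(page);
            if (text != null && !text.trim().isEmpty()) {
                return true; // stop at the first page that has text
            }
        }
        return false; // no page produced any text
    }
}
```

The worst case (a document with no text at all) still visits every page, but a document with text on an early page returns almost immediately.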
Hesham
--------------------------------------------------
From: "Roeder, Andreas" <[email protected]>
Sent: Thursday, February 04, 2010 8:59 AM
To: <[email protected]>
Subject: Re: PDF contains any text?
Dear Hesham,
Thank you very much for your response!
The purpose of my question is: I need to find out if all fonts used inside
the PDF are embedded. But if a PDF only contains images and no text, I
don't need to check for embedded fonts. At the moment I'm doing this:
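import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public boolean containsText(String pdfFile) throws IOException {
    PDDocument document = PDDocument.load(pdfFile);
    try {
        // Extracts the text of the whole document, which is slow for large files.
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);
        return text != null && text.length() > 0;
    } finally {
        document.close(); // release the document even if extraction fails
    }
}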
But if the document is very large, this method can take a while. As soon
as some text is found, I could already return true, but I couldn't figure
out how to do that.
Best Regards,
Andreas
-----Original Message-----
From: Hesham G. [mailto:[email protected]]
Sent: Thursday, February 4, 2010 07:47
To: [email protected]
Subject: Re: PDF contains any text?
I remember there was some way in PDFBox to read only some resources from
the PDF and skip others. I don't remember how, but I think there is a way
to skip parsing the images in the PDF.
Best regards,
Hesham
--------------------------------------------------
From: "Erik Scholtz, ArgonSoft GmbH" <[email protected]>
Sent: Wednesday, February 03, 2010 6:03 PM
To: <[email protected]>
Subject: Re: PDF contains any text?
Andreas,
Telling what a document contains without parsing its content sounds to
me like you are looking for the PDDocument.oracle_of_delphi() method :)
But to answer your question: No. To find that out, you have to look at
the resources of each page and check whether they include text resources.
There is no "central resource_available dictionary" in PDF.
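A minimal sketch of that resource-based check follows, under stated assumptions. `FontResourceProbe` is a hypothetical helper, and its input is assumed to have been collected by the caller: for each page, the names listed in the Font entry of that page's resource dictionary. In PDFBox 1.x these would presumably be the keys of `page.findResources().getFonts()`; treat that call as an assumption and check it against the version in use.

```java
import java.util.Collection;
import java.util.List;

/** Hypothetical helper, not part of PDFBox. */
final class FontResourceProbe {
    private FontResourceProbe() {}

    /**
     * Returns true if at least one page declares a font in its resources.
     *
     * @param fontNamesPerPage one collection of font resource names per page;
     *                         a null or empty entry means the page declares no fonts
     */
    static boolean anyPageDeclaresFonts(List<? extends Collection<String>> fontNamesPerPage) {
        for (Collection<String> fontNames : fontNamesPerPage) {
            if (fontNames != null && !fontNames.isEmpty()) {
                return true; // a font is declared, so the page may draw text
            }
        }
        return false; // no page declares any font
    }
}
```

This only reads dictionaries and never parses a content stream, which is why it should be cheaper than text extraction. It is a heuristic, though: a declared font is not necessarily used to draw text, and fonts can also be declared in the resources of form XObjects nested inside a page.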
Best regards,
Erik
Roeder, Andreas wrote:
Hi,
Is there a way to find out if a PDF contains any text without parsing
the whole document?
Some PDFs contain just images.
Best Regards,
Andreas