Re: AW: PDF contains any text?

Hesham G. Wed, 03 Feb 2010 23:27:05 -0800

Well, at least for now, instead of getting all the PDF text at once, you can loop over the pages, get each page's text, and check whether it contains any text. If it does, exit the loop.
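A minimal sketch of that loop, kept independent of PDFBox so it can run on its own. The class name EarlyExitTextCheck and the pageText callback are hypothetical names used only for illustration; the callback stands in for whatever extracts the text of a single page. The check trims whitespace on the assumption that an extractor may return only line separators for a page without text.

```java
import java.util.function.IntFunction;

public class EarlyExitTextCheck {

    /**
     * Returns true as soon as one page yields non-blank text.
     * pageText maps a 1-based page number to that page's extracted text.
     */
    public static boolean containsText(int pageCount, IntFunction<String> pageText) {
        for (int page = 1; page <= pageCount; page++) {
            String text = pageText.apply(page);
            if (text != null && !text.trim().isEmpty()) {
                return true; // stop here; later pages are never extracted
            }
        }
        return false; // no page produced any visible text
    }
}
```

With PDFBox, the callback could probably be built on a single PDFTextStripper by calling setStartPage(page) and setEndPage(page) before getText(document), with the page count taken from document.getNumberOfPages(). The checked IOException from getText would have to be handled inside the callback.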


Best regards,
Hesham


--------------------------------------------------
From: "Roeder, Andreas" <[email protected]>
Sent: Thursday, February 04, 2010 8:59 AM
To: <[email protected]>
Subject: AW: PDF contains any text?

Dear Hesham,

Thank you very much for your response!

The purpose of my question is this: I need to find out whether all fonts used inside the PDF are embedded. But if a PDF contains only images and no text, I don't need to check for embedded fonts. At the moment I'm doing this:


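import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public boolean containsText(String pdfFile) throws IOException {
    PDDocument document = PDDocument.load(pdfFile);
    try {
        // Extracts the text of the whole document in one pass.
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);
        return text != null && text.length() > 0;
    } finally {
        document.close(); // release the file even if extraction fails
    }
}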

But if the document is very large, this method can take a while. As soon as some text is found, I could already return true. But I couldn't figure out how to do that.


Best Regards,

Andreas


-----Original Message-----
From: Hesham G. [mailto:[email protected]]
Sent: Thursday, February 4, 2010 07:47
To: [email protected]
Subject: Re: PDF contains any text?


I remember there was some way in PDFBox to read some resources from the PDF
and skip others. I don't remember how, but I think there's a way to skip
parsing the images in the PDF.

Best regards,
Hesham
--------------------------------------------------
From: "Erik Scholtz, ArgonSoft GmbH" <[email protected]>
Sent: Wednesday, February 03, 2010 6:03 PM
To: <[email protected]>
Subject: Re: PDF contains any text?

Andreas,

telling what a document contains without parsing its content sounds to me
like you are looking for the PDDocument.oracle_of_delphi() method :)

But to answer your question: No. You have to look at the resources of each
page to find out whether there are text resources or not.

There is no "central resource_available dictionary" in PDF.
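
A rough sketch of that per-page resource check, using a simplified model rather than the PDFBox API: each page's resource dictionary is represented as a plain Java Map, with a "Font" entry for fonts and an "XObject" entry whose form XObjects may carry their own "Resources". The class name FontResourceCheck and this Map-based model are assumptions made for illustration only. Note that a font listed in the resources does not prove that any text is actually drawn, so this only narrows down which documents need a closer look.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FontResourceCheck {

    /** True if any page's resources (or a nested form XObject) list a font. */
    public static boolean anyPageHasFonts(List<Map<String, Object>> pageResources) {
        // Identity-based set guards against resource dictionaries shared or nested in cycles.
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        for (Map<String, Object> resources : pageResources) {
            if (hasFonts(resources, seen)) {
                return true; // early exit on the first page with font resources
            }
        }
        return false;
    }

    private static boolean hasFonts(Map<?, ?> resources, Set<Object> seen) {
        if (resources == null || !seen.add(resources)) {
            return false; // nothing to inspect, or already visited
        }
        Object fonts = resources.get("Font");
        if (fonts instanceof Map && !((Map<?, ?>) fonts).isEmpty()) {
            return true;
        }
        Object xobjects = resources.get("XObject");
        if (xobjects instanceof Map) {
            for (Object xobject : ((Map<?, ?>) xobjects).values()) {
                if (xobject instanceof Map) {
                    // Form XObjects may have their own resource dictionary.
                    Object nested = ((Map<?, ?>) xobject).get("Resources");
                    if (nested instanceof Map && hasFonts((Map<?, ?>) nested, seen)) {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}
```

In a real document the dictionaries would come from the PDF library's page objects instead of hand-built maps; the traversal order and the early exit would stay the same.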


Best regards,
Erik

Roeder, Andreas wrote:

Hi,

Is there a way to find out if a PDF contains any text without parsing the
whole document?
Some PDFs contain just images.

Best Regards,

Andreas
