RE: PDF contains any text?

Roeder, Andreas Thu, 04 Feb 2010 23:58:58 -0800

Erik,

I wrote the following code:


public boolean containsText() throws IOException {
    PDDocument document = null;
    try {
        document = PDDocument.load(pdfFile);
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();

        List<PDPage> pages = document.getDocumentCatalog().getAllPages();

        for (PDPage page : pages) {
            // One region per page, built from the page's trim box
            PDRectangle rectangle = page.getTrimBox();
            Rectangle2D.Float awtRect = new Rectangle2D.Float(
                    rectangle.getLowerLeftX(), rectangle.getUpperRightY(),
                    rectangle.getWidth(), rectangle.getHeight());
            stripper.addRegion(page.toString(), awtRect);
            stripper.extractRegions(page);
            for (Object regionObj : stripper.getRegions()) {
                String regionName = (String) regionObj;
                String text = stripper.getTextForRegion(regionName);
                if (text != null && text.length() > 0) {
                    return true;
                }
            }
        }
        return false;
    } finally {
        // load() may have thrown before document was assigned
        if (document != null) {
            document.close();
        }
    }
}


But unfortunately the line:

        String text = stripper.getTextForRegion(regionName);

always returns an empty String. What am I doing wrong?

Best Regards,

Andreas
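
One possible explanation for the empty result, offered as a sketch and not as a confirmed diagnosis: if PDFTextStripperByArea matches regions against text positions measured from the top-left corner of the page, with y growing downward (an assumption here), then a rectangle whose y starts at getUpperRightY() begins at the bottom edge of the page and extends below it, so no text can fall inside. The class, method names, page size and glyph position below are hypothetical and use only java.awt.geom, not PDFBox.

```java
import java.awt.geom.Rectangle2D;

final class RegionOriginCheck {

    // Region built the way the posted code builds it: y is the box's upper-right Y.
    static Rectangle2D.Float regionAsPosted(float llx, float ury, float width, float height) {
        return new Rectangle2D.Float(llx, ury, width, height);
    }

    // Region anchored at the top-left corner (the ASSUMED origin of the stripper's space).
    static Rectangle2D.Float regionFromTopLeft(float width, float height) {
        return new Rectangle2D.Float(0f, 0f, width, height);
    }

    public static void main(String[] args) {
        float width = 612f, height = 792f;  // US Letter in points, for illustration only
        float glyphX = 72f, glyphY = 100f;  // hypothetical text position, y measured from the top

        // The posted rectangle spans y = 792..1584, so the glyph is outside it.
        System.out.println(regionAsPosted(0f, height, width, height).contains(glyphX, glyphY));
        // A rectangle starting at (0, 0) covers the whole page, so the glyph is inside it.
        System.out.println(regionFromTopLeft(width, height).contains(glyphX, glyphY));
    }
}
```

This prints false and then true. If the assumption about the origin holds, building the region as new Rectangle2D.Float(0, 0, width, height) would be the corresponding change in containsText().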


-----Original Message-----
From: Erik Scholtz, ArgonSoft GmbH [mailto:[email protected]] 
Sent: Wednesday, 3 February 2010 17:03
To: [email protected]
Subject: Re: PDF contains any text?


Andreas,

telling what a document contains without parsing its content sounds to 
me like you are looking for the PDDocument.oracle_of_delphi() method :)

But to answer your question: No. To find out, you have to look at the 
resources of each page and check whether they include text resources. 
There is no "central resource_available dictionary" in PDF.


Best regards,
Erik
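
As a rough illustration of looking for text resources, here is a minimal sketch that scans raw bytes for the PDF name /Font. It is a crude heuristic, not PDFBox code and not a substitute for walking each page's resource dictionary with a parser. The class and method names are hypothetical, and the byte strings in main are made-up dictionary fragments, not real files.

```java
import java.nio.charset.StandardCharsets;

final class FontMarkerScan {

    private static final String MARKER = "/Font";

    // True if c can continue a PDF name, i.e. it is neither whitespace nor a delimiter.
    static boolean isNameChar(char c) {
        return c != 0 && !Character.isWhitespace(c) && "()<>[]{}/%".indexOf(c) < 0;
    }

    // True if the bytes contain "/Font" as a complete name, not as a prefix
    // of a longer name such as /FontDescriptor or /FontFile2.
    static boolean mentionsFont(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost in decoding.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        int idx = raw.indexOf(MARKER);
        while (idx >= 0) {
            int next = idx + MARKER.length();
            if (next >= raw.length() || !isNameChar(raw.charAt(next))) {
                return true;
            }
            idx = raw.indexOf(MARKER, next);
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] withFont = "<< /Type /Page /Resources << /Font << /F1 5 0 R >> >> >>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] imageOnly = "<< /Type /Page /Resources << /XObject << /Im0 7 0 R >> >> >>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(mentionsFont(withFont));
        System.out.println(mentionsFont(imageOnly));
    }
}
```

This prints true and then false. The result is only a hint: dictionaries stored in compressed object streams will not show up in the raw bytes, and a page that lists a font resource does not necessarily draw any text with it.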

Roeder, Andreas wrote:
> Hi,
> 
> Is there a way to find out if a PDF contains any text without parsing 
> the whole document? Some PDFs contain just images.
> 
> Best Regards,
> 
> Andreas
>
