[jira] Created: (PDFBOX-547) problem in extracting text using PDFBox

Jignesh Sh (JIRA) Mon, 26 Oct 2009 06:01:30 -0700

problem in extracting text using PDFBox
---------------------------------------


                 Key: PDFBOX-547
                 URL: https://issues.apache.org/jira/browse/PDFBOX-547
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.7.0
            Reporter: Jignesh Sh


Hi All,
I am having a problem extracting text using PDFBox.
The program hangs at the line pdfText = stripper.getText(pdDoc); and never returns.
Note that I am actually using PDFBox version PDFBox-0.6.7a.jar.
Here is my code:

public String getPDFContent(ZipEntry pdfEntry)
{
        boolean status = false;
        String pdfText = null;
        ZipIssueFactory issueFactory = null;
        logger.debug("Processing : " + pdfEntry.getName());
        COSDocument cosDoc = null;
        PDDocument pdDoc = null;
        try
        {
                // Load InputStream into memory
                cosDoc = parseDocument(zipFile.getInputStream(pdfEntry));

                // skipping the PDF document, if it is encrypted
                if (cosDoc.isEncrypted()) {
                        logger.warn("Can not decrypt PDF document w/o password, skipping:"
                                        + pdfEntry.getName());
                        return pdfText;
                }
                // extract PDF document's textual content
                pdDoc = new PDDocument(cosDoc);
                PDFTextStripper stripper = new PDFTextStripper();
                pdfText = stripper.getText(pdDoc);
        }
        catch (IOException e) {
                pdfText = null;
                logger.error("IOException in parsing PDF document: " + e);
        }
        finally {
                closeCOSDocument(cosDoc);
                closePDDocument(pdDoc);
        }
        return pdfText;
}

private static COSDocument parseDocument(InputStream is) throws IOException {
        PDFParser parser = new PDFParser(is);
        parser.parse();
        return parser.getDocument();
}
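
To confirm that the call really hangs rather than just running slowly, and to keep the rest of the zip processing going, the extraction could be run with a time limit. The following is only a sketch and not part of PDFBox: the class TimedExtraction and the method runWithTimeout are hypothetical names, it uses only java.util.concurrent, and it assumes Java 8 or later for the lambda syntax.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of PDFBox.
public class TimedExtraction {

    /**
     * Runs the task on a daemon worker thread and returns its result,
     * or null if the task fails or does not finish within timeoutMillis.
     */
    public static String runWithTimeout(Callable<String> task, long timeoutMillis) {
        // Daemon thread: a worker stuck in an endless loop cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "pdf-extract");
            t.setDaemon(true);
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Requests interruption; code that never checks the interrupt flag keeps running.
            future.cancel(true);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The task itself threw, for example an IOException.
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in task; in the code above this would be the getText call.
        String text = runWithTimeout(() -> "extracted text", 1000);
        System.out.println(text);
    }
}
```

In getPDFContent the task would be the call stripper.getText(pdDoc), made through final local references to the stripper and the document. A null result after the time limit would then indicate that this particular PDF triggers the hang.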

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
