problem in extracting text using PDFBox ---------------------------------------
Key: PDFBOX-547 URL: https://issues.apache.org/jira/browse/PDFBOX-547 Project: PDFBox Issue Type: Bug Components: Text extraction Affects Versions: 0.7.0 Reporter: Jignesh Sh Hi All, I am facing problem in extracting text using PDFBox. Program hang at the line pdfText = stripper.getText(pdDoc); and returns nothing. Actually I am using PDFBox version PDFBox-0.6.7a.jar Here is my code public String getPDFContent(ZipEntry pdfEntry) { boolean status = false; String pdfText = null; ZipIssueFactory issueFactory = null; logger.debug("Processing : " + pdfEntry.getName()); COSDocument cosDoc = null; PDDocument pdDoc = null; try { cosDoc = parseDocument(zipFile.getInputStream(pdfEntry)); // Load InputStream into memory // skipping the PDF document, if it is encrypted if (cosDoc.isEncrypted()) { logger.warn("Can not decrypt PDF document w/o password, skipping:"+ pdfEntry.getName()); return pdfText; } // extract PDF document's textual content pdDoc = new PDDocument(cosDoc); PDFTextStripper stripper = new PDFTextStripper(); pdfText = stripper.getText(pdDoc); } catch (IOException e) { pdfText = null; logger.error("IOException in parsing PDF document: " + e); } finally{ closeCOSDocument(cosDoc); closePDDocument(pdDoc); } return pdfText; } private static COSDocument parseDocument(InputStream is) throws IOException { PDFParser parser = new PDFParser(is); parser.parse(); return parser.getDocument(); } -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.