Eric R Manzitti created PDFBOX-5290:
---------------------------------------

             Summary: ClassCastException during Text Extraction
                 Key: PDFBOX-5290
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5290
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.24, 2.0.20
            Reporter: Eric R Manzitti
         Attachments: newBroke.pdf


I am getting:

    java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSArray

when executing the following code:

    public byte[] extractTextPDFBox(String fileNamePath) throws PQException {
        String UTF_8 = "UTF-8";
        PDFLibraryProperties pdfLibraryProperties = PDFLibraryProperties.getInstance();
        String regex = pdfLibraryProperties.getAsString(PDFLibraryConstants.REGEX_TO_REMOVE_FROM_EXTRACTED_TEXT);
        byte[] bytesToReturn;
        try {
            FileInputStream fis = new FileInputStream(new File(fileNamePath));
            PDDocument pdfDoc = PDDocument.load(fis);
            PDFTextStripper pdfStripper = new PDFTextStripper();
            // The ClassCastException is thrown by the next call.
            String textFromPDF = pdfStripper.getText(pdfDoc);
            pdfDoc.close();
            bytesToReturn = textFromPDF.getBytes(UTF_8);
            String textStr = new String(bytesToReturn).replaceAll(regex, PDFLibraryConstants.BLANK_SPACE);
            bytesToReturn = textStr.getBytes();
            fis.close();
        } catch (IOException e) {
            pqUtilityLogger.logError(e.getMessage());
            throw new PQException(e.getMessage());
        }
        return bytesToReturn;
    }

It dies on:

    String textFromPDF = pdfStripper.getText(pdfDoc);