Eric R Manzitti created PDFBOX-5290:
---------------------------------------

             Summary: ClassCastException during Text Extraction
                 Key: PDFBOX-5290
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5290
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.24, 2.0.20
            Reporter: Eric R Manzitti
         Attachments: newBroke.pdf

I am getting the following exception:

java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSArray

When executing the following code:

 

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public byte[] extractTextPDFBox(String fileNamePath) throws PQException {
    PDFLibraryProperties pdfLibraryProperties = PDFLibraryProperties.getInstance();
    String regex = pdfLibraryProperties.getAsString(
            PDFLibraryConstants.REGEX_TO_REMOVE_FROM_EXTRACTED_TEXT);

    // try-with-resources closes the document and the stream even if getText() throws
    try (FileInputStream fis = new FileInputStream(new File(fileNamePath));
         PDDocument pdfDoc = PDDocument.load(fis)) {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String textFromPDF = pdfStripper.getText(pdfDoc); // the ClassCastException is thrown here
        // Apply the regex to the extracted string, then encode once as UTF-8
        String textStr = textFromPDF.replaceAll(regex, PDFLibraryConstants.BLANK_SPACE);
        return textStr.getBytes(StandardCharsets.UTF_8);
    } catch (IOException e) {
        pqUtilityLogger.logError(e.getMessage());
        throw new PQException(e.getMessage());
    }
}

 

The exception is thrown at: String textFromPDF = pdfStripper.getText(pdfDoc);
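
The message suggests that an object that turned out to be a dictionary was downcast to an array without a type check. As a rough illustration of that failure mode only, here is a minimal sketch using hypothetical stand-in classes (Node, DictNode, ArrayNode, and both helper methods are made up for this example; they are not the PDFBox COS types, and this is not PDFBox's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class CastSketch {
    // Hypothetical stand-ins for a common base type and two subtypes.
    static class Node {}
    static class DictNode extends Node {}
    static class ArrayNode extends Node {
        final List<Node> items = new ArrayList<>();
    }

    // Unchecked downcast: throws ClassCastException when value is a DictNode.
    static int sizeUnchecked(Node value) {
        return ((ArrayNode) value).items.size();
    }

    // Guarded variant: returns -1 when the value is not an ArrayNode.
    static int sizeChecked(Node value) {
        if (value instanceof ArrayNode) {
            return ((ArrayNode) value).items.size();
        }
        return -1;
    }

    public static void main(String[] args) {
        Node unexpected = new DictNode();
        System.out.println("checked size: " + sizeChecked(unexpected));
        try {
            sizeUnchecked(unexpected);
        } catch (ClassCastException e) {
            // Same kind of failure as in the report above
            System.out.println("unchecked cast failed: " + e.getMessage());
        }
    }
}
```

If something like this is what happens inside the library, the attached file presumably contains a dictionary where an array is expected, but that is an assumption on my part.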

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
