[ https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426203#comment-17426203 ]
Eric R Manzitti edited comment on PDFBOX-5290 at 10/8/21, 1:58 PM:
-------------------------------------------------------------------
I will test this today and let y'all know. I am skeptical, because I don't see
how a freshly built instance with version 2.0.24 in the pom.xml could
possibly end up with a different version on a newly created "build-image".
Locally, on my machine, that could of course make sense, but on a fresh instance...
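
One way to settle the version question on the build image itself is to ask the
running JVM which jar a PDFBox class was actually loaded from. The following is
only a minimal sketch, not part of the reported code: the class name
ClassOrigin and the method describe are hypothetical names chosen for
illustration, and the Implementation-Version is only printed if the jar's
manifest declares one.

```java
import java.security.CodeSource;

// Hypothetical helper (not from the issue): reports where a class was loaded from.
final class ClassOrigin {

    /** Returns the code source and manifest version of the named class, if any. */
    static String describe(String className) {
        Class<?> type;
        try {
            // Load without running static initializers; only the origin is of interest.
            type = Class.forName(className, false, ClassOrigin.class.getClassLoader());
        } catch (ClassNotFoundException | LinkageError e) {
            return className + ": not found on the classpath";
        }
        // JDK classes have no code source, so both null checks are needed.
        CodeSource source = type.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "(no code source, e.g. a JDK class)"
                : source.getLocation().toString();
        Package pkg = type.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return className + ": loaded from " + location
                + ", Implementation-Version="
                + (version == null ? "(not declared)" : version);
    }

    public static void main(String[] args) {
        // Defaults to the PDFBox class used in the issue; pass another name to check it.
        String name = args.length > 0 ? args[0] : "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println(describe(name));
    }
}
```

Run with the same classpath as the failing application, the printed jar path
shows which PDFBox artifact wins at runtime. If it is not the 2.0.24 jar,
`mvn dependency:tree` should show which dependency pulls in the other version.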
> ClassCastException during Text Extraction
> -----------------------------------------
>
> Key: PDFBOX-5290
> URL: https://issues.apache.org/jira/browse/PDFBOX-5290
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.20, 2.0.24
> Reporter: Eric R Manzitti
> Priority: Major
> Attachments: newBroke.pdf, newBroke.txt
>
>
> I am getting:
>
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSArray
>
> when executing the following code:
>
> public byte[] extractTextPDFBox(String fileNamePath) throws PQException {
>     String UTF_8 = "UTF-8";
>     PDFLibraryProperties pdfLibraryProperties = PDFLibraryProperties.getInstance();
>     String regex = pdfLibraryProperties.getAsString(
>             PDFLibraryConstants.REGEX_TO_REMOVE_FROM_EXTRACTED_TEXT);
>     byte[] bytesToReturn;
>     try {
>         FileInputStream fis = new FileInputStream(new File(fileNamePath));
>         PDDocument pdfDoc = PDDocument.load(fis);
>         PDFTextStripper pdfStripper = new PDFTextStripper();
>         String textFromPDF = pdfStripper.getText(pdfDoc);
>         pdfDoc.close();
>         bytesToReturn = textFromPDF.getBytes(UTF_8);
>         String textStr = new String(bytesToReturn).replaceAll(regex,
>                 PDFLibraryConstants.BLANK_SPACE);
>         bytesToReturn = textStr.getBytes();
>         fis.close();
>     } catch (IOException e) {
>         pqUtilityLogger.logError(e.getMessage());
>         throw new PQException(e.getMessage());
>     }
>     return bytesToReturn;
> }
>
> It dies on String textFromPDF = pdfStripper.getText(pdfDoc);
>