[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction
[ https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429280#comment-17429280 ]

Eric R Manzitti commented on PDFBOX-5290:
-----------------------------------------

Guys, this works as expected on 2.0.24. Thank you very much for the patience and guidance.

> ClassCastException during Text Extraction
> -----------------------------------------
>
>                 Key: PDFBOX-5290
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5290
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.20, 2.0.24
>            Reporter: Eric R Manzitti
>            Priority: Major
>         Attachments: newBroke.pdf, newBroke.txt
>
>
> I am getting:
>
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSArray
>
> When executing the following code:
>
> public byte[] extractTextPDFBox(String fileNamePath) throws PQException {
>     String UTF_8 = "UTF-8";
>     PDFLibraryProperties pdfLibraryProperties = PDFLibraryProperties.getInstance();
>     String regex = pdfLibraryProperties.getAsString(PDFLibraryConstants.REGEX_TO_REMOVE_FROM_EXTRACTED_TEXT);
>     byte[] bytesToReturn;
>     try {
>         FileInputStream fis = new FileInputStream(new File(fileNamePath));
>         PDDocument pdfDoc = PDDocument.load(fis);
>         PDFTextStripper pdfStripper = new PDFTextStripper();
>         String textFromPDF = pdfStripper.getText(pdfDoc);
>         pdfDoc.close();
>         bytesToReturn = textFromPDF.getBytes(UTF_8);
>         String textStr = new String(bytesToReturn).replaceAll(regex, PDFLibraryConstants.BLANK_SPACE);
>         bytesToReturn = textStr.getBytes();
>         fis.close();
>     } catch (IOException e) {
>         pqUtilityLogger.logError(e.getMessage());
>         throw new PQException(e.getMessage());
>     }
>     return bytesToReturn;
> }
>
> It dies on String textFromPDF = pdfStripper.getText(pdfDoc);

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction
[ https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427696#comment-17427696 ]

Eric R Manzitti commented on PDFBOX-5290:
-----------------------------------------

Still looking to test this. (We honestly moved off PDFBox for text extraction, but that caused more issues than resolutions.) So I am going to get this sorted out soon, should be today or tomorrow.
[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction
[ https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426203#comment-17426203 ]

Eric R Manzitti commented on PDFBOX-5290:
-----------------------------------------

I will test this today and let y'all know. I am skeptical, because I don't see how a freshly built instance with the 2.0.24 version in the pom.xml could possibly get a different version on a newly created "build-image".
[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction
[ https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425697#comment-17425697 ]

Tilman Hausherr commented on PDFBOX-5290:
-----------------------------------------

No, they're the same. Please try a clean build and remove all old versions from the classpath, i.e. look in the directories to see what's there. If it still happens, please share the stack trace.
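
One way to act on the advice above is to ask the running JVM where it actually loaded the PDFBox classes from, rather than trusting the pom.xml or the IDE view. The following is only a minimal sketch, not something posted in the issue: the class name `LoadedFrom` and its two helper methods are hypothetical, and it assumes the PDFBox 2.0.x package name `org.apache.pdfbox.text.PDFTextStripper`. The manifest version it prints may be null if the jar does not declare one.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports where a class was loaded from at runtime. */
public class LoadedFrom {

    /** Returns the jar or directory a class came from, or a note if unknown. */
    public static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Typical for JDK classes loaded by the bootstrap loader.
            return "(no code source: bootstrap or JDK class)";
        }
        return source.getLocation().toString();
    }

    /** Looks a class up by name so this file compiles without PDFBox present. */
    public static String describe(String className) {
        try {
            // 'false' avoids running static initializers of the target class.
            Class<?> type = Class.forName(className, false,
                    LoadedFrom.class.getClassLoader());
            Package pkg = type.getPackage();
            // May be null when the jar manifest has no Implementation-Version.
            String version = (pkg == null) ? null : pkg.getImplementationVersion();
            return className + " loaded from " + locationOf(type)
                    + ", manifest version " + version;
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " is not on the classpath";
        }
    }

    public static void main(String[] args) {
        // PDFTextStripper is the class used in the reported code (assumed 2.0.x package).
        System.out.println(describe("org.apache.pdfbox.text.PDFTextStripper"));
    }
}
```

Run inside the deployed application (not just the IDE), this prints the jar path that was actually picked up, so a stale 2.0.20 jar left over in a lib directory or build image would show itself.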
[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction
[ https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425552#comment-17425552 ]

Eric R Manzitti commented on PDFBOX-5290:
-----------------------------------------

I also double checked in my IDE that my "external dependency" to PDFBox was indeed 2.0.24. It was. Is it at all possible the app and the library are different?
[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction
[ https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425321#comment-17425321 ]

Tilman Hausherr commented on PDFBOX-5290:
-----------------------------------------

It happens with 2.0.20 on the command line but not with 2.0.24.
[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction
[ https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425209#comment-17425209 ]

Eric R Manzitti commented on PDFBOX-5290:
-----------------------------------------

Nope. It works when I run it from the command line. Uhh, hmm. Okay, thanks... Sorry I didn't try that first. I assume that when I run the ExtractText command line thingy, it's using a PDFTextStripper object?
[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction
[ https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425193#comment-17425193 ]

Maruan Sahyoun commented on PDFBOX-5290:
----------------------------------------

Works for me using the PDFBox command line tool ExtractText, see https://pdfbox.apache.org/2.0/commandline.html#extracttext

Do you get any error message?