[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-15 Thread Eric R Manzitti (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429280#comment-17429280
 ] 

Eric R Manzitti commented on PDFBOX-5290:
-

Guys, this works as expected on 2.0.24.  Thank you very much for the patience 
and guidance. 

> ClassCastException during Text Extraction
> -
>
> Key: PDFBOX-5290
> URL: https://issues.apache.org/jira/browse/PDFBOX-5290
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.20, 2.0.24
> Reporter: Eric R Manzitti
> Priority: Major
> Attachments: newBroke.pdf, newBroke.txt
>
>
> I am getting: 
>  
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be 
> cast to org.apache.pdfbox.cos.COSArray
> When executing the following code:
>  
> // Needs java.io.File, java.io.IOException, java.nio.charset.StandardCharsets
> // and the PDFBox classes PDDocument and PDFTextStripper.
> public byte[] extractTextPDFBox(String fileNamePath) throws PQException {
>     PDFLibraryProperties pdfLibraryProperties = PDFLibraryProperties.getInstance();
>     String regex = pdfLibraryProperties.getAsString(
>             PDFLibraryConstants.REGEX_TO_REMOVE_FROM_EXTRACTED_TEXT);
>     // try-with-resources closes the document even if getText throws
>     try (PDDocument pdfDoc = PDDocument.load(new File(fileNamePath))) {
>         PDFTextStripper pdfStripper = new PDFTextStripper();
>         String textFromPDF = pdfStripper.getText(pdfDoc);
>         // Clean the text first, then encode once with an explicit charset
>         String textStr = textFromPDF.replaceAll(regex, PDFLibraryConstants.BLANK_SPACE);
>         return textStr.getBytes(StandardCharsets.UTF_8);
>     } catch (IOException e) {
>         pqUtilityLogger.logError(e.getMessage());
>         throw new PQException(e.getMessage());
>     }
> }
>  
> It dies on String textFromPDF = pdfStripper.getText(pdfDoc);
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-12 Thread Eric R Manzitti (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427696#comment-17427696
 ] 

Eric R Manzitti commented on PDFBOX-5290:
-

Still looking to test this.  (We honestly moved off PDFBox for text extraction, 
but that caused more issues than resolutions) so I am going to be getting this 
sorted out soon, should be today or tomorrow.




[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-08 Thread Eric R Manzitti (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426203#comment-17426203
 ] 

Eric R Manzitti commented on PDFBOX-5290:
-

I will test this today and let y'all know.  I am skeptical because I don't see 
how a freshly built instance with the 2.0.24 version in the pom.xml could 
possibly get a different version on a newly created "build-image".




[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-07 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425697#comment-17425697
 ] 

Tilman Hausherr commented on PDFBOX-5290:
-

No, they're the same. Please try a clean build and remove all old versions from 
the classpath, i.e. look in the directories to see what's there. If it still 
happens, please share the stack trace.
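
One way to "look at what's there" from inside the JVM is to list the class path entries that look like PDFBox JARs. This is only a rough sketch using the standard library: the class name ClasspathScan and the "pdfbox-" file name prefix are my assumptions, and in a servlet container or a fat JAR the java.class.path property will not show the JARs the application actually loads, so the deployment directories still need a manual look there.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ClasspathScan {

    /** Returns the class path entries whose file name starts with prefix and ends with ".jar". */
    static List<String> findJars(String classPath, String prefix) {
        List<String> hits = new ArrayList<>();
        // Entries are separated by ':' on Unix and ';' on Windows
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            String name = new File(entry).getName();
            if (name.startsWith(prefix) && name.endsWith(".jar")) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String classPath = System.getProperty("java.class.path", "");
        // The prefix also matches sibling artifacts such as pdfbox-tools,
        // so several hits are not automatically a conflict.
        List<String> hits = findJars(classPath, "pdfbox-");
        hits.forEach(System.out::println);
        if (hits.size() > 1) {
            System.out.println("Several JARs match; check that their versions agree.");
        }
    }
}
```

If two different versions of the same artifact show up, the older one may be the one that gets loaded, depending on class path order.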




[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-07 Thread Eric R Manzitti (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425552#comment-17425552
 ] 

Eric R Manzitti commented on PDFBOX-5290:
-

I also double-checked in my IDE that my "external dependency" on PDFBox was 
indeed 2.0.24.  It was.  Is it at all possible that the app and the library are 
different?
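
A possible way to check what the running application really loads, as opposed to what the IDE shows, is to ask the JVM where a class came from. The sketch below uses only standard Java APIs; the class name ClassOrigin is made up for illustration, and the Implementation-Version value is only present if the JAR manifest declares one (otherwise it is null).

```java
import java.security.CodeSource;

public class ClassOrigin {

    /** Describes where the named class was loaded from, without initializing it. */
    static String describe(String className) {
        try {
            Class<?> c = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK classes have no code source, so fall back to a placeholder
            String location = (src == null || src.getLocation() == null)
                    ? "(bootstrap or unknown)"
                    : src.getLocation().toString();
            Package p = c.getPackage();
            String version = (p == null) ? null : p.getImplementationVersion();
            return className + " loaded from " + location
                    + ", Implementation-Version: " + version;
        } catch (ClassNotFoundException e) {
            return className + " is not on the class path";
        }
    }

    public static void main(String[] args) {
        // Run this inside the deployed application, not just in the IDE
        System.out.println(describe("org.apache.pdfbox.pdmodel.PDDocument"));
    }
}
```

Logging this line once at startup in the deployed build would show the JAR path that is actually in use there.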




[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-06 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425321#comment-17425321
 ] 

Tilman Hausherr commented on PDFBOX-5290:
-

It happens with 2.0.20 on the command line but not with 2.0.24.




[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-06 Thread Eric R Manzitti (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425209#comment-17425209
 ] 

Eric R Manzitti commented on PDFBOX-5290:
-

Nope.  It works when I run it from the command line.  Uhh hmm.  Okay, 
thanks... Sorry I didn't try that first.

I assume that when I run the ExtractText command line tool, it's using a 
PDFTextStripper object?




[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-06 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425193#comment-17425193
 ] 

Maruan Sahyoun commented on PDFBOX-5290:


Works for me using the PDFBox command line tool ExtractText - see 
https://pdfbox.apache.org/2.0/commandline.html#extracttext

Do you get any error message?  
