[
https://issues.apache.org/jira/browse/TIKA-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated TIKA-4443:
----------------------------------
Attachment: screenshot-2.png
> ClassCastException while extracting the text of a PDF
> -----------------------------------------------------
>
> Key: TIKA-4443
> URL: https://issues.apache.org/jira/browse/TIKA-4443
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 3.0.0, 3.1.0, 3.2.0
> Reporter: Olivier Ceulemans
> Priority: Minor
> Attachments: 112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf,
> screenshot-1.png, screenshot-2.png
>
>
> A ClassCastException occurs when extracting the text of the attached PDF
> file with Tika 3.2.0, 3.1.0, and 3.0.0. I did not try earlier versions.
> A simple way to reproduce the issue is to run the
> org.apache.tika.example.SimpleTextExtractor class from the tika-example
> library, which is part of the distribution.
> I also tried plain PDFBox without Tika, and the text can be extracted.
> That leads me to believe this could be a real issue rather than a corrupted
> PDF.
> Here is the stack trace:
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2aa27288
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
>     at org.apache.tika.Tika.parseToString(Tika.java:525)
>     at org.apache.tika.Tika.parseToString(Tika.java:495)
>     at org.apache.tika.Tika.parseToString(Tika.java:594)
>     at org.apache.tika.example.SimpleTextExtractor.main(SimpleTextExtractor.java:32)
> Caused by: java.lang.ClassCastException: class org.apache.pdfbox.cos.COSArray cannot be cast to class org.apache.pdfbox.cos.COSDictionary (org.apache.pdfbox.cos.COSArray and org.apache.pdfbox.cos.COSDictionary are in unnamed module of loader 'app')
>     at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:53)
>     at org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode.convertCOSToPD(PDEmbeddedFilesNameTreeNode.java:30)
>     at org.apache.pdfbox.pdmodel.common.PDNameTreeNode.getNames(PDNameTreeNode.java:272)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:856)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractFilesfromEFTree(AbstractPDF2XHTML.java:871)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractEmbeddedDocuments(AbstractPDF2XHTML.java:375)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:998)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:253)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:107)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>     ... 6 more
>
> And here is the file that causes the issue:
> [^112145_EXE_PI--_FEGI_MAT01_FT-0033_-0.pdf]
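
The "Caused by" frame in the trace above reports a COSArray being cast to COSDictionary while the embedded-files name tree is read. This may mean that a value in that tree is an array where a dictionary is expected. The following is a minimal, self-contained sketch of that failure mode and of a guarded alternative. It does not use PDFBox or Tika: CosValue, CosDict, CosArr, convertUnchecked and convertGuarded are hypothetical stand-ins invented for illustration, and this is not the actual library code or a proposed fix.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameTreeCastSketch {

    // Hypothetical stand-ins for a COS object hierarchy (not the PDFBox classes).
    interface CosValue {}

    static final class CosDict implements CosValue {
        final Map<String, String> entries = new LinkedHashMap<>();
    }

    static final class CosArr implements CosValue {
        final List<CosValue> items = new ArrayList<>();
    }

    // Assumes every name-tree value is a dictionary; throws
    // ClassCastException when the value is an array instead.
    static CosDict convertUnchecked(CosValue value) {
        return (CosDict) value;
    }

    // Checks the runtime type first and returns null for unexpected
    // values, so a caller can skip the entry instead of aborting.
    static CosDict convertGuarded(CosValue value) {
        return value instanceof CosDict ? (CosDict) value : null;
    }

    public static void main(String[] args) {
        // An array where a dictionary is expected, as in the reported trace.
        CosValue unexpected = new CosArr();

        try {
            convertUnchecked(unexpected);
            System.out.println("unchecked: converted");
        } catch (ClassCastException e) {
            System.out.println("unchecked: ClassCastException");
        }

        CosDict guarded = convertGuarded(unexpected);
        System.out.println("guarded: " + (guarded == null ? "skipped" : "converted"));
    }
}
```

Whether skipping such an entry is the right behaviour for this file is an open question; the sketch only shows why an unchecked cast aborts the whole extraction while a type check would not.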
--
This message was sent by Atlassian Jira
(v8.20.10#820010)