Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

Phani Kumar Samudrala Tue, 12 Feb 2013 02:23:24 -0800

I am using the Tika 1.2 Java API to extract text from PDFs, and I am getting the 
following exception. The error occurs only for some PDF documents; other PDFs 
parse fine. I couldn't figure out a reason for this. With Tika 1.1 the same 
code works fine. Please let me know if any of you have seen this error and how 
to fix it.


Here is the exception:


org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1fbfd6
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
      at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
      at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
      at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
      at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
      at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
      at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      ... 3 more
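
Reading the "Caused by" part of the trace, the failure looks like an unchecked cast: a link annotation entry that was expected to be a dictionary turned out to be a string in the affected PDFs. The following is only a minimal sketch of that failure mode, not PDFBox's actual source. The classes CosBase, CosString, CosDictionary and CastSketch, and the entry key "A", are hypothetical stand-ins chosen to mirror the names in the trace.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the COS object types named in the trace; NOT the real PDFBox classes.
class CosBase {}

class CosString extends CosBase {
    final String value;
    CosString(String value) { this.value = value; }
}

class CosDictionary extends CosBase {
    final Map<String, CosBase> items = new HashMap<>();
}

class CastSketch {
    // Unchecked cast: throws ClassCastException when the entry is not a dictionary.
    static CosDictionary actionUnchecked(CosDictionary annotation) {
        return (CosDictionary) annotation.items.get("A");
    }

    // Guarded lookup: returns null when the entry is missing or has another type.
    static CosDictionary actionGuarded(CosDictionary annotation) {
        CosBase entry = annotation.items.get("A");
        return (entry instanceof CosDictionary) ? (CosDictionary) entry : null;
    }

    public static void main(String[] args) {
        // Simulate an annotation whose "A" entry is a string instead of a dictionary.
        CosDictionary annotation = new CosDictionary();
        annotation.items.put("A", new CosString("not a dictionary"));
        try {
            actionUnchecked(annotation);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed: " + e.getMessage());
        }
        System.out.println("guarded lookup returned: " + actionGuarded(annotation));
    }
}
```

If this reading is right, it would explain why only some PDFs fail: only files whose link annotations carry a non-dictionary value in that position would reach the bad cast.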


Here is the Java code snippet:


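// Imports needed by the snippet
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

// Inside main(String[] args) throws Exception:
String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
File file = new File(fileString);
URL url = file.toURI().toURL();

ParseContext context = new ParseContext();
Detector detector = new DefaultDetector();
Parser parser = new AutoDetectParser(detector);
Metadata metadata = new Metadata();
context.set(Parser.class, parser); // PPT, Word, XLSX, PDF, HTML

// Extracted body text is written to this stream
ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
InputStream input = TikaInputStream.get(url, metadata);
try {
    ContentHandler handler = new BodyContentHandler(outputstream);
    parser.parse(input, handler, metadata, context); // the exception is thrown here
} finally {
    // Close both streams even when parsing fails
    input.close();
    outputstream.close();
}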


Thanks

