Sorry, I just realized I posted this to the wrong mailing list. Please ignore it.

-----Original Message-----
From: Phani Kumar Samudrala [mailto:phanikuma...@arisglobal.co.in]
Sent: Tuesday, February 12, 2013 3:53 PM
To: dev@tika.apache.org
Subject: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

I am using the Tika 1.2 Java API to extract text from a PDF, and I am getting the exception below. The error occurs only for some PDF documents; others parse fine. I couldn't figure out a reason for this. When I tried Tika 1.1, it worked fine. Please let me know if any of you have seen this error and how to fix it.

Here is the exception:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1fbfd6
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
    at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 3 more

Here is the Java code snippet:

    String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
    File file = new File(fileString);
    URL url = file.toURI().toURL();
    ParseContext context = new ParseContext();
    Detector detector = new DefaultDetector();
    Parser parser = new AutoDetectParser(detector);
    Metadata metadata = new Metadata();
    context.set(Parser.class, parser);
    // PPT, Word, XLSX -- PDF, HTML
    ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
    InputStream input = TikaInputStream.get(url, metadata);
    ContentHandler handler = new BodyContentHandler(outputstream);
    parser.parse(input, handler, metadata, context);
    input.close();
    outputstream.close();

Thanks

________________________________
Disclaimer: This transmission, including attachments, is confidential, proprietary, and may be privileged. It is intended solely for the intended recipient. If you are not the intended recipient, you have received this transmission in error and you are hereby advised that any review, disclosure, copying, distribution, or use of this transmission, or any of the information included therein, is unauthorized and strictly prohibited. If you have received this transmission in error, please immediately notify the sender by reply and permanently delete all copies of this transmission and its attachments.