[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180402#comment-14180402 ]
Tim Allison edited comment on TIKA-1442 at 10/22/14 8:05 PM:
-------------------------------------------------------------

Sorry, ran new eval code on old 1.8.8 batch process. Will rerun batch process with latest 1.8.8.

For file 272372.pdf, I see this in the Excel file that I posted earlier today:
{noformat}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:137)
    at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:120)
    at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:153)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:96)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:38)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
    at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getMetadata(PDDocumentCatalog.java:312)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:181)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    ... 13 more
{noformat}

Should I try to grab more than that? Or, are you seeing the same thing that I'm seeing in the Excel file?

was (Author: talli...@mitre.org):
Sorry, ran new eval code on old 1.8.8 batch process. Will rerun batch process with latest 1.8.8.

For file 27372.pdf, I see this in Excel:
{noformat}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:137)
    at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:120)
    at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:153)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:96)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:38)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
    at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getMetadata(PDDocumentCatalog.java:312)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:181)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    ... 13 more
{noformat}

Should I try to grab more than that? Or, are you seeing the same thing that I'm seeing in the Excel file?

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>         Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx,
>                      pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to
> 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika
> 1.7. Let's use this issue to carry on the discussion of regression testing
> (if any further discussion is necessary) or any other prep that needs to
> happen before 1.8.8's release.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)