[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180402#comment-14180402 ]

Tim Allison edited comment on TIKA-1442 at 10/22/14 8:05 PM:
-------------------------------------------------------------

Sorry, I ran the new eval code on the old 1.8.8 batch process.  I will rerun the batch process 
with the latest 1.8.8.

For file 272372.pdf, I see this in the Excel file that I posted earlier today:
{noformat}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:137)
        at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:120)
        at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:153)
        at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:96)
        at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:38)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
        at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getMetadata(PDDocumentCatalog.java:312)
        at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:181)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        ... 13 more
{noformat}

Should I try to grab more than that?  Or, are you seeing the same thing that 
I'm seeing in the Excel file?
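
For reference, a minimal sketch of the failure mode that the "Caused by" line in the trace reports: an object typed as a dictionary is cast unconditionally to a stream. The classes below (CosBase, CosDictionary, CosStream) are hypothetical stand-ins written for this illustration, not the PDFBox classes, and the checked variant is only one possible way a caller could guard the cast; it is not taken from the PDFBox or Tika source.

```java
import java.util.Optional;

public class MetadataCastSketch {
    // Hypothetical stand-ins for the COS types named in the trace.
    static class CosBase {}
    static class CosDictionary extends CosBase {}
    static class CosStream extends CosDictionary {}

    // Unconditional cast: throws ClassCastException when the entry
    // is a plain dictionary rather than a stream.
    static CosStream uncheckedMetadata(CosBase entry) {
        return (CosStream) entry;
    }

    // Guarded cast: returns an empty Optional instead of throwing
    // when the entry is not a stream.
    static Optional<CosStream> checkedMetadata(CosBase entry) {
        if (entry instanceof CosStream) {
            return Optional.of((CosStream) entry);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulates a document whose metadata entry is a dictionary.
        CosBase malformed = new CosDictionary();
        try {
            uncheckedMetadata(malformed);
            System.out.println("unchecked: no exception");
        } catch (ClassCastException e) {
            System.out.println("unchecked: ClassCastException");
        }
        System.out.println("checked present: " + checkedMetadata(malformed).isPresent());
    }
}
```

With a guard of this kind, a file with a non-stream metadata entry would yield "no metadata" rather than a RuntimeException that propagates up through the parser.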


was (Author: talli...@mitre.org):
Sorry, I ran the new eval code on the old 1.8.8 batch process.  I will rerun the batch process 
with the latest 1.8.8.

For file 27372.pdf, I see this in Excel:
{noformat}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:249)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:137)
        at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:120)
        at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:153)
        at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:96)
        at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:38)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
        at org.apache.pdfbox.pdmodel.PDDocumentCatalog.getMetadata(PDDocumentCatalog.java:312)
        at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:181)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:158)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        ... 13 more
{noformat}

Should I try to grab more than that?  Or, are you seeing the same thing that 
I'm seeing in the Excel file?

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>         Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)