[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183338#comment-14183338 ]
Tim Allison edited comment on TIKA-1442 at 10/24/14 7:22 PM:
-------------------------------------------------------------

Hmmm...I can't explain those files, and I recently did some cleanup, so I no longer have the original 1.8.6 output. When I reran with the latest Tika trunk, I got the same number of metadata values for those files with PDFBox 1.8.6 and 1.8.8-SNAPSHOT (vintage 2 days ago).

All the problematic files have attachments. I wonder if recent work on the OCR parser could explain this.

[~tpalsulich], over the last few weeks, was there a time when we were extracting metadata from images, but now we're not?

For 224644.pdf, for example, there doesn't seem to be much metadata for the jpgs now: a total of 40 metadata values for the full document. Last week, when I ran Tika, there were 160 metadata values.

{noformat}
{"Content-Length":"5970","Content-Type":"image/jpeg",
"X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.ocr.TesseractOCRParser"],
"embeddedResourceType":"ATTACHMENT","resourceName":"arrow.jpg",
"tika:embedded_resource_path":"224644.pdf/arrow.jpg"},
{"Content-Length":"5970","Content-Type":"image/jpeg",
"X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.ocr.TesseractOCRParser"],
"embeddedResourceType":"ATTACHMENT","resourceName":"arrow.jpg",
"tika:embedded_resource_path":"224644.pdf/arrow.jpg"}]
{noformat}

In short, [~tilman], I don't think this is a PDFBox issue.

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>         Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx,
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx,
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to
> 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika
> 1.7. Let's use this issue to carry on the discussion of regression testing
> (if any further discussion is necessary) or any other prep that needs to
> happen before 1.8.8's release.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)