[ https://issues.apache.org/jira/browse/TIKA-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251722#comment-17251722 ]
Tim Allison commented on TIKA-3253:
-----------------------------------

Will fix today.

> improve "attachments" tika-eval report directory
> ------------------------------------------------
>
> Key: TIKA-3253
> URL: https://issues.apache.org/jira/browse/TIKA-3253
> Project: Tika
> Issue Type: Improvement
> Components: tika-eval
> Affects Versions: 1.25
> Environment: W10
> Reporter: Tilman Hausherr
> Priority: Minor
> Attachments: GHOSTSCRIPT-690526-0.pdf, container_files_missing_in_B_by_mime.xlsx
>
>
> While doing regression testing for PDFBox I found
> container_files_missing_in_B_by_mime.xlsx
> which has
> MIME_STRING CNT
> application/pdf 4
> I have no idea which files this is about. The other reports don't tell it. I
> was able to solve this by accessing the H2 database and then submitting this
> query
> {code}
> select pa.file_name
> from profiles_a pa
> left join profiles_b pb on pa.id=pb.id
> where pb.id is null and pa.is_embedded=false
> {code}
> and got
> GHOSTSCRIPT-690526-0.pdf
> GHOSTSCRIPT-692591-0.pdf
> GHOSTSCRIPT-692591-2.pdf
> PDFBOX-4319-0.zip-0.pdf
> So my suggestion is to add 2 files to the report directory where the names
> are mentioned.
> I have attached one of the "bad" PDF files. The B extract is empty, tika runs
> forever. I'll investigate that separately. (Update: PDFBOX-5049. Will
> probably be solved by TIKA-3246)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)