[ 
https://issues.apache.org/jira/browse/TIKA-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-3253:
----------------------------------
    Fix Version/s: 1.26
                   2.0.0

> improve "attachments" tika-eval report directory
> ------------------------------------------------
>
>                 Key: TIKA-3253
>                 URL: https://issues.apache.org/jira/browse/TIKA-3253
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-eval
>    Affects Versions: 1.25
>         Environment: W10
>            Reporter: Tilman Hausherr
>            Priority: Minor
>             Fix For: 2.0.0, 1.26
>
>         Attachments: GHOSTSCRIPT-690526-0.pdf, 
> container_files_missing_in_B_by_mime.xlsx
>
>
> While doing regression testing for PDFBox I found 
> container_files_missing_in_B_by_mime.xlsx
> which has
> MIME_STRING        CNT
> application/pdf    4
> The report does not say which files these are, and neither do the other 
> reports. I was able to find out by querying the H2 database directly with 
> this query
> {code}
> select pa.file_name
> from profiles_a pa
> left join profiles_b pb on pa.id=pb.id
> where pb.id is null and pa.is_embedded=false
> {code}
> and got
> GHOSTSCRIPT-690526-0.pdf
> GHOSTSCRIPT-692591-0.pdf
> GHOSTSCRIPT-692591-2.pdf
> PDFBOX-4319-0.zip-0.pdf
> So my suggestion is to add two files to the report directory that list the 
> names of the affected files.
> I have attached one of the "bad" PDF files. The B extract is empty; Tika runs 
> forever on it. I'll investigate that separately. (Update: PDFBOX-5049. Will 
> probably be solved by TIKA-3246)
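
The anti-join from the description can be tried out on toy data. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module in place of H2, the table contents and file names are made up, and the helper names (missing_container_files, build_toy_db) are hypothetical. Only the table names and the columns id, file_name and is_embedded come from the query above. Reading the "two files" as one listing per direction (missing in B, missing in A) is also an assumption.

```python
import sqlite3

# Table names taken from the query in the issue; used as a whitelist
# because table names cannot be bound as SQL parameters.
TABLES = {"profiles_a", "profiles_b"}


def missing_container_files(conn, present, absent):
    """Return names of non-embedded files in `present` with no matching id in `absent`."""
    if present not in TABLES or absent not in TABLES:
        raise ValueError("unknown table name")
    # Same left join / "is null" anti-join as in the issue.
    # SQLite stores booleans as integers, so is_embedded=false becomes is_embedded = 0.
    sql = (
        f"select p.file_name from {present} p "
        f"left join {absent} q on p.id = q.id "
        "where q.id is null and p.is_embedded = 0 "
        "order by p.file_name"
    )
    return [row[0] for row in conn.execute(sql)]


def build_toy_db():
    """Create an in-memory database with made-up rows (not real tika-eval output)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table profiles_a (id integer primary key, file_name text, is_embedded integer);
        create table profiles_b (id integer primary key, file_name text, is_embedded integer);
        insert into profiles_a values (1, 'one.pdf', 0), (2, 'two.pdf', 0), (3, 'two.pdf/img.png', 1);
        insert into profiles_b values (1, 'one.pdf', 0), (4, 'four.pdf', 0);
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_toy_db()
    print("missing in B:", missing_container_files(conn, "profiles_a", "profiles_b"))
    print("missing in A:", missing_container_files(conn, "profiles_b", "profiles_a"))
```

Swapping the two table arguments gives the opposite direction, so the same query shape could feed both name listings.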



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
