[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096866#comment-15096866
 ] 

Tilman Hausherr edited comment on TIKA-1830 at 1/14/16 5:05 PM:
----------------------------------------------------------------

I can't reproduce the difference for the file 074531.pdf. ExtractText returns 
identical results, which makes me doubt the entire test :-(

(edit: the same applies to 362980.pdf, 058103.pdf, and 760707.pdf)

I can reproduce the difference for 290377.pdf. It is caused by a change in 
decompression (rev 1709182) that tries to recover as much data as possible 
from corrupt streams.
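
To illustrate what recovering data from a corrupt stream can look like, here is a minimal sketch that uses only java.util.zip. It is not the PDFBox code from rev 1709182. The class and method names (LenientInflate, inflateLeniently, deflate) are hypothetical and exist only for this example. The idea is to keep whatever was inflated before the decoder hits bad or missing data, instead of discarding the whole stream.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration, not the PDFBox implementation.
public class LenientInflate {

    /** Inflates zlib data and returns the bytes decoded before any corruption or truncation. */
    public static byte[] inflateLeniently(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // A small buffer limits how much output is lost if inflate() throws mid-call.
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n > 0) {
                    out.write(buffer, 0, n);
                } else if (inflater.needsInput() || inflater.needsDictionary()) {
                    break; // truncated stream: nothing more can be decoded
                }
            }
        } catch (DataFormatException e) {
            // Corrupt data: fall through and keep what was recovered so far.
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Helper that produces zlib-compressed test data. */
    public static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            sb.append("line ").append(i).append(" of a sample content stream\n");
        }
        byte[] original = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] compressed = deflate(original);

        // Simulate a damaged file by cutting the compressed stream in half.
        byte[] truncated = Arrays.copyOf(compressed, compressed.length / 2);
        byte[] recovered = inflateLeniently(truncated);

        System.out.println("original bytes:  " + original.length);
        System.out.println("recovered bytes: " + recovered.length);
    }
}
```

A strict decoder would return nothing for the truncated input, while the lenient one returns a prefix of the original. Two extractor versions can therefore give different text for the same damaged file without either having regressed.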

There may be some differences due to a bugfix related to "article beads". This 
will mean improved results for files with correct beads, but worse results for 
files where bead rectangles are incorrect.


> Upgrade to PDFBox 1.8.11 when available
> ---------------------------------------
>
>                 Key: TIKA-1830
>                 URL: https://issues.apache.org/jira/browse/TIKA-1830
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>         Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>




