[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096866#comment-15096866 ]
Tilman Hausherr edited comment on TIKA-1830 at 1/14/16 5:05 PM:
----------------------------------------------------------------

I can't reproduce the difference for the file 074531.pdf. ExtractText returns identical results, which makes me doubt the entire test :-(
(edit: the same holds for 362980.pdf, 058103.pdf, and 760707.pdf)

I can reproduce the difference for 290377.pdf; this is because of a change in decompression (rev 1709182) that tries to squeeze as much as possible from corrupt streams.

There may be some differences due to a bugfix related to "article beads". This will mean improved results for files with correct beads, but worse results for files where bead rectangles are incorrect.

> Upgrade to PDFBox 1.8.11 when available
> ---------------------------------------
>
>                 Key: TIKA-1830
>                 URL: https://issues.apache.org/jira/browse/TIKA-1830
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>         Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)