[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1442:
------------------------------
    Attachment: PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx
                PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx

[~tilman], mea culpa. That botch was typical of the rest of my day yesterday.

I reran with fresh builds b162 of 1.8.8-SNAPSHOT.

I added three extra columns to help highlight content differences:

If you look at the entry for 005/005260.pdf...

*TOP_10_UNIQUE_TOKEN_DIFFS_PDFBox_1_8_6* contains the top 10 most frequent tokens that appear in the text extracted via 1.8.6 but not in 1.8.8
{noformat}
originat: 2 | can't: 1 | don't: 1 | editor's: 1 | leaving: 1 | retroactively: 1 | site's: 1 | stovepiped: 1 | tic's: 1
{noformat}

*TOP_10_UNIQUE_TOKEN_DIFFS_PDFBox_1_8_8-b162-CLASSIC* contains the top 10 most frequent tokens that appear in 1.8.8 but not in 1.8.6
{noformat}
insideros: 8 | ohelpo: 4 | os: 4 | ooriginatingo: 3 | osearch: 3 | ooriginat: 2 | opaint: 2 | owholly: 2 | results.o: 2 | searcho: 2
{noformat}

*TOP_10_TOKEN_DIFFS* captures the increase or decrease as we move from 1.8.6 to 1.8.8. There are 10 more "o", 8 fewer "insider's", 8 more "insideros", etc.
{noformat}
o: 10 | insider's: -8 | insideros: 8 | search: -5 | help: -4 | ohelpo: 4 | os: 4 | s: -4 | ooriginatingo: 3 | originating: -3
{noformat}

The eval modifications are hot off the press, and there may be surprises. As you found, there may be surprises in getting the correct versions of PDFBox, too. :(

Cheers!
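For anyone who wants to sanity-check the three columns described above, here is a minimal sketch of how such token diffs could be computed. It is not the eval code itself: the lowercase, whitespace-only tokenizer and the helper names (`tokenize`, `top_unique_tokens`, `top_token_diffs`, `fmt`) are assumptions made for illustration, and the real tokenization and tie-breaking may differ.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace only, so tokens
    # such as "insider's" or "results.o" stay in one piece.
    return text.lower().split()


def top_unique_tokens(a, b, n=10):
    """Most frequent tokens that occur in counter `a` but never in `b`."""
    unique = {tok: cnt for tok, cnt in a.items() if tok not in b}
    # Sort by descending count, then alphabetically to make ties stable.
    return sorted(unique.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def top_token_diffs(old, new, n=10):
    """Largest count changes (new minus old), ranked by absolute size."""
    diffs = {tok: new[tok] - old[tok] for tok in old.keys() | new.keys()}
    changed = {tok: d for tok, d in diffs.items() if d != 0}
    return sorted(changed.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:n]


def fmt(pairs):
    # Render pairs in the "token: count | token: count" style used above.
    return " | ".join(f"{tok}: {cnt}" for tok, cnt in pairs)


# Toy strings, not taken from any real PDF in the corpus.
old_text = "the insider's guide to search help"
new_text = "the insideros guide to osearch ohelpo o"

old_counts = Counter(tokenize(old_text))
new_counts = Counter(tokenize(new_text))

print(fmt(top_unique_tokens(old_counts, new_counts)))  # only in old
print(fmt(top_unique_tokens(new_counts, old_counts)))  # only in new
print(fmt(top_token_diffs(old_counts, new_counts)))    # signed change
```

The first two calls correspond to the two TOP_10_UNIQUE_TOKEN_DIFFS columns (one per extraction), and the last to TOP_10_TOKEN_DIFFS, where a negative number means the token became less frequent in the newer extraction.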
> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.8
>
>         Attachments: PDFBox_1_8_6DVPDFBox_1_8_8-TRAD-b156.xlsx,
> PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx,
> PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-b145.zip,
> PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx,
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx,
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx,
> PDFBox_1_8_8-TRADVPDFBox_1_8_8-NONSEQ-b156.xlsx,
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx,
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to
> 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika
> 1.7. Let's use this issue to carry on the discussion of regression testing
> (if any further discussion is necessary) or any other prep that needs to
> happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)