[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1442:
------------------------------
    Attachment: PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx
                PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx

[~tilman], mea culpa.  That botch was typical of the rest of my day yesterday.

I reran with fresh builds (b162) of 1.8.8-SNAPSHOT and added three extra 
columns to help highlight content differences.

Taking the entry for 005/005260.pdf as an example:
*TOP_10_UNIQUE_TOKEN_DIFFS_PDFBox_1_8_6*
contains the top 10 most frequent tokens that appear in the text extracted via 
1.8.6 but not in 1.8.8
{noformat}
originat: 2 | can't: 1 | don't: 1 | editor's: 1 | 
leaving: 1 | retroactively: 1 | site's: 1 | stovepiped: 1 | tic's: 1
{noformat}

*TOP_10_UNIQUE_TOKEN_DIFFS_PDFBox_1_8_8-b162-CLASSIC*
contains the top 10 most frequent tokens that appear in 1.8.8 but not in 1.8.6
{noformat}
insideros: 8 | ohelpo: 4 | os: 4 | ooriginatingo: 3 |
 osearch: 3 | ooriginat: 2 | opaint: 2 | owholly: 2 | 
results.o: 2 | searcho: 2
{noformat}

*TOP_10_TOKEN_DIFFS*
captures the increase or decrease as we move from 1.8.6 to 1.8.8.  There are 10 
more "o", 8 fewer "insider's", 8 more "insideros", etc.
{noformat}
o: 10 | insider's: -8 | insideros: 8 | search: -5 | help: -4 | 
ohelpo: 4 | os: 4 | s: -4 | ooriginatingo: 3 | originating: -3
{noformat}
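
For anyone reading along, the three columns can be approximated as follows. This is only a minimal sketch, not the actual eval code: it assumes lowercased, whitespace-split tokens, and the function names (tokenize, unique_token_diffs, token_diffs) are hypothetical. The real tokenizer and tie-breaking may well differ.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; the real eval may tokenize differently.
    return Counter(text.lower().split())


def unique_token_diffs(a, b, n=10):
    """Top-n most frequent tokens that appear in a but never in b."""
    only_a = {tok: cnt for tok, cnt in a.items() if tok not in b}
    # Sort by descending count, then alphabetically to make ties deterministic.
    return sorted(only_a.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def token_diffs(old, new, n=10):
    """Top-n signed count changes (new minus old), ranked by absolute size."""
    deltas = {tok: new.get(tok, 0) - old.get(tok, 0) for tok in set(old) | set(new)}
    changed = {tok: d for tok, d in deltas.items() if d != 0}
    return sorted(changed.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:n]


def format_column(pairs):
    # Render in the "token: count | token: count" style used in the spreadsheet.
    return " | ".join(f"{tok}: {cnt}" for tok, cnt in pairs)


if __name__ == "__main__":
    # Toy inputs made up for illustration; not taken from 005/005260.pdf.
    old = tokenize("insider's help insider's search")
    new = tokenize("insideros ohelpo insideros search o")
    print(format_column(unique_token_diffs(old, new)))  # tokens only in old
    print(format_column(unique_token_diffs(new, old)))  # tokens only in new
    print(format_column(token_diffs(old, new)))         # signed changes
```

A token that only changes in frequency shows up in TOP_10_TOKEN_DIFFS but in neither of the two "unique" columns, which is why the three columns are reported separately.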

The eval modifications are hot off the press, and there may be surprises.

As you found, there may be surprises in getting the correct versions of PDFBox, 
too. :(

Cheers!

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.8
>
>         Attachments: PDFBox_1_8_6DVPDFBox_1_8_8-TRAD-b156.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> PDFBox_1_8_8-TRADVPDFBox_1_8_8-NONSEQ-b156.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
