[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-12-03 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx

I've now looked at the 1.8.6 vs 1.8.8 file - looks nice. There are some slight 
differences for special fonts, but I don't see these as regressions.

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: PDFBox_1_8_6DVPDFBox_1_8_8-TRAD-b156.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
 PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx, 
 PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 PDFBox_1_8_8-TRADVPDFBox_1_8_8-NONSEQ-b156.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip


 Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
 1.7.  Let's use this issue to carry on the discussion of regression testing 
 (if any further discussion is necessary) or any other prep that needs to 
 happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-12-02 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx
PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx

[~tilman], mea culpa.  That botch was typical of the rest of my day yesterday.

I reran with fresh builds b162 of 1.8.8-SNAPSHOT.  I added three extra columns 
to help highlight content differences:

If you look at the entry for 005/005260.pdf...
*TOP_10_UNIQUE_TOKEN_DIFFS_PDFBox_1_8_6*
contains the top 10 most frequent tokens that appear in the text extracted via 
1.8.6 but not in 1.8.8
{noformat}
originat: 2 | can't: 1 | don't: 1 | editor's: 1 | 
leaving: 1 | retroactively: 1 | site's: 1 | stovepiped: 1 | tic's: 1
{noformat}

*TOP_10_UNIQUE_TOKEN_DIFFS_PDFBox_1_8_8-b162-CLASSIC*
contains the top 10 most frequent tokens that appear in 1.8.8 but not in 1.8.6
{noformat}
insideros: 8 | ohelpo: 4 | os: 4 | ooriginatingo: 3 |
 osearch: 3 | ooriginat: 2 | opaint: 2 | owholly: 2 | 
results.o: 2 | searcho: 2
{noformat}

*TOP_10_TOKEN_DIFFS*
captures the increase or decrease as we move from 1.8.6 to 1.8.8.  There are 10 
more o, 8 fewer insider's, 8 more insideros, etc.
{noformat}
o: 10 | insider's: -8 | insideros: 8 | search: -5 | help: -4 | 
ohelpo: 4 | os: 4 | s: -4 | ooriginatingo: 3 | originating: -3
{noformat}
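
A rough sketch of how the three columns described above could be computed from per-file token counts. This is not the actual eval code: the whitespace tokenizer, the function names, and the tie-breaking order are assumptions made for illustration, and the sample strings are invented.

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a stand-in for whatever
    # analyzer the real eval code uses (assumption).
    return text.lower().split()


def top_unique(a, b, n=10):
    """Tokens that occur in `a` but never in `b`, most frequent first."""
    only_a = [(t, c) for t, c in a.items() if t not in b]
    return sorted(only_a, key=lambda kv: (-kv[1], kv[0]))[:n]


def top_diffs(a, b, n=10):
    """Signed change in count per token when moving from `a` to `b`,
    largest absolute change first."""
    deltas = [(t, b[t] - a[t]) for t in set(a) | set(b)]  # Counter gives 0 if missing
    changed = [(t, d) for t, d in deltas if d != 0]
    return sorted(changed, key=lambda kv: (-abs(kv[1]), kv[0]))[:n]


def fmt(pairs):
    # Same "token: count | token: count" layout as the spreadsheet cells.
    return " | ".join(f"{t}: {c}" for t, c in pairs)


# Invented sample text, only to exercise the functions.
old = Counter(tokenize("insider's search help insider's search"))
new = Counter(tokenize("insideros search ohelpo insideros o o"))

print(fmt(top_unique(old, new)))
print(fmt(top_unique(new, old)))
print(fmt(top_diffs(old, new)))
```

With real data, `old` and `new` would hold the token counts of the text extracted from the same file by 1.8.6 and 1.8.8 respectively.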

The eval modifications are hot off the press, and there may be surprises.

As you found, there may be surprises in getting the correct versions of PDFBox, 
too. :(

Cheers!

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: PDFBox_1_8_6DVPDFBox_1_8_8-TRAD-b156.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
 PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 PDFBox_1_8_8-TRADVPDFBox_1_8_8-NONSEQ-b156.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-12-02 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx

Thanks... one problem in both Excel files: the copying of my remark doesn't 
work. In the Excel file, I saw this formula:
{code}
=_xlfn.IFNA(SVERWEIS($A2;Sheet2!$A$2:$C$108;2;FALSCH);)
{code}
IFNA was introduced in Excel 2013 and is not available in earlier versions. Next 
time, please use IFERROR. (I did this and now it works.) What I didn't do is 
replace the formulas with their results. (But what is _xlfn.?)
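
For readers on a non-German Excel: SVERWEIS and FALSCH are the German-locale names of VLOOKUP and FALSE, and as far as I know the _xlfn. prefix is what Excel displays in front of a function that the running version does not support. The intent of the formula is an exact-match lookup that yields an empty value instead of an error when the key is missing. A minimal sketch of that behaviour follows; the function name and the sample notes are hypothetical.

```python
def lookup_note(key, table, default=""):
    """Exact-match lookup that returns `default` instead of raising,
    mirroring the idea of IFERROR(VLOOKUP(key, range, 2, FALSE), "")."""
    return table.get(key, default)


# Hypothetical reviewer notes keyed by file name (stand-in for Sheet2).
notes = {"a.pdf": "note A", "b.pdf": "note B"}

print(lookup_note("a.pdf", notes))              # found
print(repr(lookup_note("missing.pdf", notes)))  # falls back to ""
```

IFERROR catches every error type, not only #N/A, so it can hide unrelated mistakes in the lookup; that is the trade-off for working in Excel versions before 2013.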

About seq vs. nonseq - I think the nonseq parser is now slightly better, see my 
comments.

The other file I'll look at tomorrow :-)

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: PDFBox_1_8_6DVPDFBox_1_8_8-TRAD-b156.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
 PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx, 
 PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 PDFBox_1_8_8-TRADVPDFBox_1_8_8-NONSEQ-b156.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-12-01 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-TRADVPDFBox_1_8_8-NONSEQ-b156.xlsx
PDFBox_1_8_6DVPDFBox_1_8_8-TRAD-b156.xlsx

I need to figure out what caused the JSON errors.  I've only included those 
diffs with exceptions or < 1.0 overlap. Let me know if you have questions, and 
thank you, again!

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: PDFBox_1_8_6DVPDFBox_1_8_8-TRAD-b156.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 PDFBox_1_8_8-TRADVPDFBox_1_8_8-NONSEQ-b156.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-30 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-30 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: (was: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx)

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-29 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx

Here's my evaluation of the test. I wasn't finished, but it would be nice to 
use my comments in the next test.

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1442:
--
Attachment: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx

This is a comparison of PDFBox 1.8.6 and PDFBox 1.8.8-SNAPSHOT build 145.  This 
was run via Tika 1.7-SNAPSHOT, which uses the classic parser by default.  I'll 
post a comparison file of 1.8.8-SNAPSHOT-145 with classic vs. nonSeq shortly.

It looks like there are only a few regressions, and many improvements.

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx

This file compares PDFBox 1.8.8-SNAPSHOT-b145 with the classic parser vs the 
NonSequential parser.  I've only included the files that had any diffs in 
extracted content, attachments or metadata.

There is one fewer exception with the NonSeq and a few handfuls of new 
exceptions.

Text extraction looks to be mixed, with some better and some worse.  Note, 
though, that there are only 94 files with exceptions or any amount of 
difference out of 50,000 PDFs (under 0.2%).

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_6VPDFBox_1_8_8-b145.zip

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
 PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
 PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-24 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1442:

Fix Version/s: (was: 1.7)
   1.8

- push to 1.8

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.8

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip

I'm done now; the result is two new issues, PDFBOX-2448 and PDFBOX-2449. 
However, PDFBOX-2448 isn't relevant to 1.8.8.

Many changes are positive ones: files that no longer throw an exception, or 
files that have better text extraction.


 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.7

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx

[~tilman], thank you, again, for all of your work on this.

Tika community, if you have a chance, take a look at the attached comparison 
file and recommend other statistics that would be useful for file comparison 
(TIKA-1332) and junk detection (TIKA-1443).

I added the following columns:
language id: language and confidence score
top10words
count of the top 10 words that are stopwords in English (based on Lucene's 
StandardAnalyzer's list)...I need to make this language specific...if the 
langid component says so, we need to count the number of so stopwords.
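
A small sketch of the stopword column, made language specific by keying the stopword sets on the detected language code. The stopword sets below are tiny illustrative subsets, not Lucene's actual lists, and the function names and whitespace tokenization are assumptions.

```python
from collections import Counter

# Illustrative subsets only; NOT the real Lucene stopword lists.
STOPWORDS = {
    "en": {"a", "and", "in", "is", "of", "the", "to"},
    "de": {"der", "die", "das", "und", "in", "ist"},
}


def top_words(text, n=10):
    """The n most frequent tokens, ties broken alphabetically."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def stopwords_in_top(text, lang, n=10):
    """How many of the top-n words are stopwords for the detected language.
    Unknown languages get an empty stopword set and therefore a count of 0."""
    stop = STOPWORDS.get(lang, set())
    return sum(1 for word in top_words(text, n) if word in stop)


sample = "the cat and the dog in the house"
print(stopwords_in_top(sample, "en"))
print(stopwords_in_top(sample, "de"))
```

A low count for a text whose detected language has a stopword list may hint at junk output, though a threshold would need to be tuned on real data.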

I renamed some of the column headers.  I finally had a chance to break out 
Manning and Schutze... token overlap is actually Dice coefficient.
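
For reference, the usual set-based Dice coefficient is 2|A ∩ B| / (|A| + |B|), where A and B are the sets of distinct tokens in the two texts. The sketch below assumes that definition; whether the eval code computes it exactly this way is not confirmed here, and the empty-input convention is a choice made for the example.

```python
def dice(tokens_a, tokens_b):
    """Set-based Dice coefficient over distinct tokens."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return 2 * len(a & b) / (len(a) + len(b))


print(dice("a b c".split(), "b c d".split()))  # 2 shared of 3 + 3 distinct tokens
print(dice("a a b".split(), "a b b b".split()))  # same vocabulary, different counts
```

Because this form ignores how often each token occurs, two texts can score 1.0 while still differing in token frequencies or order.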

I added a vlookup column for [~tilman]'s notes. 

I cannot figure out why I'm getting different lang id confidence scores for a 
given file pair if the Dice Coefficient is 1.0.  I need to look into this.

All a work in progress...

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.7

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx

Reran on latest 1.8.8-SNAPSHOT.  

Added token counts and an overlap measure, something like Dice, but one that 
takes into account token counts, not just binary overlap/unique counts.
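
One plausible count-aware variant, offered as an assumption rather than the formula actually used in the spreadsheet: take the minimum count of each shared token, double the sum, and divide by the total number of tokens in both texts.

```python
from collections import Counter


def weighted_overlap(tokens_a, tokens_b):
    """Dice-like overlap computed on token counts instead of token sets."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0  # convention chosen here for two empty texts
    shared = sum((a & b).values())  # Counter & keeps the minimum count per token
    return 2 * shared / total


# Same vocabulary, different frequencies: the set-based Dice would give 1.0,
# while the count-aware score drops below 1.0.
print(weighted_overlap("a a b".split(), "a b b b".split()))
```

Unlike the set-based coefficient, this score falls below 1.0 when a token is merely repeated a different number of times, which makes partial losses of text visible.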

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.7

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
 pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-16 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: (was: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx)

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.7




[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-16 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.7

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx

