[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx

I've now looked at the 1.8.6 vs 1.8.8 file - looks nice. There are some slight differences for special fonts, but I don't see these as regressions.

Upgrade to PDFBox 1.8.8
---
Key: TIKA-1442
URL: https://issues.apache.org/jira/browse/TIKA-1442
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
Fix For: 1.8
Attachments: PDFBox_1_8_6DVPDFBox_1_8_8-TRAD-b156.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx, PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx, PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, PDFBox_1_8_8-TRADVPDFBox_1_8_8-NONSEQ-b156.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip

Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika 1.7. Let's use this issue to carry on the discussion of regression testing (if any further discussion is necessary) or any other prep that needs to happen before 1.8.8's release.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx
            PDFBox_1_8_6VPDFBox_1_8_8-CLASSIC-b162.xlsx

[~tilman], mea culpa. That botch was typical of the rest of my day yesterday. I reran with fresh builds b162 of 1.8.8-SNAPSHOT.

I added three extra columns to help highlight content differences. If you look at the entry for 005/005260.pdf...

*TOP_10_UNIQUE_TOKEN_DIFFS_PDFBox_1_8_6* contains the top 10 most frequent tokens that appear in the text extracted via 1.8.6 but not in 1.8.8
{noformat}
originat: 2 | can't: 1 | don't: 1 | editor's: 1 | leaving: 1 | retroactively: 1 | site's: 1 | stovepiped: 1 | tic's: 1
{noformat}

*TOP_10_UNIQUE_TOKEN_DIFFS_PDFBox_1_8_8-b162-CLASSIC* contains the top 10 most frequent tokens that appear in 1.8.8 but not in 1.8.6
{noformat}
insideros: 8 | ohelpo: 4 | os: 4 | ooriginatingo: 3 | osearch: 3 | ooriginat: 2 | opaint: 2 | owholly: 2 | results.o: 2 | searcho: 2
{noformat}

*TOP_10_TOKEN_DIFFS* captures the increase or decrease as we move from 1.8.6 to 1.8.8. There are 10 more o, 8 fewer insider's, 8 more insideros, etc.
{noformat}
o: 10 | insider's: -8 | insideros: 8 | search: -5 | help: -4 | ohelpo: 4 | os: 4 | s: -4 | ooriginatingo: 3 | originating: -3
{noformat}

The eval modifications are hot off the press, and there may be surprises. As you found, there may be surprises in getting the correct versions of PDFBox, too. :(

Cheers!
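The three columns described above can be approximated with a few lines of Python. This is only a sketch of the idea, not the code behind the attached spreadsheets: the function name `token_diffs`, the tie-breaking order, and the assumption that each extraction is already tokenized into a list of strings are all hypothetical.

```python
from collections import Counter


def token_diffs(tokens_a, tokens_b, top_n=10):
    """Return (unique to A, unique to B, signed count changes from A to B)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)

    # Tokens that occur in one extraction and never in the other.
    unique_a = Counter({t: c for t, c in counts_a.items() if t not in counts_b})
    unique_b = Counter({t: c for t, c in counts_b.items() if t not in counts_a})

    # Signed change in frequency moving from A to B; unchanged tokens are dropped.
    delta = {t: counts_b[t] - counts_a[t] for t in counts_a.keys() | counts_b.keys()}
    delta = {t: d for t, d in delta.items() if d != 0}

    # Largest absolute change first; ties broken alphabetically (an assumption).
    top_delta = sorted(delta.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:top_n]

    return unique_a.most_common(top_n), unique_b.most_common(top_n), top_delta


if __name__ == "__main__":
    old = ["insider's"] * 8 + ["search"] * 5 + ["the"] * 3
    new = ["insideros"] * 8 + ["o"] * 10 + ["the"] * 3
    for column in token_diffs(old, new):
        print(" | ".join(f"{tok}: {n}" for tok, n in column))
```

Here `A` would be the 1.8.6 extraction and `B` the 1.8.8 extraction, so a negative number in the third column means the token became rarer after the upgrade.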
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-CLASSICVPDFBox_1_8_8-NONSEQ-b162.xlsx

Thanks... one problem in both Excel files: the copying of my remarks doesn't work. In the Excel file, I saw this formula:
{code}
=_xlfn.IFNA(SVERWEIS($A2;Sheet2!$A$2:$C$108;2;FALSCH);)
{code}
IFNA was introduced in Excel 2013 and is not available in earlier versions. Next time, please use IFERROR. (I did this and now it works.) What I didn't do is replace the formulas with their results. (But what is _xlfn.?)

About seq vs. nonseq - I think the nonseq parser is now slightly better, see my comments.

The other file I'll look at tomorrow :-)
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-TRADVPDFBox_1_8_8-NONSEQ-b156.xlsx
            PDFBox_1_8_6DVPDFBox_1_8_8-TRAD-b156.xlsx

I need to figure out what caused the JSON errors. I've only included those diffs with exceptions or 1.0 overlap. Let me know if you have questions, and thank you, again!
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: (was: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx)
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx

Here's my evaluation of the test. I wasn't finished, but it would be nice to use my comments in the next test.
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1442:
--
Attachment: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx

This is a comparison of PDFBox 1.8.6 and PDFBox 1.8.8-SNAPSHOT build 145. This was run via Tika 1.7-SNAPSHOT, which uses the classic parser by default. I'll post a comparison file of 1.8.8-SNAPSHOT-145 with classic vs. nonSeq shortly.

It looks like there are only a few regressions, and many improvements.
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx

This file compares PDFBox 1.8.8-SNAPSHOT-b145 with the classic parser vs the NonSequential parser. I've only included the files that had any diffs in extracted content, attachments or metadata.

There is one fewer exception with the NonSeq and a few handfuls of new exceptions. Text extraction looks to be mixed, with some better and some worse.

Note, though, that there are only 94 files with exceptions or any amount of difference out of 50,000 PDFs.
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_6VPDFBox_1_8_8-b145.zip
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1442:

Fix Version/s: (was: 1.7) 1.8

- push to 1.8
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip

I'm done now; the result is two new issues, PDFBOX-2448 and PDFBOX-2449. However, PDFBOX-2448 isn't relevant to 1.8.8.

Many changes are positive ones: files that no longer throw an exception, or files that have better text extraction.
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx

[~tilman], thank you, again, for all of your work on this.

Tika community, if you have a chance, take a look at the attached comparison file and recommend other statistics that would be useful for file comparison (TIKA-1332) and junk detection (TIKA-1443).

I added the following columns:
* language id: language and confidence score
* top10words
* count of the top 10 words that are stopwords in English (based on Lucene's StandardAnalyzer's list)...I need to make this language specific...if the langid component says so, we need to count the number of so stopwords.

I renamed some of the column headers. I finally had a chance to break out Manning and Schutze... token overlap is actually Dice coefficient.

I added a vlookup column for [~tilman]'s notes.

I cannot figure out why I'm getting different lang id confidence scores for a given file pair if the Dice Coefficient is 1.0. I need to look into this.

All a work in progress...
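For readers unfamiliar with the two statistics mentioned in this message, here is a minimal Python sketch of the Dice coefficient over unique tokens and of a "stopwords among the top 10 words" count. It is not the eval code itself: the function names are hypothetical, `STOPWORDS` is a small illustrative subset rather than the full Lucene StandardAnalyzer list, and treating two empty extractions as a perfect match is an assumption.

```python
from collections import Counter

# Illustrative subset only; NOT the full Lucene StandardAnalyzer stop list.
STOPWORDS = frozenset({"a", "and", "in", "is", "of", "the", "to"})


def dice(tokens_a, tokens_b):
    """Dice coefficient over unique tokens: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # assumption: two empty extractions count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def stopwords_in_top10(tokens, stopwords=STOPWORDS):
    """Count how many of the 10 most frequent tokens are stopwords."""
    top10 = [tok for tok, _ in Counter(tokens).most_common(10)]
    return sum(1 for tok in top10 if tok in stopwords)


if __name__ == "__main__":
    old = "the quick brown fox".split()
    new = "the quick brown dog".split()
    print(dice(old, new))            # 3 shared unique tokens out of 4 + 4
    print(stopwords_in_top10(old))   # only "the" is in the subset
```

Because this version of Dice only looks at the set of unique tokens, two extractions can score 1.0 while still differing in how often each token occurs.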
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx

Reran on latest 1.8.8-SNAPSHOT. Added token counts and a token-count overlap measure: something like Dice, but it takes token counts into account, not just binary overlap/unique counts.
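One plausible reading of "something like Dice, but takes into account token count" is a multiset version of the coefficient, where each token contributes the smaller of its two frequencies to the overlap. The message does not give the exact formula, so the sketch below is an assumption, and `weighted_dice` is a hypothetical name.

```python
from collections import Counter


def weighted_dice(tokens_a, tokens_b):
    """Multiset Dice: 2 * sum(min(count_a, count_b)) / (len(A) + len(B))."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 1.0  # assumption: two empty extractions count as identical
    # Counter intersection keeps, per token, the minimum of the two counts.
    shared = sum((counts_a & counts_b).values())
    return 2 * shared / total


if __name__ == "__main__":
    # Same unique tokens, different frequencies: binary Dice would be 1.0 here.
    print(weighted_dice(["x", "x", "x"], ["x"]))
```

Under this reading, a file pair with identical vocabularies but different repetition counts scores below 1.0, which the set-based coefficient cannot detect.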
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: (was: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx)
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx