[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830974#comment-17830974 ]

Tim Allison commented on TIKA-4218:
---

Thank you, [~tilman]!

The mp4 is weird because exiftool was run in 2.9.2, but our regular MP4Parser was run in 2.9.1.

The bad text in the epub is a side effect of work on TIKA-4219. I've just pushed a fix.

Thank you, again.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830954#comment-17830954 ]

Tilman Hausherr commented on TIKA-4218:
---

6FOMNUPGPA6IG66Z4NIUEQIVOR5ON46Q (an MP4 file) has a loss of metadata (bierenbach: 2 | earlier: 2 | https://www.facebook.com/speedlinecablecam: 2 | https://www.speedline-cablecam.com: 2 | in: 2 | of: 2 | the: 2 | this: 2 | woods: 2 | year: 2)

EEXR753OKDGYAIXL36PZ2EGYPN477SZU and a few other files have one word in TOP_10_MORE_IN_A which reappears in TOP_10_MORE_IN_B but with "oebps" appended. Here, "secretary" becomes "secretaryoebps". I don't know if this is a bug or not.
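For anyone who wants to scan the reports for this kind of merged token automatically, here is a minimal sketch. It assumes the TOP_10_MORE_IN_A and TOP_10_MORE_IN_B cells use the "token: count | token: count" layout quoted above; the function names and the sample counts are hypothetical, and the prefix match is only a heuristic (short tokens such as "in" will produce false positives).

```python
def parse_top_tokens(cell):
    """Parse a 'token: count | token: count' cell into a dict of counts."""
    counts = {}
    for part in cell.split("|"):
        part = part.strip()
        # Split on the LAST ': ' so tokens that contain colons (URLs) survive.
        token, sep, count = part.rpartition(": ")
        if not sep:
            continue  # skip empty or malformed fragments
        counts[token] = int(count)
    return counts


def merged_tokens(more_in_a, more_in_b):
    """Return (a_token, b_token) pairs where b_token is a_token plus a suffix."""
    pairs = []
    for a in more_in_a:
        for b in more_in_b:
            if b != a and b.startswith(a):
                pairs.append((a, b))
    return pairs


# Hypothetical cells: the tokens come from the comment above, the counts do not.
more_in_a = parse_top_tokens("secretary: 3 | year: 2")
more_in_b = parse_top_tokens("secretaryoebps: 3 | woods: 2")
suspects = merged_tokens(more_in_a, more_in_b)
```

With the hypothetical cells above, `suspects` holds the single pair ("secretary", "secretaryoebps"), which is the pattern a missing whitespace separator between text and a file path would produce.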
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830934#comment-17830934 ]

Tim Allison commented on TIKA-4218:
---

I'll start building RC1. If we find any problems, we can cancel the vote and go with rc2.
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830930#comment-17830930 ]

Tim Allison commented on TIKA-4218:
---

The number of new attachments from pptx based on [~xyang200]'s work on TIKA-4211 is amazing: ~3500 new attachments across our corpus.
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830927#comment-17830927 ]

Tim Allison commented on TIKA-4218:
---

This is looking much, much better. I made a small change to the EpubParser that will prevent writing the names of font files into the main contents. The fork of COMPRESS-674 helped out dramatically as did the other fixes. I merged the recent mime detection updates for 3d files.

I _think_ we're good to go with rc1 for 2.9.2.
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830918#comment-17830918 ]

Tim Allison commented on TIKA-4218:
---

https://corpora.tika.apache.org/base/reports/tika-2.9.2b-prerc1-reports.tgz
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830604#comment-17830604 ]

Tilman Hausherr commented on TIKA-4218:
---

To be honest, I didn't look further, because these problems affected too many files. Yes, please rerun the test so that whatever remains will stick out.
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830585#comment-17830585 ]

Tim Allison commented on TIKA-4218:
---

[~tilman], did you see any other blockers/surprises? Once I merge TIKA-4221, I'll rerun the regression tests if there isn't anything else to fix.

I see you've updated pdfbox already! :D
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830105#comment-17830105 ]

Tilman Hausherr commented on TIKA-4218:
---

Follow up in TIKA-4171
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830097#comment-17830097 ]

Tilman Hausherr commented on TIKA-4218:
---

Confirmed. I reverted just that change, and then the text view is longer and ends with "Enter the total number of pages being submitted, including cover sheets, attachments, and documents:". In the current 2.9.2 it ends with "disclosure to GSA shall not be used to make determinations about individuals."
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830094#comment-17830094 ]

Tilman Hausherr commented on TIKA-4218:
---

Oops, or it's part of XFA, I just found it too.
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830093#comment-17830093 ]

Tilman Hausherr commented on TIKA-4218:
---

I found one difference: "Enter the full name of the conveying party or parties" is in 2.9.1 but not in 2.9.2, and in 2.9.1 it appears directly after the main text. This text is in the first field (below "Name of conveying party(ies):") as a /TU entry, which one can get with {{getAlternateFieldName()}}. PDFBox and the PDF specification consider this to be an "alternative field name", and Adobe Reader displays it as a popup when the mouse hovers there.
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830090#comment-17830090 ]

Tim Allison commented on TIKA-4218:
---

Hmmm... Sorry, I didn't specify the exact 2.9.1 file: tika-2.9.1-rc1-rand1m/govdocs1/876/876503.pdf.json

That one does have "party" 62 times.
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830079#comment-17830079 ]

Tilman Hausherr commented on TIKA-4218:
---

The word "party" appears 36 times in the json file, 18 times in my text extraction, but 62 times in the csv file in the TOP_N_TOKENS_A row. The doubling in the json file is because of "xfa_content", but I don't understand the "62".

Thanks for mentioning the new list (I probably missed it). I'll adjust my scripts and use it next time.
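One way to see where such a count comes from is to tally the word separately for every string field in the extract. The sketch below is only an illustration: it assumes the extract is a JSON list of flat objects, and the key names and sample text are hypothetical stand-ins, not the real contents of 876503.pdf.json.

```python
import json
import re
from collections import Counter


def count_word_by_key(records, word):
    """Count whole-word, case-insensitive hits of `word`, grouped by JSON key."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = Counter()
    for record in records:
        for key, value in record.items():
            # A key may hold one string or a list of strings.
            values = value if isinstance(value, list) else [value]
            for v in values:
                if isinstance(v, str):
                    hits[key] += len(pattern.findall(v))
    return hits


# Hypothetical extract with two text-bearing keys.
sample = json.loads("""
[
  {"main_text": "The conveying party and the receiving party.",
   "form_text": "Name of conveying party(ies): party"}
]
""")
counts = count_word_by_key(sample, "party")
```

If the same text is emitted under two keys, the per-key totals make the doubling visible at a glance, and any remaining gap to the report's figure points at tokenization differences rather than duplicated content.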
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830072#comment-17830072 ]

Tim Allison commented on TIKA-4218:
---

1) Just fixed this by bumping the max files in a zip to 10k as we do in 3.x
2) I'm punting on this for now. Some are better, some are worse.
3) See TIKA-4219
4) Turning to this now.
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830071#comment-17830071 ]

Tilman Hausherr commented on TIKA-4218:
---

There are also improvements not in my own test results, e.g. the "FOP" pdf file. Either something went wrong with my test, or with yours.
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830070#comment-17830070 ]

Tim Allison commented on TIKA-4218:
---

I started using a new file list based on an update 6? months ago: ...tools/tika/batch/new-file-lists/rand1.2million.csv which might explain why that file wasn't in your regression list.

We were using 2.0.29 in Tika 2.9.1, so the diff I saw was between 2.0.29 and 2.0.30.

The extracts are where you'd expect: .../extracts...

Again, this could be a Tika issue or an improvement. I haven't had a chance to look!

Thank you so much for looking into this!
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830069#comment-17830069 ]

Tilman Hausherr commented on TIKA-4218:
---

Weird indeed, 876503.pdf didn't appear in the PDFBox regression tests: https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.30_vs_2.0.31.tar.xz

I think we did make one (harmless) last-minute change after the regression tests, so I just ran ExtractText with both versions and saw no difference.
[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release
[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830063#comment-17830063 ]

Tim Allison commented on TIKA-4218:
---

https://corpora.tika.apache.org/base/reports/tika-2.9.2-pre-rc1-reports.tgz

Initial negative observations that require investigation:
1) some pptx are now being identified as tika-ooxml: commoncrawl3_refetched/HD/HDUTGEMEAGSGCJOTXREK77GYQKM3W5H3
2) some pdfs have less text: govdocs1/876/876503.pdf (this could be Tika's fault, not PDFBox's -- it could also be an improvement!)
3) epub+zip have many fewer "common tokens" -- this is caused by EncryptedExceptions being thrown in 2.9.2: commoncrawl3/47/47WOSBEUHE6CRMVDFBOOHUD36FEQAZ6T
4) it looks like a bunch of formats are now being identified (incorrectly) as x-tar, leading to exceptions: 646 appledouble, 289 microsoft icon, etc. There is a small handful of files that used to be identified as mp4 that are now being correctly handled as x-tar...
5) There are several regressions in x-xz handling: commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH

Initial positive observations:
1) some rfc822 have less junk, esp Persian language emails
2) some pdfs are much better
3) application/vnd.ms-htmlhelp look to be better
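Counts like "646 appledouble, 289 microsoft icon" in item 4 are tallies of files whose detected type changed between the two runs. A minimal sketch of that tally follows; it assumes each row is a (file id, mime type in run A, mime type in run B) tuple. The row layout, function name, and sample rows are hypothetical and are not taken from the actual reports.

```python
from collections import Counter


def mime_transitions(rows):
    """Count (mime_a, mime_b) pairs for files whose detected type changed."""
    return Counter((a, b) for _, a, b in rows if a != b)


# Hypothetical rows: (file id, type in the older run, type in the newer run).
rows = [
    ("file1", "image/vnd.microsoft.icon", "application/x-tar"),
    ("file2", "image/vnd.microsoft.icon", "application/x-tar"),
    ("file3", "application/pdf", "application/pdf"),
    ("file4", "video/mp4", "application/x-tar"),
]
changes = mime_transitions(rows)
top_change = changes.most_common(1)[0]
```

Sorting the transitions by frequency puts the largest detection shifts first, which makes it easier to separate a systematic misdetection from the handful of files that are now classified correctly.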