[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830974#comment-17830974
 ] 

Tim Allison commented on TIKA-4218:
---

Thank you, [~tilman]! The mp4 is weird because exiftool was run in 2.9.2, but 
our regular MP4Parser was run in 2.9.1. The bad text in the epub is a side 
effect of work on TIKA-4219. I've just pushed a fix. Thank you, again.

> Run regression tests to support 2.9.2 release
> -
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: 2.9.1-876503.pdf.json, 2.9.2-876503.pdf.json
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-26 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830954#comment-17830954
 ] 

Tilman Hausherr commented on TIKA-4218:
---

6FOMNUPGPA6IG66Z4NIUEQIVOR5ON46Q (an MP4 file) has a loss of metadata 
(bierenbach: 2 | earlier: 2 | https://www.facebook.com/speedlinecablecam: 2 | 
https://www.speedline-cablecam.com: 2 | in: 2 | of: 2 | the: 2 | this: 2 | 
woods: 2 | year: 2)

EEXR753OKDGYAIXL36PZ2EGYPN477SZU and a few other files have one word in 
TOP_10_MORE_IN_A which reappears in TOP_10_MORE_IN_B but with "oebps". Here, 
"secretary" becomes "secretaryoebps". I don't know if this is a bug or not.
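
For anyone who wants to reproduce this kind of comparison by hand, here is a minimal sketch of a "more in A than in B" token report. It is not tika-eval's actual implementation: the tokenizer, the function names and the two sample strings are assumptions made up for illustration.

```python
import re
from collections import Counter


def tokens(text):
    # Lowercased runs of letters; an assumed tokenizer, not the one tika-eval uses.
    return re.findall(r"[^\W\d_]+", text.lower())


def top_more_in(a_text, b_text, n=10):
    """Return up to n (token, surplus) pairs that occur more often in a_text than in b_text."""
    a, b = Counter(tokens(a_text)), Counter(tokens(b_text))
    surplus = a - b  # Counter subtraction keeps only positive differences
    return surplus.most_common(n)


# Hypothetical extracts: B has "oebps" glued onto a word that is clean in A.
extract_a = "The secretary signed. The secretary left."
extract_b = "The secretaryoebps signed. The secretaryoebps left."

print(top_more_in(extract_a, extract_b))  # [('secretary', 2)]
print(top_more_in(extract_b, extract_a))  # [('secretaryoebps', 2)]
```

A word that shows up in one list and reappears in the other with a suffix attached points at a missing separator in the extracted text rather than at lost content.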



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830934#comment-17830934
 ] 

Tim Allison commented on TIKA-4218:
---

I'll start building rc1. If we find any problems, we can cancel the vote and go 
with rc2.



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830930#comment-17830930
 ] 

Tim Allison commented on TIKA-4218:
---

The number of new attachments from pptx based on [~xyang200]'s work on 
TIKA-4211 is amazing: ~3500 new attachments across our corpus.



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830927#comment-17830927
 ] 

Tim Allison commented on TIKA-4218:
---

This is looking much, much better. I made a small change to the EpubParser that 
will prevent writing the names of font files into the main contents.

The fork of COMPRESS-674 helped out dramatically as did the other fixes.

I merged the recent mime detection updates for 3d files. I _think_ we're good 
to go with rc1 for 2.9.2.




[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830918#comment-17830918
 ] 

Tim Allison commented on TIKA-4218:
---

https://corpora.tika.apache.org/base/reports/tika-2.9.2b-prerc1-reports.tgz



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-25 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830604#comment-17830604
 ] 

Tilman Hausherr commented on TIKA-4218:
---

To be honest, I didn't look further, because these problems affected too many 
files. Yes, please rerun the tests so that whatever remains will stick out.



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830585#comment-17830585
 ] 

Tim Allison commented on TIKA-4218:
---

[~tilman] did you see any other blockers/surprises? Once I merge TIKA-4221, 
I'll rerun the regression tests if there's not anything else to fix.

I see you've updated pdfbox already! :D



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830105#comment-17830105
 ] 

Tilman Hausherr commented on TIKA-4218:
---

Follow up in TIKA-4171



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830097#comment-17830097
 ] 

Tilman Hausherr commented on TIKA-4218:
---

Confirmed, I reverted just that change and then the text view is longer and 
ends with "Enter the total number of pages being submitted, including cover 
sheets, attachments, and documents:" and in current 2.9.2 it ends with 
"disclosure to GSA shall not be used to make determinations about individuals."



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830094#comment-17830094
 ] 

Tilman Hausherr commented on TIKA-4218:
---

Oops, or it's part of XFA, I just found it too.



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830093#comment-17830093
 ] 

Tilman Hausherr commented on TIKA-4218:
---

I found one difference: "Enter the full name of the conveying party or parties" 
is in 2.9.1 but not in 2.9.2, and in 2.9.1 it appears directly after the main 
text. This text is in the first field (below "Name of conveying party(ies):") 
as the /TU entry, which one can get with {{getAlternateFieldName()}}. PDFBox and 
the PDF specification consider this to be an "alternative field name", and Adobe 
Reader displays it as a popup when the mouse hovers there.



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830090#comment-17830090
 ] 

Tim Allison commented on TIKA-4218:
---

Hmmm... Sorry, I didn't specify the exact 2.9.1 file: 
tika-2.9.1-rc1-rand1m/govdocs1/876/876503.pdf.json

That one does have "party" 62 times.



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830079#comment-17830079
 ] 

Tilman Hausherr commented on TIKA-4218:
---

The word "party" appears 36 times in the json file, 18 times in my text 
extraction, but 62 times in the csv file in the TOP_N_TOKENS_A row. The doubling 
in the json file is because of "xfa_content", but the "62" I don't understand.

Thanks for mentioning the new list (I probably missed it), I'll adjust my 
scripts and use them the next time.
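
One way to see where such counts come from is to tally the word per field of the json extract. The sketch below assumes the extract is a JSON array of objects with string values; the function name and the field names in the sample are made up for illustration and are not taken from the real 876503.pdf.json.

```python
import json
import re
from collections import Counter


def count_word_by_key(json_text, word):
    """Count whole-word, case-insensitive occurrences of word in each string field."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    counts = Counter()
    for record in json.loads(json_text):  # assumed shape: a JSON array of objects
        for key, value in record.items():
            if isinstance(value, str):
                counts[key] += len(pattern.findall(value))
    # Drop fields in which the word never occurs.
    return {key: n for key, n in counts.items() if n}


# Hypothetical extract in which the same text is present in two fields.
sample = json.dumps([{
    "body": "conveying party and receiving party",
    "form_text": "conveying party and receiving party",
    "Content-Type": "application/pdf",
}])

print(count_word_by_key(sample, "party"))  # {'body': 2, 'form_text': 2}
```

If the same text is stored in two places, the per-field counts make the doubling visible at a glance; a total that is still higher would have to come from a different file or a different tokenization.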



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830072#comment-17830072
 ] 

Tim Allison commented on TIKA-4218:
---

1) Just fixed this by bumping the max files in a zip to 10k as we do in 3.x
2) I'm punting on this for now. Some are better, some are worse.
3) See TIKA-4219
4) Turning to this now.



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830071#comment-17830071
 ] 

Tilman Hausherr commented on TIKA-4218:
---

There are also improvements not in my own test results, e.g. the "FOP" pdf 
file. Either something went wrong with my test, or with yours.



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830070#comment-17830070
 ] 

Tim Allison commented on TIKA-4218:
---

I started using a new file list based on an update 6? months ago: 
...tools/tika/batch/new-file-lists/rand1.2million.csv which might explain why 
that file wasn't in your regression list.

We were using 2.0.29 in Tika 2.9.1, so the diff I saw was between 2.0.29 and 
2.0.30. The extracts are where you'd expect: .../extracts...

Again, this could be a Tika issue or an improvement. I haven't had a chance to 
look!

Thank you so much for looking into this!



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830069#comment-17830069
 ] 

Tilman Hausherr commented on TIKA-4218:
---

Weird indeed, 876503.pdf didn't appear in the PDFBox regression tests:
https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.30_vs_2.0.31.tar.xz

I think we did make one (harmless) last-minute change after the regression 
tests, so I just ran ExtractText with both versions and saw no difference.



[jira] [Commented] (TIKA-4218) Run regression tests to support 2.9.2 release

2024-03-23 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830063#comment-17830063
 ] 

Tim Allison commented on TIKA-4218:
---

https://corpora.tika.apache.org/base/reports/tika-2.9.2-pre-rc1-reports.tgz

Initial negative observations that require investigation:
1) some pptx are now being identified as tika-ooxml: 
commoncrawl3_refetched/HD/HDUTGEMEAGSGCJOTXREK77GYQKM3W5H3
2) some pdfs have less text: govdocs1/876/876503.pdf (this could be Tika's 
fault, not PDFBox's -- it could also be an improvement!)
3) epub+zip have many fewer "common tokens" -- this is caused by 
EncryptedExceptions being thrown in 2.9.2: 
commoncrawl3/47/47WOSBEUHE6CRMVDFBOOHUD36FEQAZ6T
4) it looks like a bunch of formats are now being identified (incorrectly) as 
x-tar, leading to exceptions: 646 appledouble, 289 microsoft icon, etc. There 
is a small handful of files that used to be identified as mp4 that are now 
being correctly handled as x-tar...
5) There are several regressions in x-xz handling: 
commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH

Initial positive observations:
1) some rfc822 have less junk, esp Persian language emails
2) some pdfs are much better
3) application/vnd.ms-htmlhelp look to be better
