[ https://issues.apache.org/jira/browse/TIKA-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063641#comment-18063641 ]
Tim Allison commented on TIKA-4683:
-----------------------------------
I have some local updates to tika-eval that I'll push shortly. I had Claude
create a new summary .md file that it can read (lol), and it should be helpful
for humans too, if there are any left.
I also massively bumped the max JSON length, so we're now comparing a lot more
of the files.
This is Claude's summary:
{noformat}
The good news (A = 3.3.0 → B = 4.0.0b):
- 1.25M containers, ~5M files compared — massive dataset
- Most mime types have dice=1.0 (perfect match) — vast majority of content is
unchanged
- PDF (190K files): dice 0.9973 — very stable
- msword (177K): 0.9997, xls (120K): 0.9982 — excellent
- Fixed ~16K exceptions (oleobject → ms-equation being the biggest at 11.8K —
that's the OLE reclassification)
- x-xz exceptions dropped from 1,062 to 1 — nice fix
- gzip fixed 351 exceptions, zlib 217, msword 730
The concerning items:
1. PowerPoint regressions — biggest red flag:
- .pptx slideshows: exceptions jumped 3.1% → 19.2% (flagged YIKES!)
- .pptx presentations: 4.1% → 12.9%
- New exceptions in B: 859 pptx, 751 docx, 427 pptx-slideshow
- Dice for pptx slideshows only 0.8149, presentations 0.8717
2. Content Lost — 19 of top 20 are xlsx files going from 200K tokens to 0.
That's a real regression in spreadsheet extraction.
3. Content Gained — many application/octet-stream → text/plain changes. The
new text detection is pulling content from previously-undetected binary files.
Many have very low common_tokens (UTF-16LE with 13-34 common tokens out of
200K) — those look like false text detection on binary data.
4. Missing extracts in B — 142 NO_EXTRACT_FILE in B vs 0 in A. The top
missing extracts include large ppts, docx, and pdfs.
5. Extract file too long dropped from 570→470, which makes sense with the
higher limit.
{noformat}
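For readers unfamiliar with the dice figures in the summary: they are token-overlap scores between the text extracted by the two versions. The sketch below assumes the standard Sorensen-Dice coefficient over token multisets; the function name `dice` and the whitespace tokenization are illustrative only, and tika-eval's actual computation may differ (for example, in how it tokenizes or which tokens it counts).

```python
from collections import Counter


def dice(tokens_a, tokens_b):
    """Sorensen-Dice coefficient over token multisets: 2*|A & B| / (|A| + |B|)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        # Assumption: two empty extracts are treated as identical.
        return 1.0
    # Counter & Counter keeps the minimum count per token (multiset intersection).
    common = sum((counts_a & counts_b).values())
    return 2 * common / total


extract = "the quick brown fox".split()
print(dice(extract, extract))  # identical extracts -> 1.0
print(dice(extract, []))       # all content lost in the second extract -> 0.0
```

Under this definition, a file with only a few dozen common tokens out of roughly 200K extracted scores very close to 0, which is why such files stand out in the content-gained and content-lost lists.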
> Prep for 4.0.0-ALPHA release
> ----------------------------
>
> Key: TIKA-4683
> URL: https://issues.apache.org/jira/browse/TIKA-4683
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: reports.tar.gz
>
>