[ https://issues.apache.org/jira/browse/TIKA-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063641#comment-18063641 ]

Tim Allison commented on TIKA-4683:
-----------------------------------

I have some local updates to tika-eval that I'll push shortly. I had Claude 
create a new summary .md that it can read... lol, and it should be helpful for 
humans too, if there are any left.

I also massively bumped the max json length, so we're now comparing much more 
of each file's content.

This is Claude's summary:
{noformat}
The good news (3.3.0 → 4.0.0b):

  - 1.25M containers, ~5M files compared — massive dataset
  - Most mime types have dice=1.0 (perfect match) — the vast majority of content is unchanged
  - PDF (190K files): dice 0.9973 — very stable
  - msword (177K): 0.9997, xls (120K): 0.9982 — excellent
  - Fixed ~16K exceptions (oleobject → ms-equation being the biggest at 11.8K — that's the OLE reclassification)
  - x-xz exceptions dropped from 1,062 to 1 — nice fix
  - gzip fixed 351 exceptions, zlib 217, msword 730

  The concerning items:
  1. PowerPoint regressions — biggest red flag:
    - .pptx slideshows: exceptions jumped 3.1% → 19.2% (flagged YIKES!)
    - .pptx presentations: 4.1% → 12.9%
    - New exceptions in B: 859 pptx, 751 docx, 427 pptx-slideshow
    - Dice for pptx slideshows only 0.8149, presentations 0.8717
  2. Content Lost — 19 of top 20 are xlsx files going from 200K tokens to 0. 
That's a real regression in spreadsheet extraction.
  3. Content Gained — many application/octet-stream → text/plain changes. The 
new text detection is pulling content from previously-undetected binary files.
   Many have very low common_tokens (UTF-16LE with 13-34 common tokens out of 
200K) — those look like false text detection on binary data.
  4. Missing extracts in B — 142 NO_EXTRACT_FILE in B vs 0 in A. The top 
missing extracts include large ppts, docx, and pdfs.
  5. "Extract file too long" count dropped from 570 → 470, which makes sense 
with the higher limit.

{noformat}
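For anyone reading along, the "dice" numbers above are Dice coefficients over the token bags of the two extracts: 2 * |overlap| / (|A| + |B|), so 1.0 means the old and new extracts contain identical tokens. A minimal sketch of that metric (this is not tika-eval's actual implementation; the class and method names here are made up for illustration):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DiceSimilarity {

    // Dice coefficient over two bags of tokens:
    //   2 * |overlap| / (|a| + |b|)
    // where |overlap| counts the per-token minimum of the
    // two frequency counts, so repeated tokens matter.
    public static double dice(List<String> a, List<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty extracts: treat as identical
        }
        Map<String, Integer> countsA = counts(a);
        Map<String, Integer> countsB = counts(b);
        long overlap = 0;
        for (Map.Entry<String, Integer> e : countsA.entrySet()) {
            overlap += Math.min(e.getValue(),
                    countsB.getOrDefault(e.getKey(), 0));
        }
        return 2.0 * overlap / (a.size() + b.size());
    }

    private static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : tokens) {
            m.merge(t, 1, Integer::sum);
        }
        return m;
    }

    public static void main(String[] args) {
        List<String> oldExtract = Arrays.asList("the", "quick", "brown", "fox");
        List<String> newExtract = Arrays.asList("the", "quick", "red", "fox");
        // 3 shared tokens out of 4 + 4 -> 2*3/8 = 0.75
        System.out.println(dice(oldExtract, newExtract));
    }
}
```

This also makes the "Content Gained" false positives above legible: a binary file mis-decoded as UTF-16LE can produce 200K "tokens" while sharing only 13-34 of them with the old extract, which drives dice toward zero.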

> Prep for 4.0.0-ALPHA release
> ----------------------------
>
>                 Key: TIKA-4683
>                 URL: https://issues.apache.org/jira/browse/TIKA-4683
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: reports.tar.gz
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
