[ 
https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456688#comment-17456688
 ] 

Tim Allison commented on TIKA-3164:
-----------------------------------

I finally had time to run the regression tests against ~400k files.  The 
reports are here: https://corpora.tika.apache.org/base/share/reports-poi-5.x.tgz

There are ~20 fixed exceptions.

Two files have this new exception:
{noformat}
Could not locate compiled schema resource 
org/apache/poi/schemas/ooxml/system/ooxml/ctcustomxmlblockd3c1type.xsb
{noformat}

There's a very small regression: in a handful of xlsx files, if there's a 
number in the last column of a row, it is not cleared before the content of 
the first cell of the next row is written.  So we get:
{noformat}
...<td>1.5</td></tr>
<tr><td>1.5kultur...
{noformat}
instead of
{noformat}
...<td>1.5</td></tr>
<tr><td>kultur...
{noformat}

I'll open an issue with POI and see if I can patch this at the Tika level for 
now.
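
To make the symptom concrete, here is a minimal sketch of how a value that is 
not cleared at the end of a row could end up prepended to the first cell of the 
next row.  This is not POI's or Tika's actual code, and the real cause has not 
been identified in this comment; the class, the method and the "pending" 
buffer are hypothetical names used only for illustration.  As a further 
simplification, the sketch leaks any last-cell value, not only numbers.

```java
import java.util.List;

final class StaleCellBufferSketch {

    /**
     * Renders rows as simple HTML table markup.
     * When clearAtRowEnd is false, the value of the last cell in a row stays
     * in the buffer and is emitted again at the start of the next row.
     */
    static String render(List<List<String>> rows, boolean clearAtRowEnd) {
        StringBuilder out = new StringBuilder();
        // Hypothetical per-cell buffer that should be reset after every cell.
        StringBuilder pending = new StringBuilder();
        for (List<String> row : rows) {
            out.append("<tr>");
            for (int i = 0; i < row.size(); i++) {
                pending.append(row.get(i));
                out.append("<td>").append(pending).append("</td>");
                boolean lastCell = (i == row.size() - 1);
                // Assumed bug: the reset is skipped for the last cell of a row.
                if (!lastCell || clearAtRowEnd) {
                    pending.setLength(0);
                }
            }
            out.append("</tr>\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("a", "1.5"),
                List.of("kultur", "2"));
        System.out.println("Without clearing at row end:");
        System.out.print(render(rows, false));
        System.out.println("With clearing at row end:");
        System.out.print(render(rows, true));
    }
}
```

Under these assumptions, a workaround at the Tika level would amount to making 
sure any such buffer is reset when a row ends, whichever component owns it.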

> Upgrade to POI 5.0.0 when available
> -----------------------------------
>
>                 Key: TIKA-3164
>                 URL: https://issues.apache.org/jira/browse/TIKA-3164
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)
