[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456688#comment-17456688 ]
Tim Allison commented on TIKA-3164:
-----------------------------------

I finally had time to run the regression tests against ~400k files. The reports are here: https://corpora.tika.apache.org/base/share/reports-poi-5.x.tgz

There are ~20 fixed exceptions. Two files have this new exception:
{noformat}
Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/ctcustomxmlblockd3c1type.xsb
{noformat}

There's a very small regression: in a handful of xlsx files, if there's a number in the last column of a row, it is not cleared before the content of the first cell of the next row is emitted. So we now get:
{noformat}
...<td>1.5</td></tr> <tr><td>1.5kultur...
{noformat}
where the previous output was:
{noformat}
...<td>1.5</td></tr> <tr><td>kultur...
{noformat}

I'll open an issue with POI and see if I can patch this at the Tika level for now.

> Upgrade to POI 5.0.0 when available
> -----------------------------------
>
>                 Key: TIKA-3164
>                 URL: https://issues.apache.org/jira/browse/TIKA-3164
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)