Tim Allison created TIKA-1813:
---------------------------------

             Summary: Figure out file types for several unknown OLE files in Common Crawl
                 Key: TIKA-1813
                 URL: https://issues.apache.org/jira/browse/TIKA-1813
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison
            Priority: Minor

We're getting around 300 exceptions from "application/x-tika-msoffice" files in our current slice of Common Crawl documents that look roughly like this:

{noformat}
java.lang.IllegalArgumentException: Position 86528 past the end of the file
	at org.apache.poi.poifs.nio.FileBackedDataSource.read
{noformat}

I suspect these are non-MS OLE file formats. Any help identifying the file types and patching our OLE mime detector would be great.
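
As a possible starting point for triage, here is a minimal sketch that checks whether a file opens with the OLE2 compound-file signature and reports the sector size declared in its header. It uses only the JDK; the class and method names (OleHeaderCheck, hasOle2Magic, sectorSize) are hypothetical and are not part of Tika or POI. It assumes the standard compound-file header layout, where the sector shift is a little-endian 16-bit value at byte offset 30.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical triage helper; not part of Tika or POI.
class OleHeaderCheck {
    // The 8-byte signature that opens an OLE2 compound file.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True when the buffer starts with the OLE2 signature.
    static boolean hasOle2Magic(byte[] head) {
        if (head.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (head[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Sector size is 2^shift, where shift is the little-endian
    // unsigned 16-bit value at offset 30 of the header.
    static int sectorSize(byte[] head) {
        if (head.length < 32) {
            throw new IllegalArgumentException("need at least 32 header bytes");
        }
        int shift = ByteBuffer.wrap(head)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getShort(30) & 0xFFFF;
        if (shift > 30) {
            throw new IllegalArgumentException("implausible sector shift: " + shift);
        }
        return 1 << shift;
    }

    // Prints one line per file path given on the command line.
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path p = Paths.get(arg);
            byte[] head = new byte[512];
            int n;
            try (InputStream in = Files.newInputStream(p)) {
                n = in.readNBytes(head, 0, head.length);
            }
            if (n < 32 || !hasOle2Magic(head)) {
                System.out.println(p + ": no OLE2 header");
                continue;
            }
            System.out.println(p + ": OLE2 header, sector size "
                    + sectorSize(head) + ", file length " + Files.size(p));
        }
    }
}
```

Running this over the failing files and comparing the reported file length with the position in the exception (86528 is a multiple of 512) might help separate truncated downloads from files that really are a different OLE-based format.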