[ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060203#comment-15060203 ]
Nick Burch edited comment on TIKA-1813 at 12/16/15 3:58 PM:
------------------------------------------------------------

My best guess is that these have been truncated. Having a look with {{org.apache.poi.poifs.dev.POIFSHeaderDumper}} it certainly looks that way

was (Author: gagravarr):
My best guess is that these have been truncated. Having a look with {{{org.apache.poi.poifs.dev.POIFSHeaderDumper}}} it certainly looks that way

> Figure out file types for several unknown OLE files in Common Crawl
> -------------------------------------------------------------------
>
>                 Key: TIKA-1813
>                 URL: https://issues.apache.org/jira/browse/TIKA-1813
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB,
> 25JIANLV77U645GUSJ2E67YSM4B2TNSP, 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA
>
>
> We're getting around 300 exceptions from "application/x-tika-msoffice" files
> in our current slice of Common Crawl documents that look roughly like this:
> {noformat}
> java.lang.IllegalArgumentException: Position 86528 past the end of the file
>     at org.apache.poi.poifs.nio.FileBackedDataSource.read
> {noformat}
> I suspect these are non-MS OLE file formats. Any help identifying the file
> types and patching our OLE mime detector would be great.
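
For anyone wanting to test the truncation theory across the whole batch without running the header dumper on each file, a rough check could look like the sketch below. It is not POI or Tika code: the class name {{Ole2TruncationCheck}} and method {{looksTruncated}} are made up for illustration, and the header offsets are taken from the published compound file layout (sector shift at byte 30, first directory sector at byte 48, 109 DIFAT entries from byte 76). It only looks at sectors named directly in the 512-byte header, so a file can pass this check and still be truncated further along a chain.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: flags OLE2 files whose header points past the end of the file. */
final class Ole2TruncationCheck {
    // Signature bytes D0 CF 11 E0 A1 B1 1A E1, read as a little-endian long.
    private static final long OLE2_MAGIC = 0xE11AB1A1E011CFD0L;
    private static final int HEADER_SIZE = 512;
    private static final int DIFAT_OFFSET = 76;
    private static final int DIFAT_ENTRIES = 109;

    private Ole2TruncationCheck() {}

    /**
     * @param header     the first 512 bytes of the file
     * @param fileLength the total length of the file in bytes
     * @return true if the directory start sector or any FAT sector listed
     *         in the header would end beyond fileLength
     */
    static boolean looksTruncated(byte[] header, long fileLength) {
        if (header.length < HEADER_SIZE) {
            throw new IllegalArgumentException("Need at least 512 header bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getLong(0) != OLE2_MAGIC) {
            throw new IllegalArgumentException("Not an OLE2 file");
        }
        int sectorShift = buf.getShort(30);
        if (sectorShift != 9 && sectorShift != 12) {
            throw new IllegalArgumentException("Unexpected sector shift " + sectorShift);
        }
        long sectorSize = 1L << sectorShift;

        // First directory sector.
        if (endsPastFile(buf.getInt(48), sectorSize, fileLength)) {
            return true;
        }
        // FAT sector locations stored in the header's DIFAT array.
        for (int i = 0; i < DIFAT_ENTRIES; i++) {
            int sector = buf.getInt(DIFAT_OFFSET + 4 * i);
            if (endsPastFile(sector, sectorSize, fileLength)) {
                return true;
            }
        }
        return false;
    }

    private static boolean endsPastFile(int sector, long sectorSize, long fileLength) {
        // Negative values are the special markers (free sector, end of chain, ...).
        if (sector < 0) {
            return false;
        }
        // The header occupies the first sector-sized block, so sector n
        // starts at (n + 1) * sectorSize and ends one sector later.
        return (sector + 2L) * sectorSize > fileLength;
    }
}
```

With 512-byte sectors, position 86528 from the stack trace is exactly 169 * 512, i.e. the start of sector 168, which is what a read of a sector lying just beyond the end of a truncated file would look like.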