Thank you, Nick! I'll post a file to Tika's JIRA. Or should I raise this on POI's Bugzilla instead? I can't imagine there's a burning need for (or much interest in adding) processing for pre-OLE2 docs.

-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Tuesday, November 25, 2014 9:20 AM
To: POI Users List
Subject: Re: Invalid header for xls: 0x0010000000060409?

On Mon, 24 Nov 2014, Allison, Timothy B. wrote:
> I recently ran Tika against the ~1 million files in govdocs1. Nearly
> 91% (2,579/2,828) of the XLS exceptions via Tika 1.7 are the following.
> Tika is detecting these as XLS and then the header exception is thrown.

You need to read that backwards to see the pattern, so the file starts
with 0x090406

> Does this header ring any bells? Old version of XLS, perhaps? The
> triggering files open in Excel and I think I see that they are "Excel
> 4".

Sounds like one of the very old, pre-ole2 versions

Looking at the OpenOffice documentation, under section 2.2 and 2.3:
http://www.openoffice.org/sc/excelfileformat.pdf

That suggests that Excel 5 onwards (5, 95, 97 etc) used OLE2, so that'd
mean it's Excel 1 through Excel 4

> I can't get the link to work, but one triggering file is 004444.xls.

If you can get that file out, and raise a JIRA, then we can look to add
in magic to correctly detect/handle those files!

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
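
To make the "read that backwards" point concrete, here is a minimal Java sketch of how the 8-byte header in the exception message maps back to the bytes at the start of the file. It is only an illustration: the class and method names (`BiffHeaderSniffer`, `headerAsLong`, `guessFormat`) are hypothetical and are not part of POI or Tika. The BOF record identifiers are taken from the BOF record description in the OpenOffice documentation linked above, and the OLE2 check assumes the usual D0 CF 11 E0 A1 B1 1A E1 signature.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not a POI or Tika class.
class BiffHeaderSniffer {
    // OLE2 signature bytes D0 CF 11 E0 A1 B1 1A E1, read as a little-endian long.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Interpret the first 8 bytes as a little-endian long, which is the
    // form the header takes in the "Invalid header" message.
    public static long headerAsLong(byte[] first8) {
        return ByteBuffer.wrap(first8, 0, 8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
    }

    // Guess the container or BIFF version from the first 8 bytes.
    public static String guessFormat(byte[] first8) {
        if (first8.length < 8) {
            return "too short";
        }
        if (headerAsLong(first8) == OLE2_SIGNATURE) {
            return "OLE2";
        }
        // A raw BIFF stream starts with a BOF record: a 2-byte record id
        // followed by a 2-byte length, both little-endian.
        int recordId = ByteBuffer.wrap(first8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getShort(0) & 0xFFFF;
        switch (recordId) {
            case 0x0009: return "BIFF2";
            case 0x0209: return "BIFF3";
            case 0x0409: return "BIFF4";
            case 0x0809: return "BIFF5 or later, without an OLE2 container";
            default:     return "unknown";
        }
    }

    public static void main(String[] args) {
        // The file starts with 09 04 06 00 ..., i.e. the header read backwards.
        byte[] header = {0x09, 0x04, 0x06, 0x00, 0x00, 0x00, 0x10, 0x00};
        System.out.printf("0x%016X%n", headerAsLong(header)); // prints 0x0010000000060409
        System.out.println(guessFormat(header));              // prints BIFF4
    }
}
```

Read this way, the leading 09 04 is a BOF record id of 0x0409 and the following 06 00 is its length, which fits the "Excel 4" observation. Real detection would need to look at more than the first record, so treat this as a starting point rather than a finished detector.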
