On Mon, 24 Nov 2014, Allison, Timothy B. wrote:
> I recently ran Tika against the ~1 million files in govdocs1. About 91% (2,579 of 2,828) of the XLS exceptions from Tika 1.7 are the following. Tika detects these files as XLS, and then the header exception is thrown.

The header value in that exception is printed as a little-endian number, so you need to read it backwards to see the pattern: the file starts with the bytes 0x09 0x04 0x06.
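To illustrate the "read it backwards" point, here is a minimal Python sketch. It assumes the signature is read as a little-endian 64-bit integer and printed in hex, which is how the well-known OLE2 magic bytes D0 CF 11 E0 A1 B1 1A E1 end up displayed as 0xE11AB1A1E011CFD0. The function names are made up for this example, and the second value in the usage note is a hypothetical one chosen to end in 060409, not the actual value from the exception.

```python
import struct

# First 8 bytes of any OLE2 compound document, in on-disk order.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def signature_as_reported(first8: bytes) -> str:
    """Format 8 header bytes the way a little-endian 64-bit read prints them."""
    (value,) = struct.unpack("<Q", first8)
    return f"0x{value:016X}"


def bytes_from_reported(hex_value: str) -> bytes:
    """Recover the on-disk byte order from a reported hex signature."""
    return int(hex_value, 16).to_bytes(8, "little")


if __name__ == "__main__":
    print(signature_as_reported(OLE2_MAGIC))
    # Hypothetical reported value, for illustration only.
    print(bytes_from_reported("0x0000000000060409").hex(" "))
```

Feeding a reported value through bytes_from_reported() gives the bytes in the order they appear at the start of the file, which is the order any magic-based detection would need to match.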

> Does this header ring any bells? Old version of XLS, perhaps? The triggering files open in Excel and I think I see that they are "Excel 4".

Sounds like one of the very old, pre-OLE2 versions.

Looking at the OpenOffice documentation, under sections 2.2 and 2.3:
http://www.openoffice.org/sc/excelfileformat.pdf

That suggests that Excel 5 onwards (5, 95, 97, etc.) used OLE2, so these files would come from a version before that, i.e. Excel 4 or earlier, which fits with the "Excel 4" you're seeing.

> I can't get the link to work, but one triggering file is 004444.xls.

If you can extract that file and raise a JIRA issue, then we can look at adding magic to correctly detect and handle those files!
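As a rough idea of what such magic could look like, here is a small Python sketch. It is not how Tika or POI do detection. It assumes, based on my reading of the BOF record description in the OpenOffice document, that a pre-OLE2 file starts directly with a BOF record whose little-endian identifier is 0x0009 (BIFF2), 0x0209 (BIFF3) or 0x0409 (BIFF4), followed by a 16-bit record length. The function name and return strings are made up for the example.

```python
import struct
from typing import Optional

# First 8 bytes of an OLE2 compound document (Excel 5 and later workbooks).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Assumed BOF record identifiers for the pre-OLE2 formats.
OLD_BOF_IDS = {
    0x0009: "BIFF2",
    0x0209: "BIFF3",
    0x0409: "BIFF4",
}


def sniff_excel_header(header: bytes) -> Optional[str]:
    """Guess the container type from the first bytes of a file.

    Returns "OLE2", one of the old BIFF names, or None if nothing matches.
    """
    if header[:8] == OLE2_MAGIC:
        return "OLE2"
    if len(header) >= 4:
        # Record id and record length, both little-endian 16-bit values.
        record_id, _record_len = struct.unpack_from("<HH", header, 0)
        return OLD_BOF_IDS.get(record_id)
    return None


if __name__ == "__main__":
    # A header starting 09 04 06, as in the files discussed here.
    print(sniff_excel_header(bytes.fromhex("0904060000001000")))
```

A real detector would probably also want to sanity-check the record length and the BOF contents, since a two-byte match on its own is fairly weak.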

Nick
