On Mon, 24 Nov 2014, Allison, Timothy B. wrote:
I recently ran Tika against the ~1 million files in govdocs1. About
91% (2,579/2,828) of the XLS exceptions via Tika 1.7 are the following.
Tika is detecting these as XLS and then the header exception is thrown.
You need to read that backwards to see the pattern: the file starts
with 0x090406.
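
To illustrate the "read it backwards" point, here is a minimal Python
sketch. It assumes the header check interprets the leading bytes as a
little-endian integer; the variable names are made up for illustration,
and only the three bytes quoted above are used.

```python
# First bytes of the file, in on-disk order (taken from the 0x090406 above).
leading = bytes([0x09, 0x04, 0x06])

# Interpreting those bytes as a little-endian integer reverses their order
# in the printed hex value, which is why the message reads "backwards".
as_little_endian = int.from_bytes(leading, "little")
print(f"0x{as_little_endian:06X}")  # 0x060409

# Writing the integer back out as little-endian recovers the on-disk order.
print(as_little_endian.to_bytes(3, "little").hex())  # 090406
```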
Does this header ring any bells? Old version of XLS, perhaps? The
triggering files open in Excel and I think I see that they are "Excel
4".
Sounds like one of the very old, pre-OLE2 versions.
Looking at the OpenOffice documentation, under sections 2.2 and 2.3:
http://www.openoffice.org/sc/excelfileformat.pdf
That suggests that Excel 5 onwards (5, 95, 97, etc.) used OLE2, so that'd
mean it's Excel 1 through Excel 4.
I can't get the link to work, but one triggering file is 004444.xls.
If you can get that file out, and raise a JIRA, then we can look to add in
magic to correctly detect/handle those files!
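
As a rough idea of what such magic might check, here is a hedged Python
sketch, not Tika's actual detector. The OLE2 signature bytes are the
standard ones; the BOF record ids and their version labels are my reading
of the OpenOffice document linked above and should be treated as
assumptions, as should the function name sniff_excel.

```python
# Standard 8-byte OLE2 compound document signature.
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")

# ASSUMPTION: pre-OLE2 workbooks start with a BOF record whose 16-bit
# little-endian id identifies the BIFF version (per the OpenOffice doc).
BIFF_BOF_IDS = {
    0x0009: "BIFF2 (Excel 2.x)",
    0x0209: "BIFF3 (Excel 3.0)",
    0x0409: "BIFF4 (Excel 4.0)",
}


def sniff_excel(header: bytes) -> str:
    """Classify a file by its first few bytes (hypothetical helper)."""
    if header.startswith(OLE2_SIGNATURE):
        return "OLE2"
    if len(header) >= 2:
        bof_id = int.from_bytes(header[:2], "little")
        if bof_id in BIFF_BOF_IDS:
            return BIFF_BOF_IDS[bof_id]
    return "unknown"


# 09 04 on disk is the little-endian encoding of 0x0409.
print(sniff_excel(bytes([0x09, 0x04, 0x06, 0x00])))
```

Two bytes make for a weak signature (0x09 0x00 is just a tab followed by
NUL), so a real detector would presumably want to check the record length
and a few following bytes as well.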
Nick