All,
  I recently ran Tika against the ~1 million files in govdocs1.  About 91% 
(2,579/2,828) of the XLS exceptions from Tika 1.7 are the one below.  Tika 
detects these files as XLS, and then the header exception is thrown.
  Does this header ring any bells?  An old version of XLS, perhaps?  The 
triggering files open in Excel, and as far as I can tell they are "Excel 4".
  I can't get the link to work, but one triggering file is 004444.xls.

          Best,

                   Tim


Caused by: java.io.IOException: Invalid header signature; read 0x0010000000060409, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:198)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:162)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 13 more
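
For what it's worth, here is a minimal sketch of how the 8 bytes in the message could be checked by hand. It assumes the value in the exception is the first 8 bytes of the file read as a little-endian long, and that a file with no OLE2 wrapper starts directly with a BIFF "BOF" record whose 2-byte id is 0x0009 (BIFF2), 0x0209 (BIFF3) or 0x0409 (BIFF4). HeaderSniff and classify are hypothetical names for illustration, not part of POI or Tika.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class HeaderSniff {
    // OLE2 magic as a little-endian long, copied from the exception text.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    /** Classifies the first 8 bytes of a file. Hypothetical helper. */
    static String classify(byte[] first8) {
        if (first8 == null || first8.length < 8) {
            return "too short";
        }
        ByteBuffer buf = ByteBuffer.wrap(first8).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getLong(0) == OLE2_SIGNATURE) {
            return "OLE2";
        }
        // Assumption: a raw BIFF stream starts with a BOF record,
        // and the record id is the first little-endian 16-bit value.
        int recordId = buf.getShort(0) & 0xFFFF;
        switch (recordId) {
            case 0x0009: return "BIFF2";
            case 0x0209: return "BIFF3";
            case 0x0409: return "BIFF4";
            default:     return "unknown";
        }
    }

    public static void main(String[] args) {
        // 0x0010000000060409 from the trace, laid out in little-endian byte order.
        byte[] header = {0x09, 0x04, 0x06, 0x00, 0x00, 0x00, 0x10, 0x00};
        System.out.println(classify(header));
    }
}
```

Under those assumptions the header from the trace begins with record id 0x0409, which would be consistent with the "Excel 4" guess above. Only the 8 bytes quoted in the exception are used here; nothing else about 004444.xls is assumed.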
