Oh, so reading the stream doesn't read the whole file? I know that for Office files you can tell it's an Office file from the first dozen or so bytes, but you have to read the second 512-byte block to find out more. The stream doesn't do that?
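
For reference, this is roughly the kind of first-bytes check I have in mind. It is a minimal sketch in plain Java, not Tika's actual code; `Ole2Sniffer` and `looksLikeOle2` are names I made up for illustration. It only compares the 8-byte OLE2 signature at the start of the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class Ole2Sniffer {
    // The 8-byte signature at the start of an OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the stream starts with the OLE2 signature. */
    public static boolean looksLikeOle2(InputStream in) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until
        // the buffer is full or the stream ends.
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        return n == head.length && Arrays.equals(head, OLE2_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // Signature followed by zero padding, standing in for a real file.
        byte[] fakeOle2 = Arrays.copyOf(OLE2_MAGIC, 512);
        byte[] zipLike = {'P', 'K', 3, 4};
        System.out.println(looksLikeOle2(new ByteArrayInputStream(fakeOle2))); // true
        System.out.println(looksLikeOle2(new ByteArrayInputStream(zipLike)));  // false
    }
}
```

Word, Excel and PowerPoint binary files all start with this same signature, so a check like this can say "OLE2" but cannot say which kind.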

-----Original Message-----
From: Nick Burch <[email protected]>
Sent: Tuesday, December 22, 2020 4:40 PM
To: [email protected]
Subject: Re: Mimetypes

On Tue, 22 Dec 2020, Peter Kronenberg wrote:
> I'm trying to detect the mimetype of a file using both
>
> Tika.detect(InputStream)
> and
> Tika.detect(File)
>
> I get 2 different results. I'm testing with a Microsoft Word (.doc) file.

The InputStream one is based on just the first few kb of the file. That's enough to figure out it's an OLE2 file, but not what flavour.

The File one reads the whole file, checks the OLE2 directory entries, and identifies that you have a Word file.

If you gave Tika the InputStream + filename on a Metadata object, it would specialise the OLE2 type to Word based on the extension.

If you gave Tika a TikaInputStream, it would detect that a File was needed for a fully precise answer, spool the Stream to a File, then use that to detect (and later parse if you need).

Nick
