Oh, so reading the stream doesn't read the whole file?  I know for Office files 
you can tell it's an Office file from the first dozen or so bytes, but you have 
to read the second 512-byte block to find out more.  The stream doesn't do that?

-----Original Message-----
From: Nick Burch <[email protected]> 
Sent: Tuesday, December 22, 2020 4:40 PM
To: [email protected]
Subject: Re: Mimetypes

On Tue, 22 Dec 2020, Peter Kronenberg wrote:
> I'm trying to detect the mimetype of a file using both
>
> Tika.detect(InputStream)
> and
> Tika.detect(File)
>
> I get 2 different results.  I'm testing with a Microsoft Word (.doc) file.

The InputStream one is based on just the first few KB of the file. That's 
enough to figure out it's an OLE2 file, but not what flavour.
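
To illustrate why the start of the stream only gets you as far as "OLE2", 
here is a minimal sketch in plain Java (not Tika's own code). Legacy .doc, 
.xls and .ppt files all begin with the same 8-byte OLE2 signature, so a 
check like this cannot tell them apart. The class and method names are 
made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class Ole2Sniffer {
    // The 8-byte signature at the start of every OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Peeks at the first 8 bytes, then rewinds so the caller can still read the stream. */
    public static boolean looksLikeOle2(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_MAGIC.length);
        try {
            byte[] head = in.readNBytes(OLE2_MAGIC.length); // needs Java 11+
            return Arrays.equals(head, OLE2_MAGIC);
        } finally {
            in.reset();
        }
    }
}
```

A true result here says "some OLE2 container" and nothing more, which is 
the generic answer you are seeing from the plain stream.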

The File one reads the whole file, checks the OLE2 directory entries, and 
identifies that you have a Word file.


If you gave Tika the InputStream + filename on a Metadata object, it would 
specialise the OLE2 type to Word based on the extension.

If you gave Tika a TikaInputStream, it would detect that a File was needed 
for a fully precise answer, spool the Stream to a File, then use that to 
detect (and later parse if you need).
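
A rough sketch of those last two options, assuming Tika is on the classpath 
(the container-aware OLE2 detection lives in the parsers artifact, not 
tika-core, so that is assumed to be present as well). The class and method 
names are made up for the example, and the sketch uses the 
Tika.detect(InputStream, String) convenience overload, which records the 
filename for you, rather than building the Metadata object by hand.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;

public class DetectSketch {
    private static final Tika TIKA = new Tika();

    /** Stream plus filename: the extension is used as a hint to narrow the OLE2 type. */
    public static String detectWithNameHint(InputStream in, String fileName) throws IOException {
        return TIKA.detect(in, fileName);
    }

    /** TikaInputStream: lets Tika spool the stream to a temporary file if a detector needs one. */
    public static String detectWithSpooling(InputStream in) throws IOException {
        // Closing the TikaInputStream also cleans up any temporary file it created.
        try (TikaInputStream tis = TikaInputStream.get(in)) {
            return TIKA.detect(tis);
        }
    }
}
```

The name hint is cheap but trusts the extension; the TikaInputStream route 
costs a temporary file but looks at the actual contents.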

Nick
