Support for marks in InputStream passed to Tika.detect

Lewis John Mcgibbney Fri, 13 Dec 2013 13:09:06 -0800

Hi,
I am wondering whether the concept of 'purifying' [0][1] is something which
may be of interest to the detect API in Tika.


Basically, in Any23 we have an interface which defines some logic to be
performed before MIME type detection takes place. The only implementation
we have right now is WhiteSpacesPurifier [1], which scans the start of the
InputStream (file) and removes whitespace before detection takes place.
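
To make the idea concrete, here is a rough sketch in plain Java of what
skipping leading whitespace on a mark-supporting stream could look like.
This is not the Any23 code and not anything in Tika; the class and method
names (LeadingWhitespaceSkipper, skipLeadingWhitespace) are made up for
illustration, and it assumes "whitespace" means only space, tab, CR and LF.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tika or Any23.
class LeadingWhitespaceSkipper {

    // Assumption: only these four ASCII bytes count as whitespace.
    static boolean isAsciiWhitespace(int b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    /**
     * Advances the stream past any leading whitespace bytes, leaving it
     * positioned at the first non-whitespace byte (or at end of stream).
     */
    static void skipLeadingWhitespace(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        while (true) {
            in.mark(1);              // remember the position before this byte
            int b = in.read();
            if (b == -1) {
                return;              // nothing but whitespace, or empty input
            }
            if (!isAsciiWhitespace(b)) {
                in.reset();          // un-read the first meaningful byte
                return;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "  \n\t<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        skipLeadingWhitespace(in);
        System.out.println((char) in.read()); // prints "<"
    }
}
```

A stream prepared this way could then be handed to whatever detection step
comes next, so that the magic bytes sit at offset 0 again.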

The reason I am asking is that, AFAIK, Tika will by default use the
MagicDetector to read a small number of bytes from the input source
before attempting MIME type detection. This is followed by a (slightly
naive) filename-based method of detection, hence the two parameters passed
to the Tika.detect(InputStream is, Metadata md)/(String st, Metadata md) methods.
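
For anyone unfamiliar with why marks matter here (as in the subject line):
magic-based detection has to peek at the head of the stream and then rewind
it, so the caller can still read the content from the beginning. The sketch
below shows that pattern in plain Java. It is not Tika's MagicDetector; the
MagicPeek class and its methods are hypothetical, and the "%PDF" prefix is
just an example signature.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of peek-and-rewind; not Tika's implementation.
class MagicPeek {

    /** Returns up to n leading bytes of the stream without consuming them. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);                  // allow rewinding after reading n bytes
        try {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r == -1) {
                    break;           // stream is shorter than n bytes
                }
                total += r;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset();              // rewind so the caller sees the full stream
        }
    }

    /** True if head begins with the given magic byte sequence. */
    static boolean startsWith(byte[] head, byte[] magic) {
        if (head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        InputStream clean = new ByteArrayInputStream(
            "%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
        InputStream padded = new ByteArrayInputStream(
            "  %PDF-1.4".getBytes(StandardCharsets.US_ASCII));
        System.out.println(startsWith(peek(clean, 4), magic));  // prints "true"
        System.out.println(startsWith(peek(padded, 4), magic)); // prints "false"
    }
}
```

The second case is the one I have in mind: a couple of stray whitespace
bytes at the front are enough to make a fixed-offset signature check miss.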

Would 'purification' be something which could improve the first
(magic-based) method, which takes only a few bytes as a clue? Or has this
already been implemented and I am missing it?

Sorry if the latter is the case!
Ta
Lewis

[0]
http://any23.apache.org/apidocs/index.html?org/apache/any23/mime/purifier/Purifier.html
[1]
http://any23.apache.org/apidocs/index.html?org/apache/any23/mime/purifier/WhiteSpacesPurifier.html

-- 
*Lewis*
