Hi,

I am wondering whether the concept of 'purifying' [0][1] might be of interest for the detect API in Tika.

Basically we have an interface which defines some logic to be performed before MIME type detection takes place. The only implementation we have right now is WhiteSpacesPurifier [1], which scans the head of the input stream (file) and removes whitespace found there before detection takes place.

The reason I am asking is that, AFAIK, Tika by default uses the MagicDetector to read a minimal number of bytes from the input source before attempting MIME type detection. This is followed by a (slightly naive) filename-based method of detection, hence the two parameters passed to the Tika.detect(InputStream is, Metadata md)/(String st, Metadata md) methods.

Would 'purification' be something which could improve the first (MagicDetector) method, which takes those minimal bytes as a clue? Or has this already been implemented and I am missing it? Sorry if the latter is the case!

Ta
Lewis

[0] http://any23.apache.org/apidocs/index.html?org/apache/any23/mime/purifier/Purifier.html
[1] http://any23.apache.org/apidocs/index.html?org/apache/any23/mime/purifier/WhiteSpacesPurifier.html

--
*Lewis*
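
P.S. To make the idea concrete, here is a rough sketch of the kind of purification step I have in mind. It is not the Any23 implementation and not Tika code: the class LeadingWhitespaceSkipper and its skipLeadingWhitespace method are names made up for illustration, and the 1024-byte scan limit is an arbitrary assumption. It only uses java.io mark/reset to step past leading whitespace, so that whatever detector runs next sees the first meaningful bytes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Tika or Any23. */
public class LeadingWhitespaceSkipper {

    /**
     * Advances a mark-supporting stream past leading whitespace,
     * examining at most maxScan bytes. The first non-whitespace byte
     * is left unread for whatever detector consumes the stream next.
     */
    public static void skipLeadingWhitespace(InputStream in, int maxScan) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        for (int scanned = 0; scanned < maxScan; scanned++) {
            in.mark(1);          // remember the position before this byte
            int b = in.read();
            if (b == -1) {
                return;          // stream was empty or contained only whitespace
            }
            if (b != ' ' && b != '\t' && b != '\n' && b != '\r') {
                in.reset();      // push the non-whitespace byte back
                return;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = " \n\t<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        skipLeadingWhitespace(in, 1024);
        // At this point the stream would be handed to detection.
        System.out.println((char) in.read()); // prints "<"
    }
}
```

The stream would then go to Tika.detect(InputStream, Metadata) as usual; whether that actually helps the magic-based step is exactly the question above.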