Hi

There might be a bug in AutoDetectParser: it fails to recognise some 
plain-text files as plain text.

Attached are three test files; as you can see, they are all plain 
text.

The following code is used for my testing:

————————
AutoDetectParser parser = new AutoDetectParser();
for (File f : new File("/Users/-/work/jate/experiment/bugged_corpus").listFiles()) {
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
    Metadata metadata = new Metadata();
    // try-with-resources closes the stream after each file
    try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
        parser.parse(in, handler, metadata);
        String content = handler.toString();
        System.out.println(metadata); //line A
    } catch (Exception e) {
        e.printStackTrace();
    }
}
————————
For the three test files, I would expect line A to print “plain text”. 
Instead, it prints the following:
X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap
X-Parsed-By=org.apache.tika.parser.DefaultParser X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 Content-Type=audio/mpeg
X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap

As a result, the variable “content” is always empty.
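
One way to narrow this down might be to look at the first few bytes of each file. This is only a sketch under an assumption that has not been confirmed here: that the detection is driven by the leading bytes of the file rather than by its name. For example, the Netpbm bitmap format starts with the ASCII characters P1 or P4, so a text file that happens to begin that way could plausibly be mistaken for one. The class LeadingBytes and its helpers head and describe are hypothetical names made up for illustration; they use only the Java standard library and do not call Tika.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical diagnostic helper, not part of Tika.
class LeadingBytes {

    /** Reads at most n bytes from the start of the file. */
    public static byte[] head(Path file, int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (off < n) {
                int r = in.read(buf, off, n - off);
                if (r < 0) {
                    break;
                }
                off += r;
            }
        }
        return Arrays.copyOf(buf, off);
    }

    /** Renders bytes as hex followed by their printable ASCII form ('.' for the rest). */
    public static String describe(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        StringBuilder text = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            hex.append(String.format("%02X ", v));
            text.append(v >= 0x20 && v < 0x7F ? (char) v : '.');
        }
        return hex.toString().trim() + "  |" + text + "|";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LeadingBytes <directory>");
            return;
        }
        File[] files = new File(args[0]).listFiles();
        if (files == null) {
            System.err.println("Not a readable directory: " + args[0]);
            return;
        }
        for (File f : files) {
            if (f.isFile()) {
                // 16 bytes is an arbitrary choice, enough to spot a short signature
                System.out.println(f.getName() + ": " + describe(head(f.toPath(), 16)));
            }
        }
    }
}
```

Running it with the corpus directory as its only argument prints one line per file. If the misdetected files turn out to start with byte sequences that look like a known binary signature, that would support the assumption above; if not, the cause is probably elsewhere.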

Any suggestions on this please?

Thanks
