This is the result of false-positive MIME type detection. In the first case the file starts with "ID3", which is usually present at the start of MP3 (audio/mpeg) files. The other two files start with "P1" or "P4", which are the signatures found at the start of image/x-portable-bitmap files.
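To illustrate why this happens, here is a deliberately simplified prefix check. It is not Tika's actual detector; the class name, the guessType helper and the sample strings are made up for this sketch. It only shows how a plain-text file that happens to begin with one of these signatures can be mistaken for a binary format.

```java
import java.nio.charset.StandardCharsets;

// Simplified illustration only: NOT Tika's detector, just a prefix check
// built from the three signatures mentioned above.
class MagicPrefixDemo {

    // Hypothetical helper: guess a type from the leading bytes alone.
    static String guessType(byte[] head) {
        int len = Math.min(head.length, 3);
        String prefix = new String(head, 0, len, StandardCharsets.US_ASCII);
        if (prefix.startsWith("ID3")) {
            return "audio/mpeg";
        }
        if (prefix.startsWith("P1") || prefix.startsWith("P4")) {
            return "image/x-portable-bitmap";
        }
        // Nothing matched, so fall back to plain text.
        return "text/plain";
    }

    public static void main(String[] args) {
        // Made-up plain-text samples; the first two begin with a known signature.
        String[] samples = {
            "ID3 is a decision tree algorithm",
            "P1 denotes the first parameter",
            "Plain notes"
        };
        for (String s : samples) {
            byte[] head = s.getBytes(StandardCharsets.US_ASCII);
            System.out.println(guessType(head) + " <- " + s);
        }
    }
}
```

A check based on content alone cannot tell such text files apart from real MP3 or PBM files, so the detector needs either an extra hint or to be bypassed.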
You can either use the text parser directly or pass the filename via metadata using metadata.set(RESOURCE_NAME_KEY, filename).

Wed, 14 Oct 2015 at 12:08, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>:

> My apologies, here are the testing files attached.
>
> Begin forwarded message:
>
> *From: *Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
> *Date: *14 October 2015 at 10:06:33 BST
> *To: *user@tika.apache.org
> *Subject: **AutoDetectParser bug?*
>
> Hi
>
> There might be a bug with the AutoDetectParser, which fails to recognise
> some plain-text files as plain text.
>
> In the attachment are three testing files, as you can see they are all
> plain text.
>
> The following code is used for my testing:
>
> --------
>
> AutoDetectParser parser = new AutoDetectParser();
> for (File f : new File("/Users/-/work/jate/experiment/bugged_corpus").listFiles()) {
>     InputStream in = new BufferedInputStream(new FileInputStream(f.toString()));
>     BodyContentHandler handler = new BodyContentHandler(-1);
>     Metadata metadata = new Metadata();
>     try {
>         parser.parse(in, handler, metadata);
>         String content = handler.toString();
>         System.out.println(metadata); //line A
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
> }
>
> --------
>
> for the three testing files, I would expect line A to print “plain text”, in
> fact, it is printing the following:
>
> X-Parsed-By=org.apache.tika.parser.EmptyParser
> Content-Type=image/x-portable-bitmap
>
> X-Parsed-By=org.apache.tika.parser.DefaultParser
> X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3
> Content-Type=audio/mpeg
>
> X-Parsed-By=org.apache.tika.parser.EmptyParser
> Content-Type=image/x-portable-bitmap
>
> And as a result, variable “content” is always empty.
>
> Any suggestions on this please?
>
> Thanks

-- 
Best regards,
Konstantin Gribov