This is the result of false-positive MIME type detection. In the first case,
the file starts with "ID3", which is usually present at the start of MP3
(audio/mpeg) files. The other two files start with "P1" or "P4", which are
the markers found at the start of image/x-portable-bitmap files.
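
To illustrate how a text file can be mistaken for a binary format, here is a minimal sketch of prefix-based sniffing. It is a toy model, not Tika's actual detector: the class name MagicSniffDemo, the method names, and the three-entry prefix table are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;

public class MagicSniffDemo {

    // Hypothetical, heavily simplified prefix table. A real detector uses
    // many more rules; this only mirrors the three prefixes discussed above.
    static String sniff(byte[] head) {
        if (startsWith(head, "ID3")) {
            return "audio/mpeg";
        }
        if (startsWith(head, "P1") || startsWith(head, "P4")) {
            return "image/x-portable-bitmap";
        }
        return "text/plain";
    }

    // Returns true when the data begins with the ASCII bytes of the marker.
    static boolean startsWith(byte[] data, String marker) {
        byte[] m = marker.getBytes(StandardCharsets.US_ASCII);
        if (data.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // All three samples are plain text, but two begin with a marker.
        String[] samples = {
            "ID3 is a tagging format for audio files.",
            "P1 was the first prototype we built.",
            "Hello, this is ordinary text."
        };
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(sniff(bytes) + " <- " + s);
        }
    }
}
```

Any plain-text file whose first bytes happen to match a marker is classified by that marker, which is the situation described above.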

You can either use the text parser directly or pass the filename via metadata
using metadata.set(RESOURCE_NAME_KEY, filename), which gives the detector the
file name as an additional hint.

Wed, 14 Oct 2015 at 12:08, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>:

> My apologies, here are the testing files attached.
>
>
>
> Begin forwarded message:
>
> *From: *Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
> *Date: *14 October 2015 at 10:06:33 BST
> *To: *user@tika.apache.org
> *Subject: **AutoDetectParser bug?*
>
>
> Hi
>
> There might be a bug with the AutoDetectParser, which fails to recognise
> some plain-text files as plain text.
>
> In the attachment are three testing files, as you can see they are all
> plain text.
>
> The following code is used for my testing:
>
> ————————
>
> AutoDetectParser parser = new AutoDetectParser();
> for (File f : new File("/Users/-/work/jate/experiment/bugged_corpus").listFiles()) {
>     InputStream in = new BufferedInputStream(new FileInputStream(f.toString()));
>     BodyContentHandler handler = new BodyContentHandler(-1);
>     Metadata metadata = new Metadata();
>     try {
>
>         parser.parse(in, handler, metadata);
>         String content = handler.toString();
>         System.out.println(metadata); //line A
>     }catch (Exception e){
>         e.printStackTrace();
>     }
> }
>
> ————————
>
> For the three testing files, I would expect line A to print “plain text”;
> in fact, it prints the following:
>
> X-Parsed-By=org.apache.tika.parser.EmptyParser 
> Content-Type=image/x-portable-bitmap
>
> X-Parsed-By=org.apache.tika.parser.DefaultParser 
> X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 
> Content-Type=audio/mpeg
>
> X-Parsed-By=org.apache.tika.parser.EmptyParser 
> Content-Type=image/x-portable-bitmap
>
>
> And as a result, variable “content” is always empty.
>
>
> Any suggestions on this please?
>
>
> Thanks
>
>
> --
Best regards,
Konstantin Gribov
