Many thanks.
As for Bugzilla, I was unable to create a new bug: it says “first you
must pick a product…” and Tika is not in the list.
> On 14 Oct 2015, at 10:40, Konstantin Gribov <gros...@gmail.com> wrote:
>
> This is the result of a false-positive MIME type detection. In the first case
> the file starts with "ID3", which is usually present in MP3 (audio/mpeg) files.
> The other two files start with P1 or P4, which appear at the start of
> image/x-portable-bitmap files.
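A minimal sketch of the prefix matching described above, using only the Java standard library. This is a toy illustration and not Tika's actual detector: the class name `MagicSniffDemo`, the `sniff` method, the three-entry signature table, and the sample strings are all assumptions made for the example. It only shows why a plain-text file whose first characters happen to be "ID3", "P1" or "P4" can be mistaken for a binary format when detection looks at leading bytes alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class MagicSniffDemo {
    // Hypothetical signature table for illustration only; a real detector
    // uses a much larger, prioritised set of rules.
    private static final Map<String, String> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("ID3", "audio/mpeg");
        SIGNATURES.put("P1", "image/x-portable-bitmap");
        SIGNATURES.put("P4", "image/x-portable-bitmap");
    }

    // Returns the type of the first signature that matches the leading bytes,
    // falling back to text/plain when nothing matches.
    static String sniff(byte[] content) {
        int len = Math.min(content.length, 8);
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost.
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> e : SIGNATURES.entrySet()) {
            if (head.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        // Made-up plain-text samples, not the attached test files.
        String[] samples = {
            "ID3 builds a decision tree from labelled data",
            "P4 is the fourth parameter in this example",
            "An ordinary sentence"
        };
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
            System.out.println(sniff(bytes) + " <- " + s);
        }
    }
}
```

The first two samples are ordinary text, yet they are classified by their opening characters alone, which is the same kind of false positive reported for the three files.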
>
> You can either use the text parser directly or pass the filename via metadata
> using metadata.set(RESOURCE_NAME_KEY, filename).
>
> On Wed, 14 Oct 2015 at 12:08, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk
> <mailto:ziqi.zh...@sheffield.ac.uk>> wrote:
> My apologies, here are the testing files attached.
>
>
>
>> Begin forwarded message:
>>
>> From: Ziqi Zhang <ziqi.zh...@sheffield.ac.uk
>> <mailto:ziqi.zh...@sheffield.ac.uk>>
>> Date: 14 October 2015 at 10:06:33 BST
>> To: user@tika.apache.org <mailto:user@tika.apache.org>
>> Subject: AutoDetectParser bug?
>
>>
>> Hi
>>
>> There might be a bug with the AutoDetectParser, which fails to recognise
>> some plain-text files as plain text.
>>
>> Attached are three test files; as you can see, they are all plain
>> text.
>>
>> The following code is used for my testing:
>>
>>
>> AutoDetectParser parser = new AutoDetectParser();
>> File dir = new File("/Users/-/work/jate/experiment/bugged_corpus");
>> for (File f : dir.listFiles()) {
>>     BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
>>     Metadata metadata = new Metadata();
>>     // try-with-resources closes the stream after each file
>>     try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
>>         parser.parse(in, handler, metadata);
>>         String content = handler.toString();
>>         System.out.println(metadata); // line A
>>     } catch (Exception e) {
>>         e.printStackTrace();
>>     }
>> }
>>
>> For the three test files, I would expect line A to report “plain text”;
>> in fact, it prints the following:
>> X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap
>> X-Parsed-By=org.apache.tika.parser.DefaultParser X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 Content-Type=audio/mpeg
>> X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap
>>
>> And as a result, variable “content” is always empty.
>>
>> Any suggestions on this please?
>>
>> Thanks
>>
>
> --
> Best regards,
> Konstantin Gribov