Many thanks

As for Bugzilla, I was unable to create a new bug: it says “first you 
must pick a product…” and Tika is not in the list.
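
The false positive explained in the quoted reply below can be illustrated with a minimal sketch. It assumes a detector that only compares the first bytes of a file against a table of known prefixes. The class name MagicSniff and its three-entry table are hypothetical and much simpler than Tika's real magic database, so treat this as an illustration of the idea, not as Tika's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class MagicSniff {
    // Hypothetical prefix table for illustration only; NOT Tika's real magic rules.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("ID3", "audio/mpeg");              // ID3 tag at the start of many MP3 files
        MAGIC.put("P1", "image/x-portable-bitmap");  // ASCII PBM
        MAGIC.put("P4", "image/x-portable-bitmap");  // binary PBM
    }

    // Returns the type of the first matching prefix, or text/plain if none matches.
    static String sniff(byte[] head) {
        String start = new String(head, 0, Math.min(head.length, 8),
                StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (start.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        // Made-up plain-text samples that happen to begin with a magic prefix.
        String[] samples = {
            "ID3 is a decision tree algorithm.",
            "P1 and P2 are proteins.",
            "Hello world"
        };
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(sniff(bytes) + " <- " + s);
        }
    }
}
```

With a prefix-only check like this, any plain-text file whose first word is "ID3", "P1" or "P4" is classified by its opening bytes and not by its actual content, which matches the behaviour reported in the thread.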



> On 14 Oct 2015, at 10:40, Konstantin Gribov <gros...@gmail.com> wrote:
> 
> This is the result of a false-positive MIME type detection. In the first case 
> the file starts with "ID3", which is usually present in MP3 (audio/mpeg) 
> files. The other two files start with "P1" or "P4", which appear at the start 
> of image/x-portable-bitmap files.
> 
> You can either use the text parser directly or pass the filename via metadata 
> using metadata.set(RESOURCE_NAME_KEY, filename).
> 
> On Wed, 14 Oct 2015 at 12:08, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk 
> <mailto:ziqi.zh...@sheffield.ac.uk>> wrote:
> My apologies, here are the testing files attached.
> 
> 
> 
>> Begin forwarded message:
>> 
>> From: Ziqi Zhang <ziqi.zh...@sheffield.ac.uk 
>> <mailto:ziqi.zh...@sheffield.ac.uk>>
>> Date: 14 October 2015 at 10:06:33 BST
>> To: user@tika.apache.org <mailto:user@tika.apache.org>
>> Subject: AutoDetectParser bug?
> 
>> 
>> Hi
>> 
>> There might be a bug with the AutoDetectParser, which fails to recognise 
>> some plain-text files as plain text.
>> 
>> The attachment contains three test files; as you can see, they are all plain 
>> text.
>> 
>> The following code is used for my testing:
>> 
>> ————————
>> AutoDetectParser parser = new AutoDetectParser();
>> for (File f : new File("/Users/-/work/jate/experiment/bugged_corpus").listFiles()) {
>>     InputStream in = new BufferedInputStream(new FileInputStream(f.toString()));
>>     BodyContentHandler handler = new BodyContentHandler(-1);
>>     Metadata metadata = new Metadata();
>>     try {
>> 
>>         parser.parse(in, handler, metadata);
>>         String content = handler.toString();
>>         System.out.println(metadata); //line A
>>     }catch (Exception e){
>>         e.printStackTrace();
>>     }
>> }
>> ————————
>> For the three test files, I would expect line A to print “plain text”. In 
>> fact, it prints the following:
>> X-Parsed-By=org.apache.tika.parser.EmptyParser 
>> Content-Type=image/x-portable-bitmap 
>> X-Parsed-By=org.apache.tika.parser.DefaultParser 
>> X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 
>> Content-Type=audio/mpeg 
>> X-Parsed-By=org.apache.tika.parser.EmptyParser 
>> Content-Type=image/x-portable-bitmap 
>> 
>> And as a result, variable “content” is always empty.
>> 
>> Any suggestions on this please?
>> 
>> Thanks
>> 
> 
> -- 
> Best regards,
> Konstantin Gribov
