Re: AutoDetectParser bug?

2015-10-14 Thread Ziqi Zhang
Thanks
I have created an issue.

metadata.set(RESOURCE_NAME_KEY, filename) did not work either. For now I am 
explicitly telling the parser that the files are plain text. It would be really 
nice to have this addressed, though, because I would like to use the auto-detect 
ability in my app.

regards
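
One possible guard for the interim workaround described above: before forcing 
the plain-text parser on a file, an app could first check that the bytes really 
decode as text. This is only a sketch under assumptions. The class and method 
names are hypothetical, the corpus is assumed to be UTF-8, and the Tika calls 
themselves are left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class PlainTextGuard {

    /**
     * Hypothetical helper: returns true if the bytes decode strictly as UTF-8
     * and contain no control characters other than tab, LF and CR.
     */
    public static boolean looksLikeUtf8Text(byte[] data) {
        // REPORT makes the decoder throw instead of silently replacing bad bytes.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            CharSequence text = decoder.decode(ByteBuffer.wrap(data));
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                    return false; // binary-looking control character
                }
            }
            return true;
        } catch (CharacterCodingException e) {
            return false; // not valid UTF-8
        }
    }

    public static void main(String[] args) {
        byte[] sample = "ID3 is a decision tree algorithm.\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeUtf8Text(sample));
    }
}
```

Files that pass the check could be handed to the text parser directly, while 
everything else could still go through auto-detection. The whole file content 
should be passed in, because a multi-byte character cut in half makes the 
strict decoder reject the input.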




> On 14 Oct 2015, at 11:11, Nick Burch <apa...@gagravarr.org> wrote:
> 
> On Wed, 14 Oct 2015, Ziqi Zhang wrote:
>> As for Bugzilla, I was unable to create a new bug: it says “first you must 
>> pick a product…” and Tika is not in the list.
> 
> Sorry, wrong project - POI uses Bugzilla, Tika uses JIRA, I wasn't paying 
> enough attention!
> 
> The starting point for reporting the bug is:
>   https://issues.apache.org/jira/browse/TIKA
> 
> Nick



AutoDetectParser bug?

2015-10-14 Thread Ziqi Zhang
Hi

There might be a bug with the AutoDetectParser, which fails to recognise some 
plain-text files as plain text.

Attached are three test files; as you can see, they are all plain text.

The following code is used for my testing:


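import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

AutoDetectParser parser = new AutoDetectParser();
for (File f : new File("/Users/-/work/jate/experiment/bugged_corpus").listFiles()) {
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
    Metadata metadata = new Metadata();
    // try-with-resources closes the stream even if parsing fails
    try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
        parser.parse(in, handler, metadata);
        String content = handler.toString();
        System.out.println(metadata); // line A
    } catch (Exception e) {
        e.printStackTrace();
    }
}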

For the three test files, I would expect line A to report plain text; instead, 
it prints the following:
X-Parsed-By=org.apache.tika.parser.EmptyParser 
Content-Type=image/x-portable-bitmap 
X-Parsed-By=org.apache.tika.parser.DefaultParser 
X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 
Content-Type=audio/mpeg 
X-Parsed-By=org.apache.tika.parser.EmptyParser 
Content-Type=image/x-portable-bitmap 

As a result, the variable “content” is always empty.

Any suggestions on this please?

Thanks



Re: AutoDetectParser bug?

2015-10-14 Thread Ziqi Zhang
Many thanks

As for Bugzilla, I was unable to create a new bug: it says “first you must pick 
a product…” and Tika is not in the list.



> On 14 Oct 2015, at 10:40, Konstantin Gribov <gros...@gmail.com> wrote:
> 
> This is the result of a false-positive MIME type detection. In the first case 
> the file starts with "ID3", which is usually present at the start of MP3 
> (audio/mpeg) files. The other two files start with "P1" or "P4", which appear 
> at the start of image/x-portable-bitmap files.
> 
> You can either use the text parser directly or pass the filename via metadata 
> using metadata.set(RESOURCE_NAME_KEY, filename).
> 
> On Wed, 14 Oct 2015 at 12:08, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk> wrote:
> My apologies, here are the testing files attached.
> 
> 
> 
>> Begin forwarded message:
>> 
>> From: Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
>> Date: 14 October 2015 at 10:06:33 BST
>> To: user@tika.apache.org
>> Subject: AutoDetectParser bug?
> 
>> 
> 
> -- 
> Best regards,
> Konstantin Gribov
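
To illustrate the explanation quoted above, here is a deliberately tiny 
stand-in for magic-byte detection. It is not Tika's actual detector. The class 
name, the three-entry table, the text/plain fallback and the sample strings are 
all assumptions made up for illustration; only the prefixes "ID3", "P1" and 
"P4" and the two MIME types come from this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class MagicPrefixDemo {

    // Hypothetical, heavily simplified magic table (prefix -> MIME type).
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("ID3", "audio/mpeg");
        MAGIC.put("P1", "image/x-portable-bitmap");
        MAGIC.put("P4", "image/x-portable-bitmap");
    }

    /** Returns the type of the first matching prefix, else text/plain. */
    public static String detect(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost.
        String prefix = new String(head, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> entry : MAGIC.entrySet()) {
            if (prefix.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        String[] samples = {
            "ID3 is a decision tree algorithm.",
            "P1 and P4 are the two parameters.",
            "Plain prose without a magic prefix."
        };
        for (String sample : samples) {
            byte[] head = sample.getBytes(StandardCharsets.US_ASCII);
            System.out.println(detect(head) + " <- " + sample);
        }
    }
}
```

A detector that looks only at leading bytes has no way to tell that the first 
two samples are ordinary sentences, which is the kind of false positive 
described in this thread.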