Many thanks.

As for Bugzilla, I was unable to create a new bug: it says “first you must pick a product…” and Tika is not in the list.

> On 14 Oct 2015, at 10:40, Konstantin Gribov <gros...@gmail.com> wrote:
>
> This is a result of false-positive MIME type detection. In the first case the
> file starts with "ID3", which is usually present in MP3 (audio/mpeg) files. The
> other two files start with "P1" or "P4", which appear at the start of
> image/x-portable-bitmap files.
>
> You can either use the text parser directly or pass the filename via metadata
> using metadata.set(RESOURCE_NAME_KEY, filename).
>
> Wed, 14 Oct 2015 at 12:08, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>:
> My apologies, here are the testing files attached.
>
>> Begin forwarded message:
>>
>> From: Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
>> Date: 14 October 2015 at 10:06:33 BST
>> To: user@tika.apache.org
>> Subject: AutoDetectParser bug?
>>
>> Hi
>>
>> There might be a bug with the AutoDetectParser, which fails to recognise
>> some plain-text files as plain text.
>>
>> In the attachment are three testing files; as you can see, they are all plain
>> text.
>>
>> The following code is used for my testing:
>>
>> ————————
>> AutoDetectParser parser = new AutoDetectParser();
>> for (File f : new File("/Users/-/work/jate/experiment/bugged_corpus").listFiles()) {
>>     InputStream in = new BufferedInputStream(new FileInputStream(f.toString()));
>>     BodyContentHandler handler = new BodyContentHandler(-1);
>>     Metadata metadata = new Metadata();
>>     try {
>>         parser.parse(in, handler, metadata);
>>         String content = handler.toString();
>>         System.out.println(metadata); //line A
>>     } catch (Exception e) {
>>         e.printStackTrace();
>>     }
>> }
>> ————————
>> For the three testing files, I would expect line A to print “plain text”; in
>> fact, it prints the following:
>>
>> X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap
>> X-Parsed-By=org.apache.tika.parser.DefaultParser X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 Content-Type=audio/mpeg
>> X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap
>>
>> And as a result, the variable “content” is always empty.
>>
>> Any suggestions on this please?
>>
>> Thanks
>>
>
> --
> Best regards,
> Konstantin Gribov
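
To illustrate the false positive described in Konstantin's reply, here is a minimal sketch of a prefix-based ("magic bytes") check written with the Java standard library only. It is a toy and not Tika's actual detection code: the class name MagicSketch, the method names, and the sample sentences are all hypothetical, and real detectors apply more rules than a bare prefix match. It only shows why a plain-text file that happens to begin with "ID3", "P1" or "P4" can be reported as audio/mpeg or image/x-portable-bitmap.

```java
import java.nio.charset.StandardCharsets;

/** Toy prefix-based detector (hypothetical; NOT Tika's real detection logic). */
final class MagicSketch {

    /** Returns true if data begins with the ASCII bytes of prefix. */
    static boolean startsWith(byte[] data, String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.US_ASCII);
        if (data.length < p.length) {
            return false;
        }
        for (int i = 0; i < p.length; i++) {
            if (data[i] != p[i]) {
                return false;
            }
        }
        return true;
    }

    /** Guesses a type from the leading bytes only, falling back to text/plain. */
    static String detectByMagic(byte[] head) {
        if (startsWith(head, "ID3")) {
            return "audio/mpeg"; // prefix mentioned in the thread for MP3 files
        }
        if (startsWith(head, "P1") || startsWith(head, "P4")) {
            return "image/x-portable-bitmap"; // prefixes mentioned in the thread
        }
        return "text/plain";
    }

    public static void main(String[] args) {
        // Hypothetical plain-text contents that collide with the prefixes above.
        String[] samples = {
            "ID3 is a decision tree algorithm.",
            "P4 was a popular processor.",
            "Hello, world."
        };
        for (String s : samples) {
            byte[] head = s.getBytes(StandardCharsets.US_ASCII);
            System.out.println(detectByMagic(head) + " <- " + s);
        }
    }
}
```

Because the decision is made from the leading bytes alone, the content of the rest of the file never gets a say. That is why the two workarounds in the reply (calling the text parser directly, or supplying the file name through the metadata) aim to give the library information beyond those first bytes; how much weight the file name carries depends on the Tika version.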