Raised https://issues.apache.org/jira/browse/TIKA-391 and provided a Tika 0.6 based fix. A full fix might involve more, as the issue can apply to any method that uses the results from getMimeType.
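
A minimal Java sketch of the idea behind the fix, for illustration only. It is not the patch on TIKA-391 and does not use Tika's classes: `HintedDetection`, `Candidate`, `allMatches` and `pick` are hypothetical names, and each candidate is assumed to carry a single file extension and a single magic prefix. It collects every candidate whose magic matches, then lets the filename and content-type hints choose among them, so the result no longer depends on the relative order of candidates that compare as equal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class HintedDetection {

    /** Hypothetical stand-in for a type definition: MIME type, one extension, one magic prefix. */
    static final class Candidate {
        final String mimeType;
        final String extension;
        final byte[] magic;

        Candidate(String mimeType, String extension, byte[] magic) {
            this.mimeType = mimeType;
            this.extension = extension;
            this.magic = magic;
        }

        /** True if the header starts with this candidate's magic bytes. */
        boolean magicMatches(byte[] header) {
            if (header.length < magic.length) {
                return false;
            }
            for (int i = 0; i < magic.length; i++) {
                if (header[i] != magic[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Returns every candidate whose magic matches, instead of stopping at the first hit. */
    static List<Candidate> allMatches(List<Candidate> candidates, byte[] header) {
        List<Candidate> matches = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.magicMatches(header)) {
                matches.add(c);
            }
        }
        return matches;
    }

    /** Picks the first match confirmed by either hint; otherwise falls back to the first match. */
    static String pick(List<Candidate> matches, String fileName, String contentTypeHint) {
        for (Candidate c : matches) {
            boolean nameHit = fileName != null
                    && fileName.toLowerCase(Locale.ROOT).endsWith(c.extension);
            boolean typeHit = c.mimeType.equals(contentTypeHint);
            if (nameHit || typeHit) {
                return c.mimeType;
            }
        }
        return matches.isEmpty() ? "application/octet-stream" : matches.get(0).mimeType;
    }

    public static void main(String[] args) {
        // OLE2 compound file signature, shared by .doc and .xls files.
        byte[] ole = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                      (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
        List<Candidate> candidates = Arrays.asList(
                new Candidate("application/msword", ".doc", ole),
                new Candidate("application/vnd.ms-excel", ".xls", ole));

        List<Candidate> matches = allMatches(candidates, ole);
        System.out.println(pick(matches, "testEXCEL.xls", null)); // prints application/vnd.ms-excel
    }
}
```

With no usable hint the sketch still falls back to the first match, so it only removes the randomness when at least one hint is supplied.
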
Simon

On 23/03/2010 13:13, "Mattmann, Chris A (388J)" <[email protected]> wrote:

> Hi Simon,
>
> Can you prepare a patch, and post it to JIRA? I'll happily take a look.
>
> Thanks,
> Chris
>
>
> On 3/23/10 3:43 AM, "Simon Tyler" <[email protected]> wrote:
>
> I have had a further look at the nature of the failure to detect the type of
> the particular file and still feel it is a bug.
>
> This is an excel (.xls) spreadsheet and I give the detector the correct
> filename and correct content content type for it. The detector still fails
> to identify it correctly sometimes.
>
> I had a look at the code and the reason is now clear to me and is easily
> fixed.
>
> The getMimeType method searches for a magic match and stops at the first
> hit. The search is ordered (based on priority, size and clause). This
> particular file matches two detectors (word and excel) which compare
> identically - this means the order of them in the SortedSet is undefined,
> this is the cause of the problem.
>
> A fix is for getMimeType to return the complete set of matches rather than a
> single match and then to use the filename and content-type hints on each
> match returning the first that matches either. I have modified the code to
> do this and it solves the problem. The hint matching could be improved
> further if necessary so that it picks the best match from the set based on
> both hints rather than just stopping at the first.
>
> Simon
>
>
> On 18/03/2010 19:16, "Alex Ott" <[email protected]> wrote:
>
>> Re
>>
>> Ken Krugler at "Thu, 18 Mar 2010 12:07:14 -0700" wrote:
>> KK> Thanks, Alex - great input.
>>
>> KK> We'd run into similar problems at Krugle, with determining the correct
>> mime-type for
>> KK> source code. Sometimes you wind up needing to parse the code to make
>> the
>> correct choice.
>>
>> KK> We had extended the Nutch mime-type detector to support both regex and
>> post-processing to
>> KK> handle this disambiguation.
>>
>> KK> But that was hard-coded for a handful of known edge cases.
>>
>> KK> One possible way for this to work with the current XML-based mime-type
>> definitions is to
>> KK> have a "here's the name of the class you'll have to instantiate and run
>> to make the final
>> KK> call"
>>
>> Yes - I have something like in my own media type detector (for data leak
>> prevention) - when signature (either CFBF or Zip) is found, then
>> corresponding code is called, that return constant, that correspond to some
>> type (I need to implement logic inside my own code, because sometimes rules
>> are to complex to express them in simplier rules). At the end I have
>> something like:
>>
>> if CFBF Signature then get type from CFBF and if type == NNN then mimetype =
>> word/excel/...
>>
>> But i have special lisp-like language to describe complex checks...
>>
>> KK> -- Ken
>>
>> KK> On Mar 18, 2010, at 11:21am, Alex Ott wrote:
>>
>>>>
>>>> I'm not sure, that this is actual for Tika, but I looked into its mime
>>>> database and see problem in definitions - both types uses common OLE (MS
>>>> CFBF - Microsoft Compound File Binary Format) signature, that also used by
>>>> dozens of file formats. To perform correct mime type detection of CFBF
>>>> files, you need to analyze it (with POI?) and detect which objects are
>>>> located at top-directory (directly under Root Directory entry) of the OLE
>>>> file. For word this is object WordDocument, while for Excel this is
>>>> Workbook or Book. Simple search for corresponding names will not help,
>>>> because all these objects could be embedded into other documents via OLE.
>>>>
>>>> Other details you can find in official Microsoft Documentation
>>>>
>>>> Simon Tyler at "Thu, 18 Mar 2010 18:12:16 +0000" wrote:
>>>> ST> Hi,
>>>>
>>>> ST> I haven't seen any responses to this. Does anyone know why I should be
>>>> ST> seeing such unpredictable behaviour?
>>>>
>>>> ST> Simon
>>>>
>>>> ST> On 15/03/2010 09:27, "Simon Tyler" <[email protected]> wrote:
>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am doing some testing of Tika 0.6 and noticed some odd results for the
>>>>>> testEXCEL.xls file included in the test suite.
>>>>>>
>>>>>> 100 calls to the following code:
>>>>>>
>>>>>> is = new BufferedInputStream(new FileInputStream(filename));
>>>>>>
>>>>>> Metadata metadata = new Metadata();
>>>>>> metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
>>>>>>
>>>>>> String type = tika.detect(is, metadata);
>>>>>>
>>>>>> Results in different matches as application/msword or
>>>>>> application/vnd.ms-excel seemingly at random.
>>>>>>
>>>>>> Is this expected? Is there a way to mitigate it?
>>>>>>
>>>>>> Simon
>>>>>>
>>>>
>>>> --
>>>> With best wishes, Alex Ott, MBA
>>>> http://alexott.blogspot.com/ http://alexott.net/
>>>> http://alexott-ru.blogspot.com/
>>
>> KK> --------------------------------------------
>> KK> Ken Krugler
>> KK> +1 530-210-6378
>> KK> http://bixolabs.com
>> KK> e l a s t i c w e b m i n i n g
>>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
