Hi Folks,

Thinking this through more, it probably makes a lot of sense for the default MIME type in Tika to be application/octet-stream. Essentially, the default says: "this is content for which I could not discern a MIME type". application/octet-stream indicates only that the content is an arbitrary sequence of bytes, and since that is true of any content, it is effectively the supertype of all MIME types.
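To make the fallback concrete, here is a minimal sketch of what I have in mind. It does not use Tika's actual API: the class name, the detect method, and the two magic prefixes it checks are all hypothetical, chosen only to show a detector returning application/octet-stream when nothing else matches.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Tika code: a detector with a catch-all default.
final class DefaultMimeSketch {
    // Returned whenever no more specific type can be discerned.
    static final String DEFAULT_MIME_TYPE = "application/octet-stream";

    // Checks two well-known magic prefixes, then falls back to the default.
    static String detect(byte[] header) {
        if (startsWith(header, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(header, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        return DEFAULT_MIME_TYPE;
    }

    // True if data begins with every byte of prefix, in order.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller that receives DEFAULT_MIME_TYPE knows only that it has bytes, which is exactly the "could not discern a type" meaning described above.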
I like the idea of implementing a UNIX strings-style parser: there was discussion on the Nutch list a year or so ago regarding this same issue.

If there isn't exact consensus on this issue, however, we could always make the default MIME type a settable attribute on the mimeTypeRepository tag in the tika-config.xml file. That way, we could ship Tika with a default MIME type of application/octet-stream, and if that doesn't work for users, they simply update that attribute in their XML file (and perhaps additionally turn on magic detection). That would probably solve the issue.

Thoughts? If you all agree, I'll create an issue about this in JIRA. Come to think of it, I'll probably do that anyway.

Cheers,
Chris

On 10/12/07 6:12 PM, "Keith R. Bennett" <[EMAIL PROTECTED]> wrote:

>
> By the way, strings detects sequences of ASCII characters, so it would not
> work at all in many locales (unless ICU or someone else has figured out how
> to do this).
>
> As long as this limitation is documented, however, I think it would still be
> extremely useful, and trivial to implement.
>
> - Keith

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
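P.S. On the strings-style parser: below is a minimal sketch of the idea, assuming that "printable" means the ASCII range 0x20 to 0x7E and that the caller supplies the minimum run length. The class and method names are hypothetical, and this is not a port of the real strings utility. It also has exactly the limitation Keith points out: non-ASCII text is treated as a break between runs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a strings-style extractor; ASCII only.
final class AsciiStringsSketch {
    // Returns every run of printable ASCII bytes at least minLength long.
    static List<String> extract(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            // Java bytes are signed, so values above 0x7F are negative
            // and fail the lower bound check.
            if (b >= 0x20 && b <= 0x7E) {
                current.append((char) b);
            } else {
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Flush a run that reaches the end of the input.
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }
}
```

Calling extract(bytes, 4) mirrors the default minimum length of GNU strings; a smaller value would pick up more noise from binary content.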
