Hi Folks,

I would object to committing such a change to trunk as part of the default mime type configuration in Tika, for the simple reason that, by definition, CSV is a "comma separated value" file, not a "plain text" file. We need to maintain standardization and consistency of mime types within Tika, or else our default mime database will be of little general value and will be specific to the domains in which we as developers are using the framework. Defining it as a glob pattern for text/plain may be a good band-aid, but IMHO it's not the solution here in general. The most general solution would be to create a "text/csv" mime type (or alternatively "text/comma-separated-values"), and have it subclass text/plain, according to the RFC shown here:

http://www.shaftek.org/publications/drafts/mime-csv/draft-shafranovich-mime-csv-00.html

We could do this by adding the "text/csv" mime type, and then adding the following line underneath it:

  <sub-class-of type="text/plain"/>

Then, in the mime definition as well, we could add the .csv glob pattern.

Cheers,
  Chris

On 10/19/07 8:10 AM, "Keith R. Bennett" <[EMAIL PROTECTED]> wrote:

>
> Guys -
>
> These are great ideas, and I look forward to having those features.
>
> As a short term marginal improvement though, until we can do the full
> solution, would it make sense to consider .csv plain text? I'm assuming
> that's just a matter of adding to tika-mimetypes.xml a line:
>
>   <glob pattern="*.csv" />
>
> to:
>
>   <mime-type type="text/plain">
>     <magic priority="50">
>       <match value="This is TeX," type="string" offset="0" />
>       <match value="This is METAFONT," type="string" offset="0" />
>     </magic>
>     <glob pattern="*.txt" />
>     <glob pattern="*.asc" />
>   </mime-type>
>
> Would doing so cause other problems though? Should we consider non-.txt and
> non-.asc files binary unless byte header detection reports they're plain
> text? Perhaps we should wait until that's working instead of adding the
> .csv glob pattern?
>
> - Keith
>

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop: 171-246
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
