Hi Folks,

 I would object to committing such a change to trunk as part of Tika's
default mime type configuration, for the simple reason that CSV is, by
definition, a "comma-separated values" file, not a "plain text" file. We
need to maintain standardization and consistency of mime types within Tika,
or else our default mime database will be of little general value and will
only reflect the domains in which we as developers are using the framework.
Defining *.csv as a glob pattern for text/plain may be a good band-aid, but
IMHO it's not the right solution in general.
 
 The most general solution would be to create a "text/csv" mime type (or
alternatively "text/comma-separated-values"), and have it subclass
text/plain, following the Internet-Draft linked here:

 http://www.shaftek.org/publications/drafts/mime-csv/draft-shafranovich-mime-csv-00.html


 We could do this by adding the "text/csv" mime type, and then adding the
following line underneath it:

         <sub-class-of type="text/plain"/>

 Then we could add the .csv glob pattern to that same mime type definition.
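
 Put together, the entry in tika-mimetypes.xml might look roughly like the
sketch below. This is untested; it assumes the mime type parser accepts a
sub-class-of element inside mime-type, alongside the glob elements already
used for text/plain:

         <!-- Sketch only: assumes sub-class-of is accepted here -->
         <mime-type type="text/csv">
           <sub-class-of type="text/plain"/>
           <glob pattern="*.csv"/>
         </mime-type>

 That way .csv files get their own type, while anything that handles
text/plain could still treat them as plain text through the subclass
relationship.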

Cheers,
  Chris



On 10/19/07 8:10 AM, "Keith R. Bennett" <[EMAIL PROTECTED]> wrote:

> 
> Guys -
> 
> These are great ideas, and I look forward to having those features.
> 
> As a short term marginal improvement though, until we can do the full
> solution, would it make sense to consider .csv plain text?  I'm assuming
> that's just a matter of adding to tika-mimetypes.xml a line:
> 
> <glob pattern="*.csv" />
> 
> to:
> 
> <mime-type type="text/plain">
> <magic priority="50">
> <match value="This is TeX," type="string" offset="0" />
> <match value="This is METAFONT," type="string" offset="0" />
> </magic>
> <glob pattern="*.txt" />
> <glob pattern="*.asc" />
> </mime-type>
> 
> Would doing so cause other problems though?  Should we consider non-.txt and
> non-.asc files binary unless byte header detection reports they're plain
> text?  Perhaps we should wait until that's working instead of adding the
> .csv glob pattern?
> 
> - Keith
> 

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

