All,

After doing more research, I found that IANA (http://www.iana.org/assignments/media-types/text/) already lists a text/csv MIME type in addition to text/plain.

That being the case, how about we map all .csv files to text/csv in tika-mimetypes.xml, and include text/csv as one of the MIME types handled by our text parser in tika-config.xml? We could also add the extensions Jukka found to the text/plain MIME type.

Jukka, the reason I thought there might be a problem using the .csv extension to determine that a document was text/plain is that I was concerned there may be files that use the .csv extension but are, in fact, binary. I didn't know whether .csv is understood universally enough as text for us to make that assumption. Given that it has its own MIME type, though, I feel more comfortable about it.

One question, though: does the text/plain MIME type apply to all unformatted text, regardless of language and character set? If so, then when we say that an extension indicates plain text, do we prevent it from ever being parsed as binary? There may be no reliable way to distinguish between text and binary from the header, since non-Latin characters may be represented as byte values indistinguishable from pure binary. In other words, how does MIME type detection from header bytes handle non-Latin characters in unspecified encodings? There's a lot of fuzziness here.

- Keith
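
To make the last question concrete, here is a minimal Java sketch of the kind of header check I have in mind. It is not Tika code: the class name TextSniffSketch and the method looksLikeText are made up for illustration, and the rule itself (reject any control byte other than tab, LF, FF, CR) is only an assumed heuristic, not how any real detector is known to work.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; none of these names come from Tika. */
public class TextSniffSketch {

    /**
     * Assumed naive heuristic: the header "looks like text" only if it has
     * no control bytes other than tab, line feed, form feed, carriage return.
     */
    public static boolean looksLikeText(byte[] header) {
        for (byte b : header) {
            int v = b & 0xFF; // read the byte as an unsigned value
            boolean allowedControl = v == '\t' || v == '\n' || v == '\f' || v == '\r';
            if (v < 0x20 && !allowedControl) {
                return false; // NUL or another control byte: assume binary
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String asciiRow = "name,city\n";
        String greekRow = "\u03B1,\u03B2\n"; // Greek alpha and beta as a CSV row

        // Plain ASCII passes the check.
        System.out.println(looksLikeText(asciiRow.getBytes(StandardCharsets.US_ASCII))); // true

        // The Greek row in UTF-8 uses only bytes >= 0x80 plus ',' and '\n', so it passes.
        System.out.println(looksLikeText(greekRow.getBytes(StandardCharsets.UTF_8))); // true

        // The same row in UTF-16BE starts with byte 0x03 and contains 0x00,
        // so this heuristic wrongly classifies it as binary.
        System.out.println(looksLikeText(greekRow.getBytes(StandardCharsets.UTF_16BE))); // false
    }
}
```

The same characters pass or fail depending only on the encoding, which is the fuzziness I mean: without knowing the charset, a byte-level check cannot tell UTF-16 text from binary data.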
