Hi,

A Charset detector sounds like something generally useful that belongs in Commons IO.

    Path path = Paths.get(...);
    // CharsetDetector is the proposed (not yet existing) Commons IO class
    Charset cs = org.apache.commons.io.CharsetDetector.detect(path);
    org.apache.commons.csv.CSVParser.parse(path, cs, csvFormat);

Thoughts?

Gary

On Mon, Feb 25, 2019 at 10:23 AM Tim Allison <[email protected]> wrote:
> Commons-CSV team,
>
> We recently integrated Commons-CSV into Apache Tika. For now, we’re
> relying strictly on the filename for csv detection, and we’re relying
> on our AutodetectReader to identify the charset. It would be really
> useful for us to be able to detect:
>
> 1) A csv/tsv file vs a regular .txt file by content heuristics
> 2) The parameters: delimiter, escape and quote characters
>
> We realize that no detection will be perfect, but we have two questions:
>
> 1) Do you have any pointers for this kind of thing?
> 2) If we develop it, would you want to put it in commons-csv or should
> we leave it in Tika? I'm not sure, yet, if there'd be a clean/useful
> way to integrate this without using a charset detector...but we can
> hold off on that for now.
>
> Thank you for all of your fantastic work!
>
> Cheers,
>
> Tim
