On 26/09/2009, at 12:00 AM, Jay A. Kreibich wrote:

> On Fri, Sep 25, 2009 at 10:24:15AM +1000, BareFeet scratched on the
> wall:
>
>> In reality, in the thousands of CSV files I've dealt with
>> over the years, they all follow the same standard:
>>
>> 1. Rows are delimited by a new line (return and/or line feed).
>> 2. Columns are delimited by a comma.
>> 3. "Quoted text" is treated as one value item, including any commas
>> or new lines within it.
>> 4. A double quote "" is used to put a quote within quotes.
>>
>> That's it.
>
> This is more or less the standard put forth by RFC 4180. And if
> this is all you've encountered
Yes.

> you're not using very many different applications

I guess "many" is subjective, but I would say I have used many, including Tandem mainframe outputs, SQL Server outputs, various product list suppliers, Excel exports, electron microscope data, etc.

I don't doubt that you've dealt with some strange CSV files, but I think the number of files classed as "strange" shrinks once the CSV format is defined more precisely. Please understand that I am not arguing that CSV should be used. But the fact is that CSV is a commonly used data export/import format (I'd say around 50%), and therefore supporting it is a significant requirement.

> or you're primarily dealing with numbers and simple strings that
> don't contain quotes or commas. CSV works very very well if you
> never get into the question of escapes, but details count.

As I mentioned previously in points 3 and 4 above, I do deal with strings that contain quotes and commas. Dealing with them is well defined in practical use (or at least in my experience of it).

> Reading the RFC only proves my point. SQLite v3 is older than that
> doc, and it pretty much admits the cat was out of the bag a long
> time ago. There are a ton of optional and might/may/could sections

I accept that some earlier CSV implementations varied and that the RFC may have been in catch-up mode. But I put that down to a delay in setting the standard and, thankfully, not so much a problem in everyday use, since most (if not all) major current implementations follow the specification I outlined above (the four-point summary).

> the format they define has a lot of holes in it

I don't think there are holes, within the scope of the data with which it deals. For instance, CSV doesn't deal with blob data, pictures, related tables, or column type (even the distinction between a numeric and a text value), and therefore has no definition for them.

> (i.e. headers, or no headers?).

Again, as I understand it, headers are outside of the CSV scope.
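For what it's worth, the four rules quoted at the top are enough to turn a file into a plain array of rows; a header line, if present, simply comes out as the first row. Here is a minimal hand-rolled sketch of such a parser (hypothetical illustration code, not the SQLite shell's implementation):

```python
def parse_csv(text):
    """Parse four-rule CSV text into a list of rows (lists of strings)."""
    rows, row, buf = [], [], []
    i, n = 0, len(text)
    in_quotes = False
    while i < n:
        c = text[i]
        if in_quotes:
            if c == '"':
                if i + 1 < n and text[i + 1] == '"':  # rule 4: "" is a literal quote
                    buf.append('"')
                    i += 1
                else:
                    in_quotes = False                 # closing quote
            else:
                buf.append(c)                         # rule 3: commas/newlines kept
        elif c == '"':
            in_quotes = True                          # rule 3: quoted value starts
        elif c == ',':                                # rule 2: column delimiter
            row.append(''.join(buf)); buf = []
        elif c in '\r\n':                             # rule 1: row delimiter
            if c == '\r' and i + 1 < n and text[i + 1] == '\n':
                i += 1                                # treat CRLF as one delimiter
            row.append(''.join(buf)); buf = []
            rows.append(row); row = []
        else:
            buf.append(c)
        i += 1
    if buf or row:                                    # flush a final unterminated row
        row.append(''.join(buf))
        rows.append(row)
    return rows
```

For example, `parse_csv('a,"b,""c""\nd",e\nf,g')` yields two rows, with the embedded comma, quote, and newline preserved inside the second cell of the first row.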
CSV only defines an array of cells of string data, not column definitions. If headers are included within CSV, they are just another row in the array. CSV doesn't know what they are, but an importer should import them into the array. After importing the raw data into an array, the program can then interpret the data in ways that make sense to it. If the program understands headers, then it should look for them in the array. If it understands column types, it should apply them, etc.

Since we're coming from the SQLite perspective, data means more to us than what CSV defines. Concepts such as column type and cell type are everyday definitions for SQLite but are meaningless to CSV. In most cases, we pre-define a table within SQLite specifically for receiving data from a particular CSV data file. CSV defines just the array of data.

>> Everything I've seen uses this.
>
> According to the RFC Excel doesn't use double-quotes for anything.
> You might not care about Excel, but I'm willing to bet it is one of
> the most-- if not the most-- common exporters of CSV. The question
> of getting data from Excel into SQLite shows up on the list every
> now and then.

Yes, Excel is a major consideration and I have to deal with data from Excel exporters often. In my experience, though, the specification I outlined above handles Excel exports fine. I can't comment on the RFC, which may or may not match common use.

>> Some don't need delimiters in values, so they don't need quotes,
>> but the encompassing specification works for all cases.
>
> No, it doesn't. Working on a large website that provided CSV
> exports for a number of data sources, I've seen plenty of examples
> that don't work.
> Finding a common format that could be exported
> into a handful of common desktop apps was so pointless we seriously
> considered getting rid of CSV altogether, because we got tired of
> our users telling us how simple CSV was, and why couldn't we just do
> this one thing differently so it would work on *their* application.

This surprises me. Did you cater for escapes as mentioned in points 3 and 4 above? You mentioned:

> CSV works very very well if you never get into the question of escapes

So I wonder if you did. Also, perhaps the users were asking for specific data within the CSV, such as header names or column types. This doesn't mean that the CSV format is ambiguous, just that they wanted specific information included in the array.

>> It's not that big a deal for SQLite to support it, so it should.
>
> If it is so simple, and you know where the code is...

There are numerous code subroutines around to deal with this, in various languages. I did a Google search the other day for one to incorporate into my Objective-C program, pasted it in, and was instantly able to import all of my usual CSV data sources into an array. I then loop through the array to interpret the cells in a way that makes sense to my program. CSV just defines an array of data. My program can then interpret it however makes sense within its own context.

> after all, it is such a "simple" format-- and find themselves in a
> mess of code soon enough.

Seriously, give it a try. The code is fairly straightforward. I've done it a few times.

> Carlos's Python script (nice!) is a great example. His comment "I
> am so grateful I did not have to write a parser for CSV" is dead on.

I would also rather use existing code than roll my own.

> And, as he points out, the reason the Python module is so good is
> that it is adaptive

There's nothing wrong with adaptive. But the CSV-to-array importer only has to deal with the four spec points above. How that array is used is a second step, after the file has been read.
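To illustrate the two-step model (read the array first, interpret it second), here is a short sketch using Python's standard csv module, whose defaults match the four rules above (comma delimiter, "..." quoting, "" as the embedded-quote escape). The sample data is made up for illustration:

```python
import csv
import io

# Hypothetical CSV with an embedded comma and an embedded quote.
data = 'name,notes\n"Smith, John","He said ""hi""."\n'

# Step 1: read the file into a plain array of string cells.
rows = list(csv.reader(io.StringIO(data)))

# Step 2: interpretation belongs to the program, not to CSV.
# If this program expects a header line, it is simply row 0.
header, body = rows[0], rows[1:]
```

After step 1, `rows` contains `[['name', 'notes'], ['Smith, John', 'He said "hi".']]`: the comma and quotes inside the quoted cells survive intact, and the header is just another row until the program decides otherwise.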
> and really reads five or six different variants of CSV (something a
> reader can do but a writer cannot).

What are the variants? You mentioned with or without escapes, but both are within the spec. You mentioned headers, but they are within the spec too. Are you talking about non-standard escapes like \" instead of "" for quotes within quotes? That would be incorrect CSV. Or are you talking about semi-colons instead of commas?

> that advanced support of this kind of thing is really outside of the
> scope of SQLite3. After all, the .import command is part of the
> shell, not part of the core library.

Given the number of times that CSV import has come up on this list, it is obviously important. Since .import already exists in SQLite, it might as well embrace the spec properly, including delimiters within quotes. The code is trivial compared to SQLite in general.

> CSV is a great quick and dirty format to move data. But it isn't
> "simple" and it isn't nearly as universal as many assume.

As above, I think it is pretty simple. The complexity arrives when people expect CSV to provide more than a simple array of strings.

> It works great if you're just moving simple numbers and strings

Exactly.

> that don't include commas, but becomes a mess when you get into
> exceptions.

No, it caters for commas. If your CSV implementation doesn't, then it isn't doing full CSV.

Tom
BareFeet

--
Comparison of SQLite GUI applications:
http://www.tandb.com.au/sqlite/compare/

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users