On 2014/07/18 16:53, Jonathan Leslie wrote:
I have a similar situation about to happen. I also have the issue where one of the data fields is a free-text section that the user fills in, and of course he is free to put commas in this field. I believe this will result in higgly-piggly (that's a technical term) when I do the import. What I do is never treat a CSV file as a "Comma Separated Values" file but rather as a "Character Separated Values" file, using the character º (<alt>0186), a legal character but not one on the keyboard, as the separator. You can set Excel to do this automatically, and it makes life a whole lot easier. I believe that with sqlite the >.separator "º"< directive will allow the import to work correctly and keep the commas in place in the data fields.

Hi Jonathan, while this is a nice solution, it is not a perfect one. What if a user does enter that character? (Even if it is not directly linked to a keyboard key, it can happen.) The correct way is simply to use quoting, which preserves the text exactly, whatever characters it contains. A correct CSV file can hold any data (most readers would have difficulty with Chr(0) values, but that is an implementation issue). More importantly, Excel, OpenOffice Calc etc. all know exactly how to export this correctly and without the higgly-pigglies. The only place I have seen confusion is where you export in a country that uses the semicolon as separator (yes, it happens, mostly in the Netherlands) and then try to read in another country where the locale settings differ (mostly comma).
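To illustrate the point, here is a minimal sketch using Python's standard csv module (the row data is made up): quoting is applied as needed, so commas, quotes and even the º character survive a round trip unchanged.

```python
import csv
import io

# Fields containing the separator, quotes, and the "º" character.
row = ["plain", "has, commas", 'has "quotes"', "has º too"]

# Write one CSV record to an in-memory buffer.
buf = io.StringIO()
csv.writer(buf).writerow(row)

# Read it back with a standards-aware reader.
read_back = next(csv.reader(io.StringIO(buf.getvalue())))
print(read_back == row)  # the data comes back unchanged
```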

Mostly, the problems people experience come from a home-brew CSV importer that does not know how to correctly read the output of a standards-based exporter such as Excel, after which they try to change things like the separator or quoting method to "fix" it after the fact.

The method described by the RFC for CSV (RFC 4180) is quite ingenious, so let me explain a simple way of making CSV files.

The important bits to remember are:

1 - Header: the first line of the CSV should be the header and contain the field names, unless you do not have those or they are implicit (such as in a log-type listing or an import into an existing structure) - but any standard importer will expect row 1 to contain field identifiers if they are not already known.
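As a small sketch of the header convention (the field names and rows here are made up): csv.DictReader in Python takes row 1 as the field identifiers and maps every subsequent row to them automatically.

```python
import csv
import io

# Row 1 is the header; the reader uses it as the field names.
data = "name,age,city\nAlice,30,Paris\nBob,25,Berlin\n"
records = list(csv.DictReader(io.StringIO(data)))

print(records[0]["city"])  # fields are now addressable by name
```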

2 - Every row must be separated by whichever line-separation character(s) the operating system uses - this sometimes causes problems. Carriage-Return/Linefeed (CR/LF) pairs (0x0D+0x0A) are mostly used on Windows; *nix environments prefer just the Linefeed (LF) character (0x0A). The Windows way allows other linefeeds or carriage returns inside the text without affecting line termination, and the *nix method is lighter in data terms, so both have advantages. Importantly, a line separator may appear inside a set of quotes, which makes it simply part of the value and not an indication of line termination. This is the biggest problem I have seen with homegrown importers: they try to read the file line by line, which breaks badly when a field contains such a CR/LF.
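The line-by-line failure mode is easy to demonstrate. In this sketch (with made-up data), a quoted field contains a real CR/LF: a naive splitlines() approach sees three "lines", while a standards-aware reader correctly sees one header row plus one record.

```python
import csv
import io

# One record whose second field contains an embedded CR/LF.
data = 'id,note\r\n1,"first line\r\nsecond line"\r\n'

naive = data.splitlines()                    # 3 "lines" - the record is torn apart
records = list(csv.reader(io.StringIO(data)))

print(len(naive))        # the naive reader miscounts
print(len(records))      # 2: header + one data record
print("second line" in records[1][1])  # the embedded newline stayed in the value
```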

3 - If a field starts with a quote (after any leading/trailing white-space is stripped), any and all characters up to the next quote are part of the field value, even if they are separators (; or , or CR or LF) or white-space characters (SPACE/TAB/VTAB/etc. - these are stripped when leading or trailing a field value, unless quoted). When we reach the next quote (typically " or 0x22, though implementations may differ), we check whether it is immediately followed by another quote: if so, we add a literal quote character to the value and continue; if not, the field ends. (Note: any text after the closing quote of a quoted field and before the next field separator ( ; or , ) is ignored; one could even use this quirk to add comments or directives.)
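Rule 3 in action, as a small sketch (the record content is made up): a doubled quote inside a quoted field becomes a single literal quote, and the comma inside the quotes is just data, not a separator.

```python
import csv
import io

# Quoted field with an embedded separator and doubled quotes.
line = '"He said ""hi"", then left",next-field\r\n'
fields = next(csv.reader(io.StringIO(line)))

print(fields[0])  # the doubled quotes collapse to literal quotes
print(fields[1])  # the comma inside the quotes did not split the field
```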

4 - When writing fields: if a field contains any quote characters, any separators, or leading/trailing white-space that must not be discarded, it must be quoted. The best method for quoting a string into a valid CSV field is the following pseudocode:

QuoteChar = Char(0x22)                        // or whatever is used for quoting in your system
for i = last index of value down to first:    // go downward so inserting does not shift the indices still to visit
  if value[i] = QuoteChar then
    insert QuoteChar into value at position i // duplicate any existing quotes
OutputFieldValue = QuoteChar + value + QuoteChar  // surround the result with quotes
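The pseudocode above renders directly into Python; this sketch keeps the same backward loop (the function name quote_field is my own):

```python
QUOTE = '"'  # 0x22, the usual quote character

def quote_field(value: str) -> str:
    chars = list(value)
    # Walk backwards so inserting does not shift the indices still to visit.
    for i in range(len(chars) - 1, -1, -1):
        if chars[i] == QUOTE:
            chars.insert(i, QUOTE)            # duplicate any existing quote
    return QUOTE + "".join(chars) + QUOTE     # surround the result with quotes

print(quote_field('say "hi", world'))  # "say ""hi"", world"
```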


This way, whatever appears in your fields does not matter: it will be understood by any standards-conformant CSV reader. Note that quotes may be added regardless of content, but it is preferable not to quote where it is not needed, since unnecessary quoting can grow a large file significantly, especially one with many fields per row, and more so if many fields are short or empty.
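The size difference is easy to check. In this sketch (with made-up rows), Python's QUOTE_MINIMAL quotes only the fields that need it, while QUOTE_ALL quotes everything and produces a strictly larger file.

```python
import csv
import io

rows = [["id", "note"], ["1", ""], ["2", "a, b"]]

def dump(quoting):
    """Serialize the rows with the given quoting policy."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerows(rows)
    return buf.getvalue()

minimal = dump(csv.QUOTE_MINIMAL)  # quotes only "a, b", which contains a comma
always = dump(csv.QUOTE_ALL)       # quotes every field, empty ones included
print(len(minimal) < len(always))  # minimal quoting keeps the file smaller
```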

The sqlite3.exe shell (or its Linux counterpart) is tested to understand correctly formatted CSV as used in most places. I am not sure whether it recognizes the OS preference for ; vs. , separators, but that is an easy test. If your data gets mangled, it is probably because the input was not well formed, and I hope the above helps to fix that.
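As a sketch of what a correct import amounts to (the shell's .mode csv / .import does the equivalent for you; the table name and data below are made up), using Python's built-in csv and sqlite3 modules:

```python
import csv
import io
import sqlite3

# CSV with a comma inside a quoted field, as a proper exporter would write it.
data = 'name,comment\r\nJonathan,"commas, allowed here"\r\n'
rows = list(csv.reader(io.StringIO(data)))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT, comment TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", rows[1:])  # skip the header row

print(con.execute("SELECT comment FROM t").fetchone()[0])  # comma kept intact
```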

Cheers,
Ryan


_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
