Re: [sqlite] Importing a cvs file

RSmith Fri, 18 Jul 2014 08:50:09 -0700


On 2014/07/18 16:53, Jonathan Leslie wrote:

I have a similar situation about to happen.  I also have the issue where one of the data fields is a text section that 
the user fills in and of course, he is free to put commas in this field.   I believe that this will result in 
higgly-piggly (that's a technical term) when I do the import.   What I do is I never use a CSV file as a "Comma 
separated Values" file but rather as a "Character separated Values"  file, but rather use a Character, º 
(<alt>0186), a legal character but not one on the keyboard, as the separator character.  You can set Excel to do 
this automatically, and it makes life a whole lot easier.   I believe with sqlite the >.separator "º"< 
directive will allow the import correctly, and keep the commas in place in the data fields.

Hi Jonathan, while this is a nice solution, it is not a perfect one. What if a user does enter that character? (Even if not directlylinked to a keyboard key, it can happen). The very correct way to do it is to simply use quoting, which preserves the textspecifically, whatever characters were used. A correct CSV format can hold any data (though most readers would have difficultyreading Chr(0) values, but that is an implementation issue). More importantly, Excel, Open Office Calc etc, all knows exactly how toexport this correctly and without the higgly-pigglies. The only place I have seen confusion is where you export it in a country thatuses semi-colon as a separator (yes it happens, mostly in the Netherlands) and then try to read in another country where thelocality values are different (comma mostly).

Mostly the problems experienced by people is that they make some home-brew CSV importer that does not realise how to correctly readoutput from a standards-based exporter such as Excel, and then try to change things like separation or quoting methods to "fix" itafter the fact.

The method described by the RFC for CSV is quite genius ( I think RFC2048, but not sure now and not important, but google will knowif you need it) so let me explain a simple way of making csv files.


The Important bits to remember is:

1 - Header, first line of the CSV must be the header, must contain field names, unless you do not have those or they are implicit,such as in a log-type listing or import to existing structure- but any standard importer will expect row 1 to containfield-identifiers if previously unknown.

2 - Every row must be separated by whichever Line-separation character(s) used by the operating system - this sometimes causesproblems. Carriage-Return-Linefeed (aka CR/LF) pairs (0x0D+0x0A) are mostly used in Windows, *nix environments prefer to use justthe Linefeed (LF) character (0x0A). The Windows-way allows other linefeeds or carriage returns to be inside the text and not affectline-termination, and the *nix method is more lightweight in data terms, so both have advantages. Important to note that aline-separator may appear inside a set of quotes which makes it simply part of the value, and not indication of line termination -this is the biggest problem with homegrown importers I've seen, they all try read the file line-by-line which breaks badly when afield contains such a CR/LF.

3 - if a field starts with a Quote (after any leading/trailing white-space was stripped) any and all characters up to the next quoteis part of the field value, even if they are separators (; or , or CR or LF) or white-space characters (such as SPACE/TAB/VTAB/etc.- these are stripped when leading/trailing field values, unless in quotes) and when we come to the next Quote (typically " or 0x22,but may differ between implementations), we check whether it's immediately followed by another quote, which means simply add theactual quote char and continue, or not, which means the end of the field. (Note: Any text after the ending quote in a quoted fieldand before the next field separator ( ; or , ) is ignored, one can use this quirk to add some comments or directives even).

4 - If you are writing fields, if your field contains any Quote characters, any separators or leading/trailing white-spacecharacters that must not be discarded, it must be quoted. The best method for quoting a string to be valid CSV field data is likethis pseudo code:


QuoteChar = Char(0x22); // Or whatever is used for quoting in your system
for i = last index of value down to first index:       // Go downward because 
it allows inserting without messing up final index
  if  value[i] = QuoteChar then insert QuoteChar into value at position i;      
  // Duplicate any existing quotes
OutputFieldValue = QuoteChar + value + QuoteChar;       // Surround the result 
with quotes

This way whatever is used in your fields do not matter, it will be understood by any standard-validated CSV reader. Note that quotesmay be added regardless of content, but it is preferred to not add quotes where not needed as it can cause very significantweight-gains to a large file, especially one with many fields per row, and moreso if lots of fields are short or empty.

The sqlite3.exe (or its Linux counterpart) is tested to understand the correct formatted CSV as used in most places, not sure if itrecognizes the OS's preference for say ; vs. , separators, but it's an easy test. If your data gets mangled, it is probably due tothe input being not well-processed, and I hope the above will help to fix that.


Cheers,
Ryan


_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Importing a cvs file

Reply via email to