Re: [sqlite] Importing a cvs file

Jonathan Leslie Fri, 18 Jul 2014 12:32:52 -0700

Ryan, 

You are quite right, it is not a perfect solution.  Thank you for your detailed 
explanation your's is a much more complete solution.  I never knew about the  
quotes syntax being formal. that is very good to know.   Sure enough, it works 
just as you described:



1ºfield 1ºfield 2ºthis is a comment fieldºnumber 4ºlast field
2ºfield 1ºfield 2ºthis is a comment fieldºnumber 4ºlast field
3ºfield 1ºfield 2ºthis is a comment, has a comma in it. ºnumber 4ºlast field
4ºfield 1ºfield 2º"this is a commentº it has my 0186 in it. "ºnumber 4ºlast 
field
5ºfield 1ºfield 2º"this is a comment field lets quote ""something"" and see 
what happens. "ºnumber 4ºlast field
6ºfield 1ºfield 2ºthis is a comment fieldºnumber 4ºlast field



On Friday, July 18, 2014 11:49 AM, RSmith <[email protected]> wrote:
 

>
>
>
>On 2014/07/18 16:53, Jonathan Leslie wrote:
>> I have a similar situation about to happen.  I also have the issue where one 
>> of the data fields is a text section that the user fills in and of course, 
>> he is free to put commas in this field.   I believe that this will result in 
>> higgly-piggly (that's a technical term) when I do the import.   What I do is 
>> I never use a CSV file as a "Comma separated Values" file but rather as a 
>> "Character separated Values"  file, but rather use a Character, º 
>> (<alt>0186), a legal character but not one on the keyboard, as the separator 
>> character.  You can set Excel to do this automatically, and it makes life a 
>> whole lot easier.   I believe with sqlite the >.separator "º"< directive 
>> will allow the import correctly, and keep the commas in place in the data 
>> fields.
>
>Hi Jonathan, while this is a nice solution, it is not a perfect one. What if a 
>user does enter that character? (Even if not directly 
>linked to a keyboard key, it can happen). The very correct way to do it is to 
>simply use quoting, which preserves the text 
>specifically, whatever characters were used. A correct CSV format can hold any 
>data (though most readers would have difficulty 
>reading Chr(0) values, but that is an implementation issue). More importantly, 
>Excel, Open Office Calc etc, all knows exactly how to 
>export this correctly and without the higgly-pigglies. The only place I have 
>seen confusion is where you export it in a country that 
>uses semi-colon as a separator (yes it happens, mostly in the Netherlands) and 
>then try to read in another country where the 
>locality values are different (comma mostly).
>
>Mostly the problems experienced by people is that they make some home-brew CSV 
>importer that does not realise how to correctly read 
>output from a standards-based exporter such as Excel, and then try to change 
>things like separation or quoting methods to "fix" it 
>after the fact.
>
>The method described by the RFC for CSV is quite genius ( I think RFC2048, but 
>not sure now and not important, but google will know 
>if you need it) so let me explain a simple way of making csv files.
>
>The Important bits to remember is:
>
>1 - Header, first line of the CSV must be the header, must contain field 
>names, unless you do not have those or they are implicit, 
>such as in a log-type listing or import to existing structure- but any 
>standard importer will expect row 1 to contain 
>field-identifiers if previously unknown.
>
>2 - Every row must be separated by whichever Line-separation character(s) used 
>by the operating system - this sometimes causes 
>problems.  Carriage-Return-Linefeed (aka CR/LF) pairs (0x0D+0x0A) are mostly 
>used in Windows, *nix environments prefer to use just 
>the Linefeed (LF) character (0x0A). The Windows-way allows other linefeeds or 
>carriage returns to be inside the text and not affect 
>line-termination, and the *nix method is more lightweight in data terms, so 
>both have advantages. Important to note that a 
>line-separator may appear inside a set of quotes which makes it simply part of 
>the value, and not indication of line termination - 
>this is the biggest problem with homegrown importers I've seen, they all try 
>read the file line-by-line which breaks badly when a 
>field contains such a CR/LF.
>
>3 - if a field starts with a Quote (after any leading/trailing white-space was 
>stripped) any and all characters up to the next quote 
>is part of the field value, even if they are separators (; or , or CR or LF) 
>or white-space characters (such as SPACE/TAB/VTAB/etc. 
>- these are stripped when leading/trailing field values, unless in quotes) and 
>when we come to the next Quote (typically " or 0x22, 
>but may differ between implementations), we check whether it's immediately 
>followed by another quote, which means simply add the 
>actual quote char and continue, or not, which means the end of the field. 
>(Note: Any text after the ending quote in a quoted field 
>and before the next field separator ( ; or , ) is ignored, one can use this 
>quirk to add some comments or directives even).
>
>4 - If you are writing fields, if your field contains any Quote characters, 
>any separators or leading/trailing white-space 
>characters that must not be discarded, it must be quoted. The best method for 
>quoting a string to be valid CSV field data is like 
>this pseudo code:
>
>QuoteChar = Char(0x22); // Or whatever is used for quoting in your system
>for i = last index of value down to first index:       // Go downward because 
>it allows inserting without messing up final index
>   if  value[i] = QuoteChar then insert QuoteChar into value at position i;    
>    // Duplicate any existing quotes
>OutputFieldValue = QuoteChar + value + QuoteChar;       // Surround the result 
>with quotes
>
>
>This way whatever is used in your fields do not matter, it will be understood 
>by any standard-validated CSV reader. Note that quotes 
>may be added regardless of content, but it is preferred to not add quotes 
>where not needed as it can cause very significant 
>weight-gains to a large file, especially one with many fields per row, and 
>moreso if lots of fields are short or empty.
>
>The sqlite3.exe (or its Linux counterpart) is tested to understand the correct 
>formatted CSV as used in most places, not sure if it 
>recognizes the OS's preference for say ; vs. , separators, but it's an easy 
>test. If your data gets mangled, it is probably due to 
>the input being not well-processed, and I hope the above will help to fix that.
>
>Cheers,
>Ryan
>
>
>
>_______________________________________________
>sqlite-users mailing list
>[email protected]
>http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
>
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Importing a cvs file

Reply via email to