On 2014/07/18 16:53, Jonathan Leslie wrote:
I have a similar situation about to happen. I also have the issue where one of the data fields is a text section that
the user fills in and of course, he is free to put commas in this field. I believe that this will result in
higgly-piggly (that's a technical term) when I do the import. What I do is I never use a CSV file as a "Comma
separated Values" file but rather as a "Character separated Values" file, but rather use a Character, º
(<alt>0186), a legal character but not one on the keyboard, as the separator character. You can set Excel to do
this automatically, and it makes life a whole lot easier. I believe with sqlite the >.separator "º"<
directive will allow the import correctly, and keep the commas in place in the data fields.
Hi Jonathan, while this is a nice solution, it is not a perfect one. What if a user does enter that character? (Even if not directly
linked to a keyboard key, it can happen). The very correct way to do it is to simply use quoting, which preserves the text
specifically, whatever characters were used. A correct CSV format can hold any data (though most readers would have difficulty
reading Chr(0) values, but that is an implementation issue). More importantly, Excel, Open Office Calc etc, all knows exactly how to
export this correctly and without the higgly-pigglies. The only place I have seen confusion is where you export it in a country that
uses semi-colon as a separator (yes it happens, mostly in the Netherlands) and then try to read in another country where the
locality values are different (comma mostly).
Mostly the problems experienced by people is that they make some home-brew CSV importer that does not realise how to correctly read
output from a standards-based exporter such as Excel, and then try to change things like separation or quoting methods to "fix" it
after the fact.
The method described by the RFC for CSV is quite genius ( I think RFC2048, but not sure now and not important, but google will know
if you need it) so let me explain a simple way of making csv files.
The Important bits to remember is:
1 - Header, first line of the CSV must be the header, must contain field names, unless you do not have those or they are implicit,
such as in a log-type listing or import to existing structure- but any standard importer will expect row 1 to contain
field-identifiers if previously unknown.
2 - Every row must be separated by whichever Line-separation character(s) used by the operating system - this sometimes causes
problems. Carriage-Return-Linefeed (aka CR/LF) pairs (0x0D+0x0A) are mostly used in Windows, *nix environments prefer to use just
the Linefeed (LF) character (0x0A). The Windows-way allows other linefeeds or carriage returns to be inside the text and not affect
line-termination, and the *nix method is more lightweight in data terms, so both have advantages. Important to note that a
line-separator may appear inside a set of quotes which makes it simply part of the value, and not indication of line termination -
this is the biggest problem with homegrown importers I've seen, they all try read the file line-by-line which breaks badly when a
field contains such a CR/LF.
3 - if a field starts with a Quote (after any leading/trailing white-space was stripped) any and all characters up to the next quote
is part of the field value, even if they are separators (; or , or CR or LF) or white-space characters (such as SPACE/TAB/VTAB/etc.
- these are stripped when leading/trailing field values, unless in quotes) and when we come to the next Quote (typically " or 0x22,
but may differ between implementations), we check whether it's immediately followed by another quote, which means simply add the
actual quote char and continue, or not, which means the end of the field. (Note: Any text after the ending quote in a quoted field
and before the next field separator ( ; or , ) is ignored, one can use this quirk to add some comments or directives even).
4 - If you are writing fields, if your field contains any Quote characters, any separators or leading/trailing white-space
characters that must not be discarded, it must be quoted. The best method for quoting a string to be valid CSV field data is like
this pseudo code:
QuoteChar = Char(0x22); // Or whatever is used for quoting in your system
for i = last index of value down to first index: // Go downward because
it allows inserting without messing up final index
if value[i] = QuoteChar then insert QuoteChar into value at position i;
// Duplicate any existing quotes
OutputFieldValue = QuoteChar + value + QuoteChar; // Surround the result
with quotes
This way whatever is used in your fields do not matter, it will be understood by any standard-validated CSV reader. Note that quotes
may be added regardless of content, but it is preferred to not add quotes where not needed as it can cause very significant
weight-gains to a large file, especially one with many fields per row, and moreso if lots of fields are short or empty.
The sqlite3.exe (or its Linux counterpart) is tested to understand the correct formatted CSV as used in most places, not sure if it
recognizes the OS's preference for say ; vs. , separators, but it's an easy test. If your data gets mangled, it is probably due to
the input being not well-processed, and I hope the above will help to fix that.
Cheers,
Ryan
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users