At 16:36 01/08/2015, Igor wrote:
????
>There are many real problems with CSV - no need to make up imaginary ones.
`---
Indeed and I'm surprised noone mentionned this from the very start of
this thread: Nulls. There is no real provision in the RFC to represent
Null, or distinguish Null from an empty string.
I've long established my own convention to transfer data between the
language I use most (AutoIt), CSV files and SQLite and I'm very happy
with that.
Let's look at which datatypes we have to deal with, at least for SQLite:
o) string
o) integer
o) real
o) binary
o) Null <-- that's very important and most often forgoten!
Within my humble convention, a CSV field is either:
o) a string delimited by a parametrable delimiter (double quote by
default) where embedded delimiters are escaped by doubling
o) an integer, optionally signed
o) a real (e.g. 3.1415926 or -5.6e-12)
o) a binary in 0x0123456789abcdef form
o) an empty field or a field containing Null (w/o delimiter)
represents Null
Fields are separated by separators (comma by default).
Rows are terminated by any Unicode line termination (defined by PCRE
BSR_ANY).
An optional header may be present with column names as a series of
string types.
That deals with SQLite datatypes. I've additionnally accept datatypes
from/to AutoIt to/from SQLite (some of them don't make much sense
storing in a DB, but anyway):
o) a field containing True or False (w/o delimiter) represents 1 or
0 as a boolean
o) a field containing "<-~Default~->" represents the keyword Default
o) function references are encoded as strings in the format
"<-(myFunc)->"
o) pointers are encoded as strings in the format "<-*0123456789"*->"
o) Windows handles are encoded in the form "<-&0123456789&->"
o) a C-style structure is encoded by "<-{0123456789ABCDEF}->"
and the following more exotic ones (their actuel content is not stored):
o) a COM object is denoted by the format "<- at objectName@->"
o) an array is denoted by the format "<-[3][4]...[2]->"
o) all other unrepresentable variants are denoted by "<-?variantType?->"
I agree that the convention of using strings like "<- ... ->" is
questionable and is merely a hint that some variable(s) should have
been handled otherwise, but if one focuses only on SQLite (and Excel or
such) datatypes, the format is definitely unambiguous and easily parsed
by a single regexp and encoded with minimum effort.
The benefit of representing datatypes unambiguously in round-trip
operations should be obvious.
Granted that doesn't solve the direct import from Excel CSV or some
other data source using its own "standard", but I prefer having to code
the few lines of AutoIt needed to grab data directly from Excel or
other COM objects and format the data according to my own convention
than rely on variable formats found around.
Again, the issue of not providing a clear way to represent SQL Null is
a big hole in the RFC (I understand it was not the purpose at the time
it was issued and the fact that it was too late to enforce something
better). Also the optional string delimiter makes it essentially
impossible to distinguish the integer value 1234 from the string
"0001234", another issue that needs addressing.