At 16:36 01/08/2015, Igor wrote:
????
>There are many real problems with CSV - no need to make up imaginary ones.
`---

Indeed and I'm surprised noone mentionned this from the very start of 
this thread: Nulls. There is no real provision in the RFC to represent 
Null, or distinguish Null from an empty string.

I've long established my own convention to transfer data between the 
language I use most (AutoIt), CSV files and SQLite and I'm very happy 
with that.

Let's look at which datatypes we have to deal with, at least for SQLite:
   o) string
   o) integer
   o) real
   o) binary
   o) Null       <-- that's very important and most often forgoten!

Within my humble convention, a CSV field is either:
   o) a string delimited by a parametrable delimiter (double quote by 
default) where embedded delimiters are escaped by doubling
   o) an integer, optionally signed
   o) a real (e.g. 3.1415926 or -5.6e-12)
   o) a binary in 0x0123456789abcdef form
   o) an empty field or a field containing Null (w/o delimiter) 
represents Null

Fields are separated by separators (comma by default).
Rows are terminated by any Unicode line termination (defined by PCRE 
BSR_ANY).
An optional header may be present with column names as a series of 
string types.

That deals with SQLite datatypes. I've additionnally accept datatypes 
from/to AutoIt to/from SQLite (some of them don't make much sense 
storing in a DB, but anyway):
   o) a field containing True or False (w/o delimiter) represents 1 or 
0 as a boolean
   o) a field containing "<-~Default~->" represents the keyword Default
   o) function references are encoded as strings in the format 
"<-(myFunc)->"
   o) pointers are encoded as strings in the format "<-*0123456789"*->"
   o) Windows handles are encoded in the form "<-&0123456789&->"
   o) a C-style structure is encoded by "<-{0123456789ABCDEF}->"
and the following more exotic ones (their actuel content is not stored):
   o) a COM object is denoted by the format "<- at objectName@->"
   o) an array is denoted by the format "<-[3][4]...[2]->"
   o) all other unrepresentable variants are denoted by "<-?variantType?->"

I agree that the convention of using strings like "<- ... ->" is 
questionable and is merely a hint that some variable(s) should have 
been handled otherwise, but if one focuses only on SQLite (and Excel or 
such) datatypes, the format is definitely unambiguous and easily parsed 
by a single regexp and encoded with minimum effort.
The benefit of representing datatypes unambiguously in round-trip 
operations should be obvious.

Granted that doesn't solve the direct import from Excel CSV or some 
other data source using its own "standard", but I prefer having to code 
the few lines of AutoIt needed to grab data directly from Excel or 
other COM objects and format the data according to my own convention 
than rely on variable formats found around.

Again, the issue of not providing a clear way to represent SQL Null is 
a big hole in the RFC (I understand it was not the purpose at the time 
it was issued and the fact that it was too late to enforce something 
better). Also the optional string delimiter makes it essentially 
impossible to distinguish the integer value 1234 from the string 
"0001234", another issue that needs addressing. 

Reply via email to