Re: [sqlite] CSV import using CLI (header, NULL)

2020-02-29 Thread Warren Young
On Feb 27, 2020, at 11:51 PM, Christof Böckler  wrote:
> 
> 1. There should be no penalty for using header lines in CSV files. Thus a new 
> flag -h for .import is much appreciated.

More than that, SQLite should be able to *actively use* the header when present.

For instance, given:

foo,bar,qux
1,2,3

then there should be a way to import only the first and third columns, perhaps 
with a command like so:

.import data.csv tablename foo,qux

I’ve needed such a thing multiple times.

One case that comes to mind is that tablename has only two columns, and you’re 
purposefully subsetting the CSV data in SQLite.  This is common when importing 
DBs from outside sources, where the other DB has more columns than your local 
SQLite DB.  It’s annoying to have to preprocess the data to strip the “extra” 
columns out.  It’s even more annoying to subset it in SQLite by mirroring the 
outside DB’s structure and then applying the usual workaround for the lack of 
ALTER TABLE DROP COLUMN.
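To make the idea concrete: the proposed ".import data.csv tablename foo,qux" does not exist, but the effect can be emulated today. Here is a sketch using Python's stdlib csv and sqlite3 modules; the table and column names ("tab", "foo", "qux") come from the example above.

```python
# Sketch only: emulates the proposed column-subset import, which the
# sqlite3 CLI does not currently offer.
import csv
import io
import sqlite3

data = "foo,bar,qux\n1,2,3\n"               # the example CSV, header included

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (foo INT, qux INT)")  # local table: 2 columns

reader = csv.DictReader(io.StringIO(data))  # header row maps names to values
wanted = ("foo", "qux")                     # the columns we actually want
con.executemany(
    "INSERT INTO tab (foo, qux) VALUES (?, ?)",
    ([row[c] for c in wanted] for row in reader),
)
rows = con.execute("SELECT foo, qux FROM tab").fetchall()
# rows == [(1, 3)] -- INT affinity coerces the CSV strings to integers
```

The "bar" column is simply never read, which is exactly what the hypothetical third argument to .import would express directly.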

Another case I’ve run into before is that the table layout of the CSV is the 
same as in SQLite, but you want one or more of the columns to get their default 
value for some reason, not take them from the CSV.  You can see this with 
time-varying data when the CSV is output from SQLite at time t0 and reimported 
at t1 after some of the data has gone stale, so that letting the stale columns 
revert to their defaults is better than importing obsolete data.

To make the second example concrete, imagine an HVAC monitoring system’s DB: 
the names of the stations and the wear data are still useful, but the last 
known temperature shouldn’t be imported because the old data could be entirely 
inaccurate, and fresh data should be coming along soon after the DB table is 
restored from the backup.  In the meantime, the system should report “I don’t 
know” for the temperature measurements.
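A sketch of that HVAC scenario, with an invented schema and data: reimport only the trustworthy columns and let the stale one fall back to its DEFAULT.

```python
# Hypothetical HVAC schema for illustration; "temp_c" defaults to NULL,
# which reads as "I don't know" until fresh measurements arrive.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE station ("
    " name   TEXT,"
    " wear   REAL,"
    " temp_c REAL DEFAULT NULL)"
)

# Stale CSV rows: (name, wear, last known temperature)
stale = [("lobby", 0.12, 21.5), ("roof", 0.40, 18.0)]

# Insert by naming only the stable columns; temp_c takes its DEFAULT.
con.executemany("INSERT INTO station (name, wear) VALUES (?, ?)",
                [(name, wear) for name, wear, _temp in stale])
result = con.execute("SELECT name, wear, temp_c FROM station").fetchall()
# result == [('lobby', 0.12, None), ('roof', 0.40, None)]
```

A column-selecting .import would let the CLI do this directly instead of routing the data through a script or a temp table.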

> 2. Missing values in columns with numerical affinity should show up as NULL 
> values.

I don’t think one hard-and-fast rule works for everyone here, if for no other 
reason than that there must be people depending on the current behavior.

Instead, let it be a mode:

.blank default
.blank blank
.blank zero
.blank null

This proposed setting tells the CSV importer how to handle blanks: use the DB 
column’s default value, leave them be, make them zeroes, or make them SQL NULLs.

The default value for this setting would be “blank” for backwards compatibility.

Parallel .zero and .null settings could be added to take care of similar 
cases.  The defaults would obviously be “zero” and “null”, respectively.
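The semantics of the proposed modes can be pinned down in a few lines. This function is purely hypothetical; only the mode names come from the proposal above.

```python
# Hypothetical sketch of what each proposed ".blank" mode would do to an
# empty CSV field before it is bound into an INSERT.
def convert_blank(value, mode, column_default=None):
    """Map an empty CSV field according to a .blank-style mode."""
    if value != "":
        return value            # non-blank fields pass through unchanged
    if mode == "default":
        return column_default   # use the DB column's DEFAULT value
    if mode == "blank":
        return ""               # current behaviour: keep the empty string
    if mode == "zero":
        return 0
    if mode == "null":
        return None             # becomes SQL NULL when bound as a parameter
    raise ValueError("unknown .blank mode: %s" % mode)
```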
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] CSV import using CLI (header, NULL)

2020-02-29 Thread Dominique Devienne
On Sat, Feb 29, 2020 at 1:42 PM Shawn Wagner wrote:

> To import a csv file with headers into an existing table, you can use
> .import '| tail -n +2 yourfile.csv' yourtable
> to skip the header line.


On Unix only, and by shelling out to native tools, so not portable.
The CLI ought to have something built in, if it doesn’t already.
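For comparison, here is a portable sketch of the same header-skipping import using only Python's stdlib, so it behaves identically on Unix and Windows; the table and data come from the example later in the thread.

```python
# Portable equivalent of the "tail -n +2" trick: read the CSV, drop the
# header row, and bulk-insert the rest.
import csv
import io
import sqlite3

data = "id,val1,val2\nA,27,8\nB,3,12\n"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (id TEXT, val1 INT, val2 INT)")

reader = csv.reader(io.StringIO(data))
next(reader)                    # skip the header line, like tail -n +2
con.executemany("INSERT INTO tab VALUES (?, ?, ?)", reader)
rows = con.execute("SELECT * FROM tab ORDER BY val2").fetchall()
# rows == [('A', 27, 8), ('B', 3, 12)] -- numeric sort via INT affinity
```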



Re: [sqlite] CSV import using CLI (header, NULL)

2020-02-29 Thread Shawn Wagner
To import a csv file with headers into an existing table, you can use

.import '| tail -n +2 yourfile.csv' yourtable

to skip the header line.

On Sat, Feb 29, 2020, 4:30 AM Christof Böckler  wrote:

> Hi,
>
> I want to share some thoughts and make some suggestions about the SQLite
> 3 command line interface (CLI) tool, especially its behaviour when
> importing CSV files.
>
> CSV files are probably even more common for data exchange than SQLite
> database files. I consider it to be good practice to include a header
> line with column names in every CSV file. Metadata should go with the
> data. This prevents me from mixing up two columns that contain numbers
> with similar value distributions.
>
> Let’s look at an example. A file named data.csv contains three lines:
>
> id,val1,val2
> A,27,8
> B,3,12
>
> Now …
> sqlite3
> sqlite> .import data.csv tab
>
> works and looks good at first, but there is a problem with numerical data.
>
> sqlite> SELECT * FROM tab ORDER BY val2;
> B,3,12
> A,27,8
>
> This is because on import all three columns were created with affinity
> (not to say data type) TEXT (see .schema). As a consequence all numbers
> were imported as strings.
> '12' < '8' is correct lexicographically, but not numerically.
> Having the CSV file in mind I clearly expect to see 8 on the first line
> of the above result.
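The affinity effect Christof describes is easy to reproduce outside the CLI; here is a sketch with Python's sqlite3 module, loading the same values into a TEXT column and an INT column side by side.

```python
# The same data sorts differently depending on column affinity: TEXT
# compares lexicographically, INT compares numerically.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE as_text (id TEXT, val2 TEXT)")
con.execute("CREATE TABLE as_int  (id TEXT, val2 INT)")
for t in ("as_text", "as_int"):
    con.executemany("INSERT INTO %s VALUES (?, ?)" % t,
                    [("A", "8"), ("B", "12")])

text_order = [r[0] for r in
              con.execute("SELECT id FROM as_text ORDER BY val2")]
int_order = [r[0] for r in
             con.execute("SELECT id FROM as_int ORDER BY val2")]
# text_order == ['B', 'A']  ('12' < '8' lexicographically)
# int_order  == ['A', 'B']  (8 < 12 numerically)
```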
>
> How to work around this? Just define the table in advance with
> appropriate data types (affinity INTEGER):
>
> sqlite> CREATE TABLE tab (id TEXT, val1 INT, val2 INT);
>
> But now the above .import command will not work as expected, because it
> will result in three rows in our table tab. The first row contains the
> header line.
>
> Two different workarounds come to my mind:
> a) sqlite> DELETE FROM tab WHERE rowid = 1;  -- Dirty hack!
> b) sqlite> .import data.csv temp_tab
> sqlite> INSERT INTO tab SELECT * FROM temp_tab;
> sqlite> DROP TABLE temp_tab;  -- Two tedious extra lines
>
> Both approaches are not very appealing to me. To make CSV files with a
> header line first class citizens, I suggest this instead:
> sqlite> .import -h data.csv tab
> should ignore the first line of data.csv. This import should fail if
> table tab does not already exist.
> This is both shorter and more elegant than either workaround.
>
>
> Now on to a second issue. Let’s assume you have successfully imported a
> file containing these four lines:
> id,val1,val2
> A,27,8
> B,3,12
> C,,1
> into the table tab mentioned above, resulting in three rows. Notice the
> missing value in column val1 on the last line. This missing value is
> imported as an empty string '' regardless of the affinity of column val1.
>
> That leads to (mathematically) unexpected results from aggregate functions:
> sqlite> SELECT SUM(val1) FROM tab;
> 30    -- Well, treating '' like 0 is OK in this case
> sqlite> SELECT COUNT(val1) FROM tab;
> 3     -- but not so here; only two values/numbers were given in data.csv
> sqlite> SELECT AVG(val1) FROM tab;
> 10    -- wrong: the average of 3 and 27 is 15
> sqlite> SELECT MAX(val1) FROM tab;
> ''    -- not to be expected when looking at data.csv
>
> OK, I hear you say, what about this workaround:
> sqlite> UPDATE tab SET val1 = NULL WHERE val1 = '';
> This makes the surprises above go away, but it is again tedious to do for
> all columns containing only numbers.
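The aggregate surprises and the UPDATE workaround can be reproduced end to end; a sketch with Python's sqlite3 module, using the schema and data from the message above:

```python
# Aggregates over a missing value imported as '' versus as NULL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (id TEXT, val1 INT, val2 INT)")
con.executemany("INSERT INTO tab VALUES (?, ?, ?)",
                [("A", 27, 8), ("B", 3, 12), ("C", "", 1)])

before = con.execute(
    "SELECT SUM(val1), COUNT(val1), AVG(val1), MAX(val1) FROM tab"
).fetchone()
# before == (30, 3, 10.0, '') -- '' counts as a row and sorts above numbers

con.execute("UPDATE tab SET val1 = NULL WHERE val1 = ''")
after = con.execute(
    "SELECT SUM(val1), COUNT(val1), AVG(val1), MAX(val1) FROM tab"
).fetchone()
# after == (30, 2, 15.0, 27) -- NULLs are ignored by the aggregates
```

The "after" results match what the CSV actually says, which is the behaviour the NULL-on-import suggestion would give directly.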
>
> My suggestion is: If someone goes the extra mile and defines a table in
> advance in order to have an appropriate numerical affinity (INTEGER, REAL
> or NUMERIC) for a column, then it is OK to encode a missing value as NULL
> instead of ''. It seems right though to keep the current behaviour for
> columns with affinity TEXT (the default) or BLOB.
>
> To sum things up:
> 1. There should be no penalty for using header lines in CSV files. Thus a
> new flag -h for .import is much appreciated.
> 2. Missing values in columns with numerical affinity should show up as
> NULL values.
>
> Thanks for reading, I look forward to your opinions about these issues.
>
> Greetings
> Christof