First - Dr. Hipp - thanks for the .import fix from 1/16 that's baked into
3.7.11: it now recognizes commas embedded within quoted strings in a CSV
and strips the double-quotes. This makes loading much smoother.
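For example, a line like

  123,"Doe, John",Chicago

(values made up for illustration) now loads as three fields instead of
four - it no longer splits on the embedded comma.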

==============

Now for 3 strange events I've been running into. Sorry - the data volumes
are large - so it's very time-consuming to run many tests to pinpoint these
further. Just raising the flag in case any of this rings a bell with anyone.

==============

(1) I import data into a table in one DB.
(2) I then create an identical table in a 2nd DB.
(3) I then copy the table from DB1 to DB2 - with some SQL ETL cleanups.

Surprisingly - the data gets appended to the original table in DB1 instead
of going into the new table in DB2.

I tried to replicate it with a couple of rows - but then it works fine.
However - with the real 100-million-row table it happens.

My workaround for the moment is to give the 2 tables on the 2 DBs different
names and ALTER TABLE ... RENAME when all is said and done (see the sketch
after the reproduction below). That works fine.

sqlite3 orig.db3
create table xyz (.....);
.import xyz.csv xyz
.exit

sqlite3 final.db3
create table xyz (.....);
attach database 'orig.db3' as indb;
insert into xyz select * from indb.xyz;
.exit

Look at the size of orig.db3 - it has doubled, with twice as many rows.
Look at the size of final.db3 - 0K.
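
For reference, the rename workaround looks roughly like this (just a
sketch - the xyz_new name is my own illustration):

sqlite3 final.db3
create table xyz_new (.....);
attach database 'orig.db3' as indb;
insert into xyz_new select * from indb.xyz;
alter table xyz_new rename to xyz;
.exit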

==============

The reason I use the 2-DB approach in lieu of just UPDATE-ing the original
DB is that on UPDATE, SQLite seems to try to read the table first. At least
it seems to do that on large tables.

Apparently - if the table is large enough - memory usage slowly rises - and
then I get an "out of memory" message.

So UPDATE works on smaller tables (up to 10-20GB), but for larger ones I
need to create a whole new copy of the table for every ETL iteration.

Since I would like each large table to end up in its own DB - and I would
like to avoid putting multiple intermediate tables in the same DB, deleting
the unnecessary ones, and finally VACUUMing - I use multiple physical DBs.
Hence the above-mentioned attach issue. A sketch of one such iteration is
below.
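
To make the copy-per-iteration step concrete, here is roughly what I mean
(the column names and the trim() cleanup are made-up placeholders, not my
real schema):

(in place - fine up to 10-20GB, then "out of memory"):

sqlite3 step1.db3
update xyz set name = trim(name);
.exit

(copy per iteration into a fresh DB instead):

sqlite3 step2.db3
create table xyz (id integer, name text);
attach database 'step1.db3' as indb;
insert into xyz select id, trim(name) from indb.xyz;
.exit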

==============

If I run an .import and mistype something - like the path of the .csv file
- or the wrong .separator - the command obviously fails. I correct the
syntax and run again - and I sometimes get an "expected x columns but found
y columns.... stopping at row nn" type of message.

Initially I very carefully looked at the data and tried to load "nn+1"
rows. Everything worked. I reran the .import and all was fine.

Then I learned to just close / reopen the DB - run the .import again -
without changing anything - and all works fine.
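
In shorthand, the sequence looks something like this (paths and messages
paraphrased from memory, not copied verbatim):

sqlite3 test.db3
.separator ,
.import /wrong/path/xyz.csv xyz
(fails - cannot open the file)
.import /data/xyz.csv xyz
(now fails with "expected x columns but found y ... stopping at row nn")
.exit

sqlite3 test.db3
.separator ,
.import /data/xyz.csv xyz
(same command in a fresh session - works fine)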

It's as if when .import runs into a problem - it doesn't clean something
up - which leads to further errors. Starting from scratch - making sure the
syntax is correct the first time - works fine.

==============

Again - sorry these sound so fuzzy. I'll try to provide more detail as I
run across it - but I'm raising them now in case they are simple to figure
out.

Thanks !