On 2014/12/10 13:39, Simon Slavin wrote:
Dear folks,

A little SQL question for you.  The database file concerned is purely for data 
manipulation at the moment.  I can do anything I like to it, even at the schema 
level, without inconveniencing anyone.

I have a TABLE with about 300 million (sic.) entries in it, as follows:

CREATE TABLE s2 (a TEXT, b TEXT, theCount INTEGER)

There are numerous cases where two or more rows (up to a few thousand in some 
cases) have the same values for a and b.  I would like to merge those rows into 
one row with a 'theCount' which is the total of all the merged rows.  
Presumably I do something like

CREATE TABLE s2merged (a TEXT, b TEXT, theCount INTEGER)

INSERT INTO s2merged SELECT DISTINCT ... FROM s2

I think the one you are looking for is:

INSERT INTO s2merged SELECT a, b, sum(theCount) FROM s2 GROUP BY a,b;

Not sure whether your theCount field already contains totals or whether every row just holds a 1 -- how did the duplication happen? If it is the latter, you might also be able to use simply:

INSERT INTO s2merged SELECT a, b, count() FROM s2 GROUP BY a,b;

Either way, the second query will show you the duplication counts, should you need them.
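To make the difference between the two concrete, here is a small sketch using Python's sqlite3 module on a toy table (the sample rows are made up for illustration): sum(theCount) merges existing totals, while count(*) just counts the duplicate rows per (a, b) pair.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE s2 (a TEXT, b TEXT, theCount INTEGER)")

# Toy data: three duplicate ('x', 'y') rows, one ('p', 'q') row.
rows = [("x", "y", 1), ("x", "y", 1), ("x", "y", 3), ("p", "q", 2)]
cur.executemany("INSERT INTO s2 VALUES (?, ?, ?)", rows)

cur.execute("CREATE TABLE s2merged (a TEXT, b TEXT, theCount INTEGER)")

# Merge duplicates, totalling theCount across each (a, b) group.
cur.execute("INSERT INTO s2merged SELECT a, b, sum(theCount) FROM s2 GROUP BY a, b")
print(cur.execute("SELECT * FROM s2merged ORDER BY a").fetchall())
# -> [('p', 'q', 2), ('x', 'y', 5)]

# count(*) variant: how many duplicate rows each (a, b) pair had.
print(cur.execute("SELECT a, b, count(*) FROM s2 GROUP BY a, b ORDER BY a").fetchall())
# -> [('p', 'q', 1), ('x', 'y', 3)]
```

So count(*) only gives the same answer as sum(theCount) when every theCount is 1.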

For 300 million rows this will be rather quick if it's a once-off thing rather than something run often; I'd guess under an hour, depending on hardware and on how much duplication there is in s2. Building an index first would take a lot longer, so you are better off just running the merge as above -- unless, of course, s2merged will eventually serve as a look-up table (an attached DB or such), in which case making an index from the start will be worthwhile.
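If the look-up case applies, a composite index on (a, b) is what you'd want; a minimal sketch (the index name s2merged_ab is just an arbitrary choice, and EXPLAIN QUERY PLAN is used here only to confirm SQLite picks the index for an equality look-up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE s2merged (a TEXT, b TEXT, theCount INTEGER)")
cur.execute("INSERT INTO s2merged VALUES ('x', 'y', 5)")

# Index (a, b) only if s2merged will actually serve look-ups;
# for a one-off merge the index build is wasted work.
cur.execute("CREATE INDEX s2merged_ab ON s2merged(a, b)")

# The plan's detail column should mention the index for this query.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT theCount FROM s2merged WHERE a='x' AND b='y'"
).fetchall()
print(plan)
```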


_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
