Jean-Christophe, Thanks for this idea. In fact, the purpose of my original query is exactly to reduce the database. The 800 mln rows were exported from another source, and I was hoping to be able to use sqlite to manage this massive amount of data (e.g., removing redundant information) before I proceed with my analyses. So, would your approach also work if I import the data, rather than insert?
On Fri, Jun 24, 2011 at 5:45 AM, Jean-Christophe Deschamps <j...@antichoc.net> wrote: > Rense, > >>As for the ranges of n1 and n1: they are both roughly between 60000 >>and 12000000 . >> >>Here are the results of EXPLAIN QUERY PLAN SELECT n1, n2 FROM table1 >>Where n1 < n2 INTERSECT SELECT n2, n1 FROM table1 Where n2 < n1; >> >>1|0|0|SCAN TABLE table1 (~437976176 rows) >>2|0|0|SCAN TABLE table1 (~437976176 rows) >>0|0|0|COMPOUND SUBQUERIES 1 AND 2 USING TEMP B-TREE (INTERSECT) > > If you can change your schema at all and your application supports > taking a bit more time for inserts, I believe you can achieve much > better query times. > > I tried the following schema: > CREATE TABLE "T" ( > "flags" INTEGER, > "n1" INTEGER DEFAULT (random() % 100), > "n2" INTEGER DEFAULT (random() % 100), > CONSTRAINT "T_order" CHECK(((n1 <= n2) and (flags between 1 and 3)))); > >>In my database, the mirrored pairs vastly outnumber the non-mirrored >>ones, to the extent that the non-mirrored pairs are actually extremely >>rare (but relevant). > > Given the fact that mirrored pairs dominate, if you store new pairs > (N1, N2) > as > (flags, n1, n2) > with n1 <= n2 and flags bit 0 set if (n1, n2) = (N1, N2) > and flags bit 1 set if (n1, n2) = (N2, N1) > your query becomes: > select * from T where flags = 3; > > If you build an index on flags, e.g. > CREATE INDEX "ixFlags" ON "T" ("flags"); > then it will be used (but it is regularly pointed out that in cases > where the distribution of values queried doesn't fall in small bins, an > index may in fact slow things down). > > Since your have mostly mirrored pairs, using the schema above will > decrease the size of your DB (or keep it about the same if you use the > index) but and --I see that as a big bonus-- essentially half the > number of rows. > > Note that, if it can be of any value to your application, you can even > have the mirrored pairs sorted if you make the index > CREATE INDEX "ixFlags" ON "T" ("flags", "n1"); > Of course this compound index will take some more room, but depending > on your needs it may prove useful. > > The price to pay is having to query first before insert (or have a > [slow] trigger do the work for you). Could your application support that? > > Sounds to me it could be worth trying on a reduced DB. > > > > -- > <mailto:j...@q-e-d.org>j...@antichoc.net > > _______________________________________________ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users