Jean-Christophe,
Thanks for this idea. In fact, the purpose of my original query is
exactly to reduce the database. The 800 mln rows were exported from
another source, and I was hoping to be able to use sqlite to manage
this massive amount of data (e.g., removing redundant information)
before I proceed with my analyses.
So, would your approach also work if I import the data, rather than insert?

On Fri, Jun 24, 2011 at 5:45 AM, Jean-Christophe Deschamps
<j...@antichoc.net> wrote:
> Rense,
>
>>As for the ranges of n1 and n1: they are both roughly between 60000
>>and 12000000 .
>>
>>Here are the results of  EXPLAIN QUERY PLAN SELECT n1, n2 FROM table1
>>Where n1 < n2 INTERSECT SELECT n2, n1 FROM table1 Where n2 < n1;
>>
>>1|0|0|SCAN TABLE table1 (~437976176 rows)
>>2|0|0|SCAN TABLE table1 (~437976176 rows)
>>0|0|0|COMPOUND SUBQUERIES 1 AND 2 USING TEMP B-TREE (INTERSECT)
>
> If you can change your schema at all and your application supports
> taking a bit more time for inserts, I believe you can achieve much
> better query times.
>
> I tried the following schema:
> CREATE TABLE "T" (
>   "flags" INTEGER,
>   "n1" INTEGER DEFAULT (random() % 100),
>   "n2" INTEGER DEFAULT (random() % 100),
>   CONSTRAINT "T_order" CHECK(((n1 <= n2) and (flags between 1 and 3))));
>
>>In my database, the mirrored pairs vastly outnumber the non-mirrored
>>ones, to the extent that the non-mirrored pairs are actually extremely
>>rare (but relevant).
>
> Given the fact that mirrored pairs dominate, if you store new pairs
>  (N1, N2)
> as
>  (flags, n1, n2)
> with n1 <= n2 and flags bit 0 set if (n1, n2) = (N1, N2)
>               and flags bit 1 set if (n1, n2) = (N2, N1)
> your query becomes:
>   select * from T where flags = 3;
>
> If you build an index on flags, e.g.
>   CREATE INDEX "ixFlags" ON "T" ("flags");
> then it will be used (but it is regularly pointed out that in cases
> where the distribution of values queried doesn't fall in small bins, an
> index may in fact slow things down).
>
> Since your have mostly mirrored pairs, using the schema above will
> decrease the size of your DB (or keep it about the same if you use the
> index) but and --I see that as a big bonus-- essentially half the
> number of rows.
>
> Note that, if it can be of any value to your application, you can even
> have the mirrored pairs sorted if you make the index
>   CREATE INDEX "ixFlags" ON "T" ("flags", "n1");
> Of course this compound index will take some more room, but depending
> on your needs it may prove useful.
>
> The price to pay is having to query first before insert (or have a
> [slow] trigger do the work for you).  Could your application support that?
>
> Sounds to me it could be worth trying on a reduced DB.
>
>
>
> --
> <mailto:j...@q-e-d.org>j...@antichoc.net
>
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to