On Thu, Jun 23, 2011 at 11:20:22 -0700, Rense Corten wrote:
> Wow, I'm almost embarrassed by how helpful you are, gentlemen. Thanks a lot.
> 
> As to RAM: I'm trying this on two different machines, one with 12 Gb
> and one with 32 Gb RAM. I won't be able to get more in the near
> future. Something that might be relevant is that the case of the 32Gb
> machine, the database is on an AFS.

I hope they are running 64-bit OS. You also need to make sure you are using
64-bit build of SQLite (32-bit build can only use between 2 and 3 GB and
that's not enough for such huge database).

Than you need to tell sqlite to use the memory for cache. The cache size is
specified in multiples of page size, so first ask sqlite to tell you what the
database's page size is by issuing

    pragma page_size;

query and than set the cache_size by issuing

    pragma cache_size = <value>;

where <value> is desired cache size divided by the page size. You need to
load about billion rows times two integers, so it will certainly have use for
8GB cache. Try giving it 8 GB on the 12 GB machine and perhaps 24 GB on the
32 GB one.

Since the setting is connection-local, you need to issue the pragma
cache_size command in the application before doing the big operation.


AFS (do I remember right that it's a network filesystem?) is likely slow for
this purpose. Trying on local disk may help.

> As for the ranges of n1 and n1: they are both roughly between 60000
> and 12000000 .
> 
> Here are the results of  EXPLAIN QUERY PLAN SELECT n1, n2 FROM table1
> Where n1 < n2 INTERSECT SELECT n2, n1 FROM table1 Where n2 < n1;
> 
> 1|0|0|SCAN TABLE table1 (~437976176 rows)
> 2|0|0|SCAN TABLE table1 (~437976176 rows)
> 0|0|0|COMPOUND SUBQUERIES 1 AND 2 USING TEMP B-TREE (INTERSECT)
> 
> I'm not sure what "reasonable" should look like :). I'm running this
> query right now, let's see what happens.

Well, it shows sqlite is indeed trying to create temporary index to do the
intersect, which can be reasonably fast if the data can fit in the cache
for sorting, but if it starts trashing the cache, it will be pretty slow.
8 GB would hopefully be enough, anything less definitely won't.

Reading such large tables in index order tends to be significantly slower
than reading them sequentially, but covering index (that is one containing
all columns needed, so in this case both n1 and n2) might help as that would
be read in sequential order. I am not sure how well the query planner will
manage to use it here though.

It's worth experimenting with different ways to write the query and indices,
perhaps on a smaller sample first. However creating index, temporary or
regular, will always need huge amount of cache and be unusably slow
otherwise, so setting the cache is probably the most important thing you have
to do.

> In my database, the mirrored pairs vastly outnumber the non-mirrored
> ones, to the extent that the non-mirrored pairs are actually extremely
> rare (but relevant).
> 
> Generally, is there a way to figure out if sqlite is still doing
> anything, or whether it got somehow stuck?

It's possible to install a hook using the
http://sqlite.org/c3ref/progress_handler.html API, which will be called every
N virtual machine instructions, but there is no estimate how many
instructions will be needed overall nor how many instructions are left until
query completion and different instructions (in different phases of the
query) may take different amount of time.

> One thing I noticed is that
> it is not writing to the database while my problematic queries are
> running. Should it be?

Not yet. According to the above query plan, it's first going to do the first
subselect and sort it's output and until that is done, it won't have any
output and thus won't write anything to the database. Than it would probably
be more efficient to sort the second subselect too and merge the two lists,
but I think it's actually not going to.

-- 
                                                 Jan 'Bulb' Hudec <b...@ucw.cz>
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to