[sqlite] trying to optimize left outer join for large data set (multiple indices needed?)

Elizabeth Purdom Mon, 08 Jun 2009 23:13:00 -0700

Hello,

I'm basically a newbie, but have been plunged into not just having 
correct code but having to optimize a particular query. The following 
query works but is very slow on my database with millions of entries in 
one of the tables.


SELECT regions.id, TOTAL(reads.data) FROM  regions LEFT OUTER JOIN reads ON
( regions.chr = reads.chr AND regions.strand=reads.strand
     AND regions.start<=reads.start AND regions.stop>=reads.stop
)
GROUP BY regions.id ORDER BY regions.id
;

[I am writing with simplified names, so if there are any typos, its 
probably that, because the code does run]

'regions' is a table of somewhere between 20K to 300K rows and 'reads' 
is the larger one with millions of rows. I'm using sqlite 3.3.3. I 
currently have a single index of (chr,strand,start,stop), though I don't 
think this is helping (see below)

I would be glad as to any advice as to the most efficient way to do this 
(and any idea of whether it will really speed anything up, or if I'm 
just stuck with the speed I see).

Just to say what I've been thinking, though I don't know if its useful:

In reading about optimizing with indexing, then it seems that I would 
want an index of the relevant columns in 'reads' (chr, strand, start, 
stop); however, since I have inequality matches for both start and stop, 
then a joint index like this will not help 
(http://www.sqlite.org/optoverview.html). I can make two indexes (chr, 
strand, start) and (chr,strand,stop), but I don't know how to get an 
outer join to use them both.

There was then an example about breaking up the where clause into two parts
1=SELECT rowid from tablename WHERE (regions.chr = reads.chr AND 
regions.strand=reads.strand
     AND regions.start<=reads.start)
2= SELECT rowid from tablename WHERE (regions.chr = reads.chr AND 
regions.strand=reads.strand
     AND regions.stop>=reads.stop)
and then intersecting the row ids, but I can't figure out to extend this 
to outer join (I'm not clear what are the correct replacement of rowid). 
I think I could do
SELECT * from regions LEFT OUTER JOIN reads ON [1st]
INTERSECT
SELECT * from regions LEFT OUTER JOIN reads ON [2nd]
but then that seems unlikely to be efficient, since you have to 
intersect all of the column information. And I also don't know how to do 
the syntax for summary that should surround this table.

Thanks for any assistance,
Elizabeth Purdom



_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] trying to optimize left outer join for large data set (multiple indices needed?)

Reply via email to