[SQL] Slow self-join on a 100 million record table

Mark Davies Wed, 01 Jan 2003 21:17:28 -0800

I have a database containing 100 million records, in which each record
contains (in sequence) all of the words in a 100 million word
collection of texts.  There are two columns: TheID (offset value) and
TheWord (sequential words), e.g.:


TheID  TheWord
----  -----

1  I 
2  saw 
3  the 
4  man 
5  that
6  came
 . . . 
100000000 xxx 

To extract strings, I then use self-joins on this one table, in which
[ID], [ID-1], [ID+1] etc are used to find preceding and following
words, e.g.:

select count(*),w1.w1,w2.w1,w3.w1 from 
((select w1, ID+1 as ID from seq where w1 in ('the','that','this')) w1
inner join 
(select w1, ID as ID from seq where w1 in ('man','woman','person')) w2
on w2.ID = w1.ID) 
inner join 
(select w1, ID-1 as ID from seq where w1 in ('who','that','which')) w3
on w3.ID=w1.ID 
group by w1.w1,w2.w1,w3.w1 

This would yield results like "the man that" (words 3-5 above),"that
woman who","this man which", etc.

The problem is, the self-join solution is extremely slow.  I have a
SQL Server 7.0 database with a clustered index on TheWord (sequential
words) and a normal index on TheID.  Even with all of this, however, a
self-join query like the one just listed takes about 15 seconds on my
machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).

Any suggestions?  Have I messed up in terms of the SQL statement? 
Thanks in advance for any help that you can give.

Mark Davies
Illinois State University

P.S. Yes, I know about Full-Text Indexing in SQL Server, but it's not
adequate for my purposes -- there's a lot more to the project than
what I've described here.

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])

[SQL] Slow self-join on a 100 million record table

Reply via email to