I have a database containing 100 million records, in which each record contains (in sequence) all of the words in a 100 million word collection of texts. There are two columns: TheID (offset value) and TheWord (sequential words), e.g.:
TheID TheWord ---- ----- 1 I 2 saw 3 the 4 man 5 that 6 came . . . 100000000 xxx To extract strings, I then use self-joins on this one table, in which [ID], [ID-1], [ID+1] etc are used to find preceding and following words, e.g.: select count(*),w1.w1,w2.w1,w3.w1 from ((select w1, ID+1 as ID from seq where w1 in ('the','that','this')) w1 inner join (select w1, ID as ID from seq where w1 in ('man','woman','person')) w2 on w2.ID = w1.ID) inner join (select w1, ID-1 as ID from seq where w1 in ('who','that','which')) w3 on w3.ID=w1.ID group by w1.w1,w2.w1,w3.w1 This would yield results like "the man that" (words 3-5 above),"that woman who","this man which", etc. The problem is, the self-join solution is extremely slow. I have a SQL Server 7.0 database with a clustered index on TheWord (sequential words) and a normal index on TheID. Even with all of this, however, a self-join query like the one just listed takes about 15 seconds on my machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0). Any suggestions? Have I messed up in terms of the SQL statement? Thanks in advance for any help that you can give. Mark Davies Illinois State University P.S. Yes, I know about Full-Text Indexing in SQL Server, but it's not adequate for my purposes -- there's a lot more to the project than what I've described here. ---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])