It looks like a multi-column index would help more than the separate single-column indexes, at least for the example query. An index on (termid, docid, whichpara, offset) might work better than the one on termid alone.
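Something along those lines might look like this (untested against your data; the index name is just a placeholder I made up). MySQL will generally use only one index per table reference in a query, so the separate single-column indexes can't be combined for the self-join:

  -- placeholder name; compound index matching the join/filter columns
  ALTER TABLE inv0web02
    ADD INDEX term_doc_para_offset_index (termid, docid, whichpara, offset);

With that in place, the t1 side of the join can be looked up on termid, docid and whichpara together instead of on docid alone, so each t0 row should only have to touch a few index entries rather than the ~400 that EXPLAIN shows now. If the optimizer picks the new key, EXPLAIN should report a larger key_len for t1 and far fewer rows. Once it's working, you could probably also drop termid_index and docid_index, since the new index starts with termid and the primary key already starts with docid.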
joe

On Thu, 2004-01-15 at 10:53, Gregory Newby wrote:
> I'm using MySQL for an information retrieval application where word
> occurrences are indexed. It seems that performance is not as good as
> I would expect (it seems nearly linear with the number of rows).
> Any advice would be welcome. I'll lay out a lot of detail.
>
> Some details follow. The basic scenario is:
> - there are over 498M rows. This seems to qualify it as a
>   rather large table
> - the table has 6 fixed-length fields
> - I have created indexes on each field
> - I set MAX_ROW_SIZE to 550,000,000
> - I've tried this with both MyISAM and INNODB tables, with very
>   similar performance
> - Slower inserts are not really a problem, if I could get faster searching
> - I've run ANALYZE, OPTIMIZE, myisamchk etc. (see below)
> - Using a socket for local communication; MySQL++ for apps, but I've
>   also confirmed benchmarking results using the mysql> command line
> - I'm not presenting a lot of variety in queries and response times below,
>   but can send more if needed.
> - Due to the patterns in the data, there are many many duplicate
>   values in the indexes. I suspect this might be important.
> - This is not a results transport issue. I'm able to retrieve
>   about 1000 rows/second, which is OK. It's generating the result set,
>   prior to results transport, which is slow.
>
> Queries are definitely disk-bound, with 5-40% of CPU utilization
> during a long-running query. (I did a little experimentation with a
> ramfs to confirm this...too bad I don't have 40GB of RAM).
>
> The application: I want to run queries that will identify particular
> DOCUMENTS (via "docid") that have particular TERMS (via "termid").
> This is your garden variety task for a search engine or other
> information retrieval system. I often submit additional criteria,
> such as adjacent terms (offset A - offset B=1) or terms in a
> particular tag (term A in <title> and term B in <title>) or proximity
> (paragraph A == paragraph B). Being able to do such powerful queries
> is the whole point of using MySQL, rather than trying to manage things
> myself (using BerkeleyDB or home-grown B-trees, as I've done in the
> past).
>
> I have made a quickie 2000-document, then a longer 100,000 document,
> then the full 1.2M document database (input is about 20GB of HTML).
> The challenge is that on the 1.2M document dataset with nearly 1/2 a
> billion rows (term occurrences), I just can't seem to get a quick
> response for combinations with terms occurring in a lot of rows.
>
> Maybe the answer is that this is the best I can expect due to
> duplicate key values and other factors (that is, there are many terms
> with each docid....many many terms with each paragraph id....many many
> many terms with each offset, etc.). But I hope there might be other
> tuning parameters or tips that will speed things up.
>
> If I look for terms in only a few hundred rows, no problem. But for
> terms with thousands of rows, it takes a very long time (several
> minutes for the first 1000 rows to an hour for all rows) to get the
> query output, even when I order by and limit to get a smaller output
> set.
>
> Concrete example:
>
> select * from inv0web02 as t0,inv0web02 as t1 where ((t0.termid=35)
> and (t1.termid=141) and (t0.docid=t1.docid) and
> (t0.whichpara=t1.whichpara) and ( (t1.offset - t0.offset) = 1)) order
> by t0.weight_in_subdoc limit 1000
>
> This takes about 3 minutes where term35="web" and term141="page" (very
> common terms).
> (Timing depends slightly on whether I employ a Result
> or ResUse in my MySQL++ code - but I don't think the MySQL++ is the
> controlling factor here since I've also experimented with mysql on the
> command line). This is on a RedHat 9 box: Dell 4600 with 12GB RAM,
> 2x2.8Ghz Xeon (hyperthreaded to look like 4 CPUs) and 10KRPM drives.
> It's a pretty speedy system. I'm using mysql-max-4.0.17-pc-linux-i686
> (binary build).
>
> With less common terms, I'm able to get a response in just a few
> seconds. The subtraction in the offset is not a deciding factor;
> performance is comparable without it. Clearly, the issue is the merge
> within the same table using WHERE criteria.
>
> The "order by" doesn't matter much; the "limit" speeds things up quite
> a bit since the response set is smaller. (I use these so that the
> program that uses the results only needs to look at the "best," and to
> limit the response set size). If I remove the "limit 1000", the query
> can take well over 1/2 hour (even without actually retrieving the
> rows).
>
> I have my.cnf tuned to use a lot of memory, so we're not doing a lot
> of swapping or thrashing. The total memory footprint for the mysql
> process maxes at 1.5GB or so. Most of the query wall-clock time seems
> to be disk access.
>
> myisam_sort_buffer_size = 640M
>
> Explain looks fine, in terms of using indices. For the query above:
>
> | table | type | possible_keys                                    | key          | key_len | ref      | rows   | Extra                       |
> +-------+------+--------------------------------------------------+--------------+---------+----------+--------+-----------------------------+
> | t0    | ref  | PRIMARY,termid_index,whichpara_index,docid_index | termid_index | 4       | const    | 405296 | Using where; Using filesort |
> | t1    | ref  | PRIMARY,termid_index,whichpara_index,docid_index | docid_index  | 4       | t0.docid | 402    | Using where                 |
>
> A little more detail follows. Thanks for any advice or ideas.
>   -- Greg
>
>
> mysql> show create table inv0web02;
> CREATE TABLE `inv0web02` (
>   `docid` int(10) unsigned NOT NULL default '0',
>   `offset` smallint(5) unsigned NOT NULL default '0',
>   `termid` int(10) unsigned NOT NULL default '0',
>   `taglistid` smallint(5) unsigned NOT NULL default '0',
>   `whichpara` tinyint(3) unsigned NOT NULL default '0',
>   `weight_in_subdoc` float unsigned NOT NULL default '0',
>   PRIMARY KEY (`docid`,`offset`),
>   KEY `termid_index` (`termid`),
>   KEY `whichpara_index` (`whichpara`),
>   KEY `taglistid_index` (`taglistid`),
>   KEY `weight_index` (`weight_in_subdoc`),
>   KEY `docid_index` (`docid`),
>   KEY `offset_index` (`offset`)
> ) TYPE=MyISAM |
>
> # /usr/local/mysql/bin/myisamchk -divv /data/mysql/irt/inv0web02
> MyISAM file:         /data/mysql/irt/inv0web02
> Record format:       Fixed length
> Character set:       latin1 (8)
> File-version:        1
> Creation time:       2004-01-11 23:25:42
> Recover time:        2004-01-12 3:31:35
> Status:              checked,analyzed
> Data records:             498093481   Deleted blocks:              0
> Datafile parts:           498093481   Deleted data:                0
> Datafile pointer (bytes): 4           Keyfile pointer (bytes):     4
> Datafile length:          8965682658  Keyfile length:    25881316352
> Max datafile length:      77309411326 Max keyfile length: 4398046510079
> Recordlength:             18
>
> table description:
> Key Start Len Index   Type            Rec/key  Root         Blocksize
> 1   2     4   unique  unsigned long   402      185730048    1024
>     6     2           unsigned short  1
> 2   8     4   multip. unsigned long   132      9169156096   1024
> 3   14    1   multip. binary          1945678  11691716608  1024
> 4   12    2   multip. unsigned short  7600     14722002944  1024
> 5   15    4   multip. float           192      18786515968  1024
> 6   2     4   multip. unsigned long   402      22851028992  1024
> 7   6     2   multip. unsigned short  28253    25881315328  1024
>
> Field Start Length Nullpos Nullbit Type
> 1     1     1
> 2     2     4
> 3     6     2
> 4     8     4
> 5     12    2
> 6     14    1
> 7     15    4
>
>
> Dr. Gregory B. Newby, Research Faculty, Arctic Region Supercomputing Center
> University of Alaska Fairbanks. PO Box 756020, Fairbanks, AK 99775
> e: newby AT arsc.edu  v: 907-474-7160  f: 907-474-5494  w: www.arsc.edu/~newby

--
Joe Shear <[EMAIL PROTECTED]>