I'm using MySQL for an information retrieval application in which word occurrences are indexed. Performance is not as good as I would expect: query time seems to grow nearly linearly with the number of rows involved. Any advice would be welcome; I'll lay out a lot of detail below.
The basic scenario:

- There are over 498M rows, which seems to qualify this as a rather large table.
- The table has 6 fixed-length fields.
- I have created an index on each field.
- I set MAX_ROWS to 550,000,000.
- I've tried both MyISAM and InnoDB tables, with very similar performance.
- Slower inserts are not really a problem, if I could get faster searching.
- I've run ANALYZE, OPTIMIZE, myisamchk, etc. (see below).
- I'm using a socket for local communication and MySQL++ for the application, but I've also confirmed the benchmark results from the mysql> command line.
- I'm not presenting a lot of variety in queries and response times below, but can send more if needed.
- Due to the patterns in the data, there are many, many duplicate values in the indexes. I suspect this might be important.
- This is not a results-transport issue. I'm able to retrieve about 1000 rows/second, which is OK. It's generating the result set, prior to transport, that is slow. Queries are definitely disk-bound, with 5-40% CPU utilization during a long-running query. (I did a little experimentation with a ramfs to confirm this; too bad I don't have 40GB of RAM.)

The application: I want to run queries that identify particular DOCUMENTS (via "docid") that contain particular TERMS (via "termid"). This is your garden-variety task for a search engine or other information retrieval system. I often add further criteria, such as adjacent terms (offset A - offset B = 1), terms in a particular tag (term A in <title> and term B in <title>), or proximity (paragraph A == paragraph B); a sketch of the tag-constrained flavor appears right after the concrete example below. Being able to do such powerful queries is the whole point of using MySQL, rather than trying to manage things myself (using BerkeleyDB or home-grown B-trees, as I've done in the past).

I built a quick 2,000-document database, then a 100,000-document one, then the full 1.2M-document database (the input is about 20GB of HTML). The challenge is that on the 1.2M-document dataset, with nearly half a billion rows (term occurrences), I just can't get a quick response for combinations of terms that occur in a lot of rows. Maybe the answer is that this is the best I can expect due to duplicate key values and other factors (that is, there are many terms for each docid, many many terms for each paragraph id, many many many terms for each offset, etc.). But I hope there might be other tuning parameters or tips that will speed things up.

If I look for terms that occur in only a few hundred rows, no problem. But for terms with thousands of rows, it takes a very long time (several minutes for the first 1000 rows, up to an hour for all rows) to get the query output, even when I use ORDER BY and LIMIT to keep the output set small.

Concrete example:

  select * from inv0web02 as t0, inv0web02 as t1
   where ((t0.termid=35) and (t1.termid=141)
     and (t0.docid=t1.docid)
     and (t0.whichpara=t1.whichpara)
     and ((t1.offset - t0.offset) = 1))
   order by t0.weight_in_subdoc limit 1000

This takes about 3 minutes, where termid 35 = "web" and termid 141 = "page" (very common terms). (The timing depends slightly on whether I use a Result or a ResUse in my MySQL++ code, but I don't think MySQL++ is the controlling factor here, since I've also experimented with mysql on the command line.)

This is on a Red Hat 9 box: a Dell 4600 with 12GB RAM, 2 x 2.8GHz Xeons (hyperthreaded to look like 4 CPUs), and 10K RPM drives. It's a pretty speedy system. I'm using mysql-max-4.0.17-pc-linux-i686 (binary build). With less common terms, I'm able to get a response in just a few seconds.
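To make the "terms in a particular tag" flavor concrete, that kind of query looks roughly like the sketch below. This is only an illustration: the termids are the same example values as above, and the taglistid value 3 is made up to stand in for whatever id corresponds to <title>.

  select t0.docid, t0.weight_in_subdoc
    from inv0web02 as t0, inv0web02 as t1
   where t0.termid = 35        -- "web", same example termid as above
     and t1.termid = 141       -- "page"
     and t0.docid = t1.docid
     and t0.taglistid = 3      -- made-up id standing in for <title>
     and t1.taglistid = 3      -- both terms constrained to the same tag
   order by t0.weight_in_subdoc
   limit 1000;

The timing follows the same general pattern: quick when the terms are rare, very slow when they are common.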
The subtraction on the offsets is not the deciding factor; performance is comparable without it. Clearly, the issue is the self-join (merging the table with itself) on the WHERE criteria. The "order by" doesn't matter much; the "limit" speeds things up quite a bit, since the response set is smaller. (I use these so that the program consuming the results only needs to look at the "best" rows, and to keep the response set small.) If I remove the "limit 1000", the query can take well over half an hour, even without actually retrieving the rows.

I have my.cnf tuned to use a lot of memory (for example, myisam_sort_buffer_size = 640M), so we're not doing a lot of swapping or thrashing; a sketch of the memory-related settings is in the P.S. at the end of this message. The total memory footprint of the mysqld process maxes out at about 1.5GB. Most of the query wall-clock time seems to be disk access.

EXPLAIN looks fine in terms of index usage. For the query above:

  +-------+------+--------------------------------------------------+--------------+---------+----------+--------+-----------------------------+
  | table | type | possible_keys                                    | key          | key_len | ref      | rows   | Extra                       |
  +-------+------+--------------------------------------------------+--------------+---------+----------+--------+-----------------------------+
  | t0    | ref  | PRIMARY,termid_index,whichpara_index,docid_index | termid_index | 4       | const    | 405296 | Using where; Using filesort |
  | t1    | ref  | PRIMARY,termid_index,whichpara_index,docid_index | docid_index  | 4       | t0.docid | 402    | Using where                 |
  +-------+------+--------------------------------------------------+--------------+---------+----------+--------+-----------------------------+

A little more detail follows. Thanks for any advice or ideas.

-- Greg

mysql> show create table inv0web02;

  CREATE TABLE `inv0web02` (
    `docid` int(10) unsigned NOT NULL default '0',
    `offset` smallint(5) unsigned NOT NULL default '0',
    `termid` int(10) unsigned NOT NULL default '0',
    `taglistid` smallint(5) unsigned NOT NULL default '0',
    `whichpara` tinyint(3) unsigned NOT NULL default '0',
    `weight_in_subdoc` float unsigned NOT NULL default '0',
    PRIMARY KEY (`docid`,`offset`),
    KEY `termid_index` (`termid`),
    KEY `whichpara_index` (`whichpara`),
    KEY `taglistid_index` (`taglistid`),
    KEY `weight_index` (`weight_in_subdoc`),
    KEY `docid_index` (`docid`),
    KEY `offset_index` (`offset`)
  ) TYPE=MyISAM

# /usr/local/mysql/bin/myisamchk -divv /data/mysql/irt/inv0web02

  MyISAM file:               /data/mysql/irt/inv0web02
  Record format:             Fixed length
  Character set:             latin1 (8)
  File-version:              1
  Creation time:             2004-01-11 23:25:42
  Recover time:              2004-01-12  3:31:35
  Status:                    checked,analyzed
  Data records:              498093481      Deleted blocks:           0
  Datafile parts:            498093481      Deleted data:             0
  Datafile pointer (bytes):  4              Keyfile pointer (bytes):  4
  Datafile length:           8965682658     Keyfile length:           25881316352
  Max datafile length:       77309411326    Max keyfile length:       4398046510079
  Recordlength:              18

  table description:
  Key Start Len Index   Type             Rec/key      Root         Blocksize
  1   2     4   unique  unsigned long    402          185730048    1024
      6     2           unsigned short   1
  2   8     4   multip. unsigned long    132          9169156096   1024
  3   14    1   multip. binary           1945678      11691716608  1024
  4   12    2   multip. unsigned short   7600         14722002944  1024
  5   15    4   multip. float            192          18786515968  1024
  6   2     4   multip. unsigned long    402          22851028992  1024
  7   6     2   multip. unsigned short   28253        25881315328  1024

  Field Start Length Nullpos Nullbit Type
  1     1     1
  2     2     4
  3     6     2
  4     8     4
  5     12    2
  6     14    1
  7     15    4

Dr. Gregory B. Newby, Research Faculty, Arctic Region Supercomputing Center
University of Alaska Fairbanks.  PO Box 756020, Fairbanks, AK 99775
e: newby AT arsc.edu  v: 907-474-7160  f: 907-474-5494  w: www.arsc.edu/~newby
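P.S. For reference, the memory-related part of my.cnf is along the lines of the sketch below. Only the myisam_sort_buffer_size value quoted above is exact; the other settings are standard 4.0 variables, but the numbers shown are ballpark figures meant to indicate the scale I'm running at, not the exact values from my config.

  [mysqld]
  # exact, as quoted in the message above
  myisam_sort_buffer_size = 640M
  # the values below are indicative of the scale only, not my exact settings
  key_buffer_size         = 1024M
  sort_buffer_size        = 32M
  read_buffer_size        = 16M
  table_cache             = 512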