I'm using MySQL for an information retrieval application where word
occurrences are indexed.  Performance is not as good as I would
expect: query time seems to grow nearly linearly with the number of
rows.  Any advice would be welcome; I'll lay out a lot of detail
below.

The basic scenario is:
- there are over 498M rows, which seems to qualify this as a
  rather large table
- the table has 6 fixed-length fields
- I have created indexes on each field
- I set MAX_ROWS to 550,000,000 on the table
- I've tried this with both MyISAM and INNODB tables, with very
  similar performance
- Slower inserts would not really be a problem if it bought me faster
  searching
- I've run ANALYZE, OPTIMIZE, myisamchk etc. (see below)
- I'm using a local socket for communication and MySQL++ in the
  application, but I've also confirmed the benchmark results with the
  mysql command-line client
- I'm not presenting a lot of variety in queries and response times below,
  but can send more if needed.  
- Due to the patterns in the data, there are many many duplicate
  values in the indexes.  I suspect this might be important.
- This is not a results transport issue.  I'm able to retrieve
  about 1000 rows/second, which is OK.  It's generating the result set,
  prior to results transport, which is slow.
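
For reference, the ANALYZE/OPTIMIZE runs mentioned above were of this
general form (a sketch, not the exact invocations; detailed myisamchk
output is reproduced at the end):

        analyze table inv0web02;
        optimize table inv0web02;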

Queries are definitely disk-bound, with only 5-40% CPU utilization
during a long-running query.  (I did a little experimentation with a
ramfs to confirm this...too bad I don't have 40GB of RAM.)

The application: I want to run queries that will identify particular
DOCUMENTS (via "docid") that have particular TERMS (via "termid").
This is your garden-variety task for a search engine or other
information retrieval system.  I often submit additional criteria,
such as adjacent terms (offset of term B - offset of term A = 1),
terms in a particular tag (term A in <title> and term B in <title>),
or proximity (paragraph of A == paragraph of B).  Being able to do
such powerful queries is the whole point of using MySQL, rather than
trying to manage things myself (using BerkeleyDB or home-grown
B-trees, as I've done in the past).
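
To make that concrete, here is roughly what such a query looks like
against the table shown at the end (a sketch: the termids are the ones
from the timing example below, and the assumption that a shared
taglistid captures "both in <title>" is illustrative):

        select t0.docid
        from inv0web02 as t0, inv0web02 as t1
        where t0.termid = 35                 -- term A
          and t1.termid = 141                -- term B
          and t0.docid = t1.docid            -- same document
          and t0.whichpara = t1.whichpara    -- same paragraph (proximity)
          and (t1.offset - t0.offset) = 1    -- B right after A (adjacency)
          and t0.taglistid = t1.taglistid    -- same tag, e.g. both in <title>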

I have made a quickie 2000-document database, then a larger
100,000-document one, then the full 1.2M-document database (the input
is about 20GB of HTML).  The challenge is that on the 1.2M-document
dataset, with nearly half a billion rows (term occurrences), I just
can't seem to get a quick response for combinations of terms that
occur in a lot of rows.

Maybe the answer is that this is the best I can expect due to
duplicate key values and other factors (that is, there are many terms
for each docid, many many terms for each paragraph id, many many many
terms for each offset, and so on).  But I hope there might be other
tuning parameters or tips that will speed things up.
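
The skew is easy to see in the index statistics (the Rec/key column of
the myisamchk output at the end tells the same story):

        -- Cardinality shows how few distinct values some indexes have
        show index from inv0web02;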

If I look for terms that appear in only a few hundred rows, no
problem.  But for terms with thousands of rows, it takes a very long
time (several minutes for the first 1000 rows, up to an hour for all
rows) to get the query output, even when I use "order by" and "limit"
to keep the output set small.
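
For reference, the row count for a given term can be checked directly
(termid 35 is one of the common terms used below; 1234567 is a made-up
id standing in for a rare term):

        -- common term (hundreds of thousands of rows):
        select count(*) from inv0web02 where termid = 35;
        -- made-up id for a rare term (a few hundred rows):
        select count(*) from inv0web02 where termid = 1234567;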

Concrete example:

        select * from inv0web02 as t0, inv0web02 as t1
        where t0.termid = 35
          and t1.termid = 141
          and t0.docid = t1.docid
          and t0.whichpara = t1.whichpara
          and (t1.offset - t0.offset) = 1
        order by t0.weight_in_subdoc limit 1000

This takes about 3 minutes, where termid 35 = "web" and termid 141 =
"page" (very common terms).  (Timing depends slightly on whether I
employ a Result or a ResUse in my MySQL++ code, but I don't think
MySQL++ is the controlling factor here, since I've also experimented
with mysql on the command line.)  This is on a Red Hat 9 box: a Dell
4600 with 12GB RAM, 2x 2.8GHz Xeons (hyperthreaded to look like 4
CPUs) and 10K RPM drives.  It's a pretty speedy system.  I'm using
mysql-max-4.0.17-pc-linux-i686 (binary build).

With less common terms, I'm able to get a response in just a few
seconds.  The subtraction in the offset criterion is not a deciding
factor; performance is comparable without it.  Clearly, the issue is
the self-join on the same table using the WHERE criteria.
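
For comparison, the variant without the offset arithmetic is
essentially this (same common termids), and its timing is comparable:

        select * from inv0web02 as t0, inv0web02 as t1
        where t0.termid = 35
          and t1.termid = 141
          and t0.docid = t1.docid
          and t0.whichpara = t1.whichpara
        order by t0.weight_in_subdoc limit 1000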

The "order by" doesn't matter much; the "limit" speeds things up quite
a bit since the response set is smaller.  (I use these so that the
program that uses the results only needs to look at the "best," and to
limit the response set size).  If I remove the "limit 1000", the query
can take well over 1/2 hour (even without actually retrieving the
rows).

I have my.cnf tuned to use a lot of memory, so we're not doing a lot
of swapping or thrashing.  The total memory footprint of the MySQL
server process maxes out at 1.5GB or so.  Most of the query wall-clock
time seems to be disk access.

        myisam_sort_buffer_size = 640M
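
Only that one line is quoted verbatim; the other memory-related
settings in my.cnf are along these lines (values here are
illustrative, not the exact ones in the file):

        [mysqld]
        # illustrative values -- the real file is tuned in this spirit
        key_buffer_size         = 1024M   # MyISAM index block cache
        sort_buffer_size        = 64M     # per-connection sort buffer
        read_buffer_size        = 16M
        read_rnd_buffer_size    = 16M
        myisam_sort_buffer_size = 640M    # as quoted above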

EXPLAIN looks fine in terms of using indexes.  For the query above:

+-------+------+--------------------------------------------------+--------------+---------+----------+--------+-----------------------------+
| table | type | possible_keys                                    | key          | key_len | ref      | rows   | Extra                       |
+-------+------+--------------------------------------------------+--------------+---------+----------+--------+-----------------------------+
| t0    | ref  | PRIMARY,termid_index,whichpara_index,docid_index | termid_index |       4 | const    | 405296 | Using where; Using filesort |
| t1    | ref  | PRIMARY,termid_index,whichpara_index,docid_index | docid_index  |       4 | t0.docid |    402 | Using where                 |
+-------+------+--------------------------------------------------+--------------+---------+----------+--------+-----------------------------+

A little more detail follows.  Thanks for any advice or ideas.
  -- Greg


mysql> show create table inv0web02;
 CREATE TABLE `inv0web02` (
  `docid` int(10) unsigned NOT NULL default '0',
  `offset` smallint(5) unsigned NOT NULL default '0',
  `termid` int(10) unsigned NOT NULL default '0',
  `taglistid` smallint(5) unsigned NOT NULL default '0',
  `whichpara` tinyint(3) unsigned NOT NULL default '0',
  `weight_in_subdoc` float unsigned NOT NULL default '0',
  PRIMARY KEY  (`docid`,`offset`),
  KEY `termid_index` (`termid`),
  KEY `whichpara_index` (`whichpara`),
  KEY `taglistid_index` (`taglistid`),
  KEY `weight_index` (`weight_in_subdoc`),
  KEY `docid_index` (`docid`),
  KEY `offset_index` (`offset`)
) TYPE=MyISAM |

# /usr/local/mysql/bin/myisamchk -divv /data/mysql/irt/inv0web02
MyISAM file:         /data/mysql/irt/inv0web02
Record format:       Fixed length
Character set:       latin1 (8)
File-version:        1
Creation time:       2004-01-11 23:25:42
Recover time:        2004-01-12  3:31:35
Status:              checked,analyzed
Data records:            498093481  Deleted blocks:                 0
Datafile parts:          498093481  Deleted data:                   0
Datafile pointer (bytes):        4  Keyfile pointer (bytes):        4
Datafile length:        8965682658  Keyfile length:       25881316352
Max datafile length:   77309411326  Max keyfile length: 4398046510079
Recordlength:                   18

table description:
Key Start Len Index   Type                     Rec/key         Root  Blocksize
1   2     4   unique  unsigned long                402    185730048       1024
    6     2           unsigned short                 1
2   8     4   multip. unsigned long                132   9169156096       1024
3   14    1   multip. binary                   1945678  11691716608       1024
4   12    2   multip. unsigned short              7600  14722002944       1024
5   15    4   multip. float                        192  18786515968       1024
6   2     4   multip. unsigned long                402  22851028992       1024
7   6     2   multip. unsigned short             28253  25881315328       1024

Field Start Length Nullpos Nullbit Type
1     1     1                                                         
2     2     4                                                         
3     6     2                                                         
4     8     4                                                         
5     12    2                                                         
6     14    1                                                         
7     15    4                                                      


Dr. Gregory B. Newby, Research Faculty, Arctic Region Supercomputing Center
University of Alaska Fairbanks.  PO Box 756020, Fairbanks, AK 99775
e: newby AT arsc.edu v: 907-474-7160 f: 907-474-5494 w: www.arsc.edu/~newby
