It looks like a multiple-column index would help more than a set of
single-column indexes, at least for the example query.  A composite
index on (termid, docid, whichpara, offset) should serve the termid
lookup and the join conditions better than the index on termid alone.
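
Something like this, for instance (untested, and "termid_composite" is
just a placeholder name; building it on a 498M-row table will take a
while):

    ALTER TABLE inv0web02
        ADD INDEX termid_composite (termid, docid, whichpara, offset);

That way the termid lookup and the docid/whichpara join conditions can
all be resolved within a single index, instead of bouncing between
separate ones.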

joe

On Thu, 2004-01-15 at 10:53, Gregory Newby wrote:
> I'm using MySQL for an information retrieval application where word
> occurrences are indexed.  Performance is not as good as I would
> expect: query time appears to grow nearly linearly with the number of
> matching rows.  Any advice would be welcome; I'll lay out a lot of
> detail.
> 
> The basic scenario:
> - there are over 498M rows.  This seems to qualify it as a
>   rather large table
> - the table has 6 fixed-length fields
> - I have created indexes on each field
> - I set MAX_ROWS to 550,000,000
> - I've tried this with both MyISAM and INNODB tables, with very
>   similar performance
> - Slower inserts are not really a problem if that buys faster searching
> - I've run ANALYZE, OPTIMIZE, myisamchk etc. (see below)
> - Using a socket for local communication; MySQL++ for apps, but I've
>   also confirmed benchmarking results using the mysql> command line 
> - I'm not presenting a lot of variety in queries and response times below,
>   but can send more if needed.  
> - Due to the patterns in the data, there are many many duplicate
>   values in the indexes.  I suspect this might be important (see
>   the SHOW INDEX note just after this list).
> - This is not a results transport issue.  I'm able to retrieve
>   about 1000 rows/second, which is OK.  It's generating the result set,
>   prior to results transport, which is slow.
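
(On the duplicate-values point: you can check how bad the duplication
is with SHOW INDEX, which reports the optimizer's cardinality estimate
for each index:

    SHOW INDEX FROM inv0web02;

A Cardinality value far below the row count means lots of duplicates
for that key.)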
> 
> Queries are definitely disk-bound, with 5-40% of CPU utilization
> during a long-running query.  (I did a little experimentation with a
> ramfs to confirm this...too bad I don't have 40GB of RAM).
> 
> The application: I want to run queries that will identify particular
> DOCUMENTS (via "docid") that have particular TERMS (via "termid").
> This is your garden variety task for a search engine or other
> information retrieval system.  I often submit additional criteria,
> such as adjacent terms (offset A - offset B=1) or terms in a
> particular tag (term A in <title> and term B in <title>) or proximity
> (paragraph A == paragraph B).  Being able to do such powerful queries
> is the whole point of using MySQL, rather than trying to manage things
> myself (using BerkeleyDB or home-grown B-trees, as I've done in the
> past).
> 
> I built a quick 2000-document database, then a 100,000-document one,
> then the full 1.2M-document database (input is about 20GB of HTML).
> The challenge is that on the 1.2M document dataset with nearly 1/2 a
> billion rows (term occurrences), I just can't seem to get a quick
> response for combinations with terms occurring in a lot of rows.
> 
> Maybe the answer is that this is the best I can expect due to
> duplicate key values and other factors (that is, there are many terms
> with each docid....many many terms with each paragraph id....many many
> many terms with each offset, etc.).  But I hope there might be other
> tuning parameters or tips that will speed things up.
> 
> If I look for terms in only a few hundred rows, no problem.  But for
> terms with thousands of rows, it takes a very long time (several
> minutes for the first 1000 rows to an hour for all rows) to get the
> query output, even when I order by and limit to get a smaller output
> set.
> 
> Concrete example:
> 
>       select * from inv0web02 as t0,inv0web02 as t1 where ((t0.termid=35)
>       and (t1.termid=141) and (t0.docid=t1.docid) and
>       (t0.whichpara=t1.whichpara) and ( (t1.offset - t0.offset) = 1)) order
>       by t0.weight_in_subdoc limit 1000
> 
> This takes about 3 minutes where term35="web" and term141="page" (very
> common terms).  (Timing depends slightly on whether I employ a Result
> or ResUse in my MySQL++ code - but I don't think the MySQL++ is the
> controlling factor here since I've also experimented with mysql on the
> command line).  This is on a Red Hat 9 box: a Dell 4600 with 12GB RAM,
> 2x2.8GHz Xeons (hyperthreaded to look like 4 CPUs), and 10K RPM drives.
> It's a pretty speedy system.  I'm using mysql-max-4.0.17-pc-linux-i686
> (binary build).
> 
> With less common terms, I'm able to get a response in just a few
> seconds.  The subtraction in the offset is not a deciding factor;
> performance is comparable without it.  Clearly, the issue is the
> self-join on the same table driven by the WHERE criteria.
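
With a composite (termid, docid, whichpara, offset) index like the one
suggested above, it may also help to rewrite the offset test so the
looked-up column stands alone: t1.offset = t0.offset + 1 lets all four
key parts be used for the t1 lookup, while (t1.offset - t0.offset) = 1
wraps t1.offset in an expression and can't use that key part at all.
An untested sketch (STRAIGHT_JOIN just pins the join order; a plain
join may well pick the same plan):

    SELECT t0.*
    FROM inv0web02 AS t0 STRAIGHT_JOIN inv0web02 AS t1
    WHERE t0.termid = 35
      AND t1.termid = 141
      AND t1.docid = t0.docid
      AND t1.whichpara = t0.whichpara
      AND t1.offset = t0.offset + 1
    ORDER BY t0.weight_in_subdoc
    LIMIT 1000;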
> 
> The "order by" doesn't matter much; the "limit" speeds things up quite
> a bit since the response set is smaller.  (I use these so that the
> program that uses the results only needs to look at the "best," and to
> limit the response set size).  If I remove the "limit 1000", the query
> can take well over 1/2 hour (even without actually retrieving the
> rows).
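
One more long shot for the ORDER BY: if t0 is read through an index
that leads with (termid, weight_in_subdoc), the rows for termid=35
come back already sorted by weight, the filesort disappears, and
LIMIT 1000 can stop early.  Untested, and only worthwhile if the
optimizer actually picks it:

    ALTER TABLE inv0web02
        ADD INDEX termid_weight (termid, weight_in_subdoc);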
> 
> I have my.cnf tuned to use a lot of memory, so we're not doing a lot
> of swapping or thrashing.  The total memory footprint for the mysql
> process maxes at 1.5GB or so.  Most of the query wall-clock time seems
> to be disk access.
> 
>       myisam_sort_buffer_size = 640M
> 
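
Note that myisam_sort_buffer_size only matters for index builds
(REPAIR TABLE, ALTER TABLE, and the like), not for query execution.
For MyISAM reads, the setting that counts is the key buffer, which
caches index blocks.  Something like this, sized to taste (your 32-bit
binary caps the process well below the 12GB of RAM, and the ~24GB key
file won't fit anyway, but the hot blocks might):

    [mysqld]
    # index block cache for MyISAM
    key_buffer_size = 1536M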
> EXPLAIN looks fine, in terms of using indexes.  For the query above:
> 
> | table | type | possible_keys                                    | key          | key_len | ref      | rows   | Extra                       |
> +-------+------+--------------------------------------------------+--------------+---------+----------+--------+-----------------------------+
> | t0    | ref  | PRIMARY,termid_index,whichpara_index,docid_index | termid_index |       4 | const    | 405296 | Using where; Using filesort |
> | t1    | ref  | PRIMARY,termid_index,whichpara_index,docid_index | docid_index  |       4 | t0.docid |    402 | Using where                 |
> 
> A little more detail follows.  Thanks for any advice or ideas.
>   -- Greg
> 
> 
> mysql> show create table inv0web02;
>  CREATE TABLE `inv0web02` (
>   `docid` int(10) unsigned NOT NULL default '0',
>   `offset` smallint(5) unsigned NOT NULL default '0',
>   `termid` int(10) unsigned NOT NULL default '0',
>   `taglistid` smallint(5) unsigned NOT NULL default '0',
>   `whichpara` tinyint(3) unsigned NOT NULL default '0',
>   `weight_in_subdoc` float unsigned NOT NULL default '0',
>   PRIMARY KEY  (`docid`,`offset`),
>   KEY `termid_index` (`termid`),
>   KEY `whichpara_index` (`whichpara`),
>   KEY `taglistid_index` (`taglistid`),
>   KEY `weight_index` (`weight_in_subdoc`),
>   KEY `docid_index` (`docid`),
>   KEY `offset_index` (`offset`)
> ) TYPE=MyISAM |
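
One quick win in that schema: docid_index duplicates the leading
column of the primary key (docid, offset), so docid lookups can
already use the primary key.  Dropping it saves index space and
insert time:

    ALTER TABLE inv0web02 DROP INDEX docid_index;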
> 
> # /usr/local/mysql/bin/myisamchk -divv /data/mysql/irt/inv0web02
> MyISAM file:         /data/mysql/irt/inv0web02
> Record format:       Fixed length
> Character set:       latin1 (8)
> File-version:        1
> Creation time:       2004-01-11 23:25:42
> Recover time:        2004-01-12  3:31:35
> Status:              checked,analyzed
> Data records:            498093481  Deleted blocks:                 0
> Datafile parts:          498093481  Deleted data:                   0
> Datafile pointer (bytes):        4  Keyfile pointer (bytes):        4
> Datafile length:        8965682658  Keyfile length:       25881316352
> Max datafile length:   77309411326  Max keyfile length: 4398046510079
> Recordlength:                   18
> 
> table description:
> Key Start Len Index   Type                     Rec/key         Root  Blocksize
> 1   2     4   unique  unsigned long                402    185730048       1024
>     6     2           unsigned short                 1
> 2   8     4   multip. unsigned long                132   9169156096       1024
> 3   14    1   multip. binary                   1945678  11691716608       1024
> 4   12    2   multip. unsigned short              7600  14722002944       1024
> 5   15    4   multip. float                        192  18786515968       1024
> 6   2     4   multip. unsigned long                402  22851028992       1024
> 7   6     2   multip. unsigned short             28253  25881315328       1024
> 
> Field Start Length Nullpos Nullbit Type
> 1     1     1                                                         
> 2     2     4                                                         
> 3     6     2                                                         
> 4     8     4                                                         
> 5     12    2                                                         
> 6     14    1                                                         
> 7     15    4                                                      
> 
> 
> Dr. Gregory B. Newby, Research Faculty, Arctic Region Supercomputing Center
> University of Alaska Fairbanks.  PO Box 756020, Fairbanks, AK 99775
> e: newby AT arsc.edu v: 907-474-7160 f: 907-474-5494 w: www.arsc.edu/~newby
-- 
Joe Shear <[EMAIL PROTECTED]>

