Re: help on query/group by

SGreen Wed, 16 Mar 2005 10:58:53 -0800

I have a favorite technique for improving the results of queries , like 
this, which involve fairly large JOINed tables. It's a form of 
"divide-and-conquer" in that you pre-compute what you can then make the 
JOINS you need to finish up the results.


I agree that the "AND s2.syn LIKE '%'" in the ON clause of the second JOIN 
is worthless and can be eliminated however I recognize this as a common 
pattern for a two-term search and it may be harder to eliminate that 
clause than at first glance.

This is how I would speed things up, Your friend really has 3 types of 
searches possible:
        a) search by drug name only
        b) search by protein name only
        c) search by both drug name and protein name

Since the c) is the more complex situation, I will model it. It's almost 
trivial to clip out the unnecessary parts to make the other two queries. 
If I typed everything correctly, you should be able to cut and paste the 
whole thing into the MySQL client and have it execute.

####### begin##########
CREATE TEMPORARY TABLE tmpSynDrug (key nameID)
SELECT nameID, max(syn) as drugSyn
FROM synonyms
WHERE syn LIKE 'a%'
GROUP BY nameID;

CREATE TEMPORARY TABLE tmpSynProt (key nameID)
SELECT nameID, max(syn) as protSyn
FROM synonyms
WHERE syn LIKE 'a%'
GROUP BY nameID;

CREATE TEMPORARY TABLE tmpMatch (key sentID)
SELECT m.sentID, m.drugID, m.protID, tsd.drugSyn, tsp.protSyn
FROM matches m
INNER JOIN tmpSynDrugs tsd
        ON tsd.nameID = m.drugID
INNER JOIN tmpSynProt tsp
        ON tsp.nameID = m.protID;

# what we should have now is a nice small table that meets most of the 
# original query criteria. Now to summarize by publication by 
# joining through the sentence table

SELECT tm.drugID, tm.protID, tm.drugSyn, tm.protSyn, COUNT(DISTINCT 
s.pmid) as publications
FROM tmpMatch tm
INNER JOIN sentence s
        ON s.id = tm.sentID
GROUP BY 1,2,3,4 ;

# I used a shortcut in the GROUP BY, I referenced the columns 
# by their positions and not by their names

#Now that we have the data we wanted we can cleanup after ourselves:

DROP TABLE tmpMatch, tmpSynProt, tmpSynDrug;

######## end #############

By minimizing the number of records that needs to be JOINed at each stage 
of the query, we keep things moving along. This technique is very useful 
for queries whose JOIN products are somewhere in the hundreds of billions 
or records or more (which yours easily is).  If you didn't want the names 
to be representative, but listed,  you would change the first two queries 
to be like:

CREATE TEMPORARY TABLE tmpSynDrug (key nameID)
SELECT nameID, syn as drugSyn
FROM synonyms
WHERE syn LIKE 'a%';

If you didn't need names at all I would just say:

CREATE TEMPORARY TABLE tmpSynDrug (key nameID)
SELECT DISTINCT nameID
FROM synonyms
WHERE syn LIKE 'a%'

and modify the other queries to not look for the name columns.

HTH,

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine



 
====== Original   message 
======from "mel list_php" <[EMAIL PROTECTED]> 03/16/2005 12:45 PM
==============================================================
Hi,

A friend of mine asked me to have a look at one of his query, and I'm 
stuck....
Here was his query:
SELECT drugID, protID, COUNT(DISTINCT pmid),
       MAX(s1.syn) AS o1, MAX(s2.syn) AS o2
FROM matches
INNER JOIN synonyms AS s1 ON drugID=s1.nameID AND s1.syn LIKE 'a%'
INNER JOIN synonyms AS s2 ON protID=s2.nameID AND s2.syn LIKE '%'
INNER JOIN sentence ON sentID=id
GROUP BY drugID, protID ORDER BY o1, o2 LIMIT 601

and this is his goal:

"The idea is quite simple: The table called 'matches' contains triples

  drugID, protID, sentID

indicating a co-occurence of a drug and a protein in a sentence. The
user of course searches for either drug name or protein name or
both. In the above query, the user wants everything for all drugs
starting with 'a'.

The MAX() calls more or less arbitrarily choose one of the many names
associated with drugID as a representative. With the COUNT() I want to
find out how many different medline abstracts (not sentences) have a
hit."

The matches table is 1,247,508 rows, sentence is  817,255 rows and 
synonyms 
is 225,497 rows.

First I think using inner join in that case is not helpful, because it is 
making a whole cartesian product on the tables, whereas a left join would 
limit the number of rows.
The second line "INNER JOIN synonyms AS s2 ON protID=s2.nameID AND s2.syn 
LIKE '%'"  is useless I think, because it just retrieves the not null 
values 
for protID.

I also added indexes on the table (i'm not very familiar with indexes, so 
that is probably my problem)
- on matches: index on protID,drugID and sentID
- on sentence: index on id (primary key)
- on synonyms: index on nameID,syn

Here are the tables:
mysql> desc matches;
+--------+---------+------+-----+---------+-------+
| Field  | Type    | Null | Key | Default | Extra |
+--------+---------+------+-----+---------+-------+
| protID | text     | YES  | MUL  | NULL   |       |
| drugID | text    | YES  |         | NULL    |       |
| sentID | int(11) | YES  | MUL | NULL    |       |
+--------+---------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> desc sentence;
+-------+------------------+------+-----+---------+----------------+
| Field | Type                | Null | Key | Default | Extra          |
+-------+------------------+------+-----+---------+----------------+
| id    | int(10) unsigned |        | PRI | NULL    | auto_increment |
| text  | text                  | YES  |     | NULL     |                |
| pmid  | int(11)            | YES  |     | NULL      |                |
+-------+------------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

mysql> desc synonyms;
+--------+------+------+-----+---------+-------+
| Field     | Type | Null | Key | Default | Extra |
+--------+------+------+-----+---------+-------+
| nameID | text | YES  | MUL | NULL    |       |
| syn       | text | YES  |        | NULL    |       |
+--------+------+------+-----+---------+-------+
2 rows in set (0.00 sec)

I wanted to see how where used the indexes:
mysql> explain select * from matches left join synonyms on drugID=nameID;
+----------+------+---------------+------+---------+----------------+---------+-------+
| table        | type | possible_keys | key   | key_len   | ref  
     | rows      | Extra |
+----------+------+---------------+------+---------+----------------+---------+-------+
| matches   | ALL  | NULL              | NULL|    NULL | NULL  
| 1247508 |       |
| synonyms | ref   | c                     | c     |      23    | 
matches.drugID |       4      |       |
+----------+------+---------------+------+---------+----------------+---------+-------+
2 rows in set (0.00 sec)

mysql> explain select * from matches left join synonyms on drugID=nameID 
left join sentence on sentID=id;
+----------+--------+---------------+---------+---------+----------------+---------+-------+
| table       | type   | possible_keys | key           | key_len | ref  
             | rows       | Extra |
+----------+--------+---------------+---------+---------+----------------+---------+-------+
| matches  | ALL     | NULL           | NULL       |    NULL   | NULL  
        | 1247508 |       |
| synonyms | ref     | c                  | c             |      23     | 
matches.drugID |       4      |       |
| sentence  | eq_ref | PRIMARY     | PRIMARY |       4     | 
matches.sentID 
|       1      |       |
+----------+--------+---------------+---------+---------+----------------+---------+-------+
3 rows in set (0.00 sec)

If I do a "reverse" query of the first one, no index is used ( select * 
from 
synonyms left join matches on drugID=nameID;) what I don't understand: I 
thought syn is indexed, so it will quickly find the ones beginning by 'a%' 

and then do the join with matches but on a limited number of records only, 

so it will be much faster.

I finally manage to have something quite working: (but not the final 
result 
my friend is hoping!!)

select * from matches left join synonyms on drugID=nameID left join 
sentence 
on sentID=id limit 0,100;

returns me the result in 0.01sec . The limit here is compulsory or the 
query 
is once again very slow.

My problem now is then I add the count distinct and the group by it's very 

slow again...

I know that this mail is quite long , and I didn't manage to have a nice 
presentation of the tables, but if somebody could give me advice on 
this?And 
explain me which indexes I should use and why?

Thanks a lot,
Melanie

_________________________________________________________________
It's fast, it's easy and it's free. Get MSN Messenger today! 
http://www.msn.co.uk/messenger


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: help on query/group by

Reply via email to