Re: help on query/group by

mel list_php Fri, 18 Mar 2005 02:58:20 -0800

Hi Shawn,

Thank you very much, I'm impressed by the time you took to answer me, and the quality of the reply!!! I forwarded the answer to my friend. I'm wondering, I knew the mechanism of temporary tables, but as I've never used it I was trying the left join way. Here is a summary of my questions: - why using inner join here?is there any difference with using a left join?I thought using a left join would decrease the number of results. - do you know why without group by my query was running very fast and become so slow with the group by?when it does a group by it's scanning the whole table or an other reason? - I don't know if his version of mysql supports subqueries, but I was wondering if it is possible to replace the temporary tables by subqueries and keeping the same efficiency (my friend told me he would like to have only one sql query).

Once again thank you very much for your help, I will give temporary tables an other chance!!! Melanie

From: [EMAIL PROTECTED]
To: "mel list_php" <[EMAIL PROTECTED]>
CC: mysql@lists.mysql.com
Subject: Re: help on query/group by
Date: Wed, 16 Mar 2005 13:52:44 -0500

I have a favorite technique for improving the results of queries , like
this, which involve fairly large JOINed tables. It's a form of
"divide-and-conquer" in that you pre-compute what you can then make the
JOINS you need to finish up the results.

I agree that the "AND s2.syn LIKE '%'" in the ON clause of the second JOIN
is worthless and can be eliminated however I recognize this as a common
pattern for a two-term search and it may be harder to eliminate that
clause than at first glance.

This is how I would speed things up, Your friend really has 3 types of
searches possible:
        a) search by drug name only
        b) search by protein name only
        c) search by both drug name and protein name

Since the c) is the more complex situation, I will model it. It's almost
trivial to clip out the unnecessary parts to make the other two queries.
If I typed everything correctly, you should be able to cut and paste the
whole thing into the MySQL client and have it execute.

####### begin##########
CREATE TEMPORARY TABLE tmpSynDrug (key nameID)
SELECT nameID, max(syn) as drugSyn
FROM synonyms
WHERE syn LIKE 'a%'
GROUP BY nameID;

CREATE TEMPORARY TABLE tmpSynProt (key nameID)
SELECT nameID, max(syn) as protSyn
FROM synonyms
WHERE syn LIKE 'a%'
GROUP BY nameID;

CREATE TEMPORARY TABLE tmpMatch (key sentID)
SELECT m.sentID, m.drugID, m.protID, tsd.drugSyn, tsp.protSyn
FROM matches m
INNER JOIN tmpSynDrugs tsd
        ON tsd.nameID = m.drugID
INNER JOIN tmpSynProt tsp
        ON tsp.nameID = m.protID;

# what we should have now is a nice small table that meets most of the
# original query criteria. Now to summarize by publication by
# joining through the sentence table

SELECT tm.drugID, tm.protID, tm.drugSyn, tm.protSyn, COUNT(DISTINCT
s.pmid) as publications
FROM tmpMatch tm
INNER JOIN sentence s
        ON s.id = tm.sentID
GROUP BY 1,2,3,4 ;

# I used a shortcut in the GROUP BY, I referenced the columns
# by their positions and not by their names

#Now that we have the data we wanted we can cleanup after ourselves:

DROP TABLE tmpMatch, tmpSynProt, tmpSynDrug;

######## end #############

By minimizing the number of records that needs to be JOINed at each stage
of the query, we keep things moving along. This technique is very useful
for queries whose JOIN products are somewhere in the hundreds of billions
or records or more (which yours easily is).  If you didn't want the names
to be representative, but listed,  you would change the first two queries
to be like:

CREATE TEMPORARY TABLE tmpSynDrug (key nameID)
SELECT nameID, syn as drugSyn
FROM synonyms
WHERE syn LIKE 'a%';

If you didn't need names at all I would just say:

CREATE TEMPORARY TABLE tmpSynDrug (key nameID)
SELECT DISTINCT nameID
FROM synonyms
WHERE syn LIKE 'a%'

and modify the other queries to not look for the name columns.

HTH,

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine


====== Original   message
======from "mel list_php" <[EMAIL PROTECTED]> 03/16/2005 12:45 PM
==============================================================
Hi,

A friend of mine asked me to have a look at one of his query, and I'm
stuck....
Here was his query:
SELECT drugID, protID, COUNT(DISTINCT pmid),
       MAX(s1.syn) AS o1, MAX(s2.syn) AS o2
FROM matches
INNER JOIN synonyms AS s1 ON drugID=s1.nameID AND s1.syn LIKE 'a%'
INNER JOIN synonyms AS s2 ON protID=s2.nameID AND s2.syn LIKE '%'
INNER JOIN sentence ON sentID=id
GROUP BY drugID, protID ORDER BY o1, o2 LIMIT 601

and this is his goal:

"The idea is quite simple: The table called 'matches' contains triples

  drugID, protID, sentID

indicating a co-occurence of a drug and a protein in a sentence. The
user of course searches for either drug name or protein name or
both. In the above query, the user wants everything for all drugs
starting with 'a'.

The MAX() calls more or less arbitrarily choose one of the many names
associated with drugID as a representative. With the COUNT() I want to
find out how many different medline abstracts (not sentences) have a
hit."

The matches table is 1,247,508 rows, sentence is  817,255 rows and
synonyms
is 225,497 rows.

First I think using inner join in that case is not helpful, because it is
making a whole cartesian product on the tables, whereas a left join would
limit the number of rows.
The second line "INNER JOIN synonyms AS s2 ON protID=s2.nameID AND s2.syn
LIKE '%'"  is useless I think, because it just retrieves the not null
values
for protID.

I also added indexes on the table (i'm not very familiar with indexes, so
that is probably my problem)
- on matches: index on protID,drugID and sentID
- on sentence: index on id (primary key)
- on synonyms: index on nameID,syn

Here are the tables:
mysql> desc matches;
+--------+---------+------+-----+---------+-------+
| Field  | Type    | Null | Key | Default | Extra |
+--------+---------+------+-----+---------+-------+
| protID | text     | YES  | MUL  | NULL   |       |
| drugID | text    | YES  |         | NULL    |       |
| sentID | int(11) | YES  | MUL | NULL    |       |
+--------+---------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> desc sentence;
+-------+------------------+------+-----+---------+----------------+
| Field | Type                | Null | Key | Default | Extra          |
+-------+------------------+------+-----+---------+----------------+
| id    | int(10) unsigned |        | PRI | NULL    | auto_increment |
| text  | text                  | YES  |     | NULL     |                |
| pmid  | int(11)            | YES  |     | NULL      |                |
+-------+------------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

mysql> desc synonyms;
+--------+------+------+-----+---------+-------+
| Field     | Type | Null | Key | Default | Extra |
+--------+------+------+-----+---------+-------+
| nameID | text | YES  | MUL | NULL    |       |
| syn       | text | YES  |        | NULL    |       |
+--------+------+------+-----+---------+-------+
2 rows in set (0.00 sec)

I wanted to see how where used the indexes:
mysql> explain select * from matches left join synonyms on drugID=nameID;
+----------+------+---------------+------+---------+----------------+---------+-------+
| table        | type | possible_keys | key   | key_len   | ref
     | rows      | Extra |
+----------+------+---------------+------+---------+----------------+---------+-------+
| matches   | ALL  | NULL              | NULL|    NULL | NULL
| 1247508 |       |
| synonyms | ref   | c                     | c     |      23    |
matches.drugID |       4      |       |
+----------+------+---------------+------+---------+----------------+---------+-------+
2 rows in set (0.00 sec)

mysql> explain select * from matches left join synonyms on drugID=nameID
left join sentence on sentID=id;
+----------+--------+---------------+---------+---------+----------------+---------+-------+
| table       | type   | possible_keys | key           | key_len | ref
             | rows       | Extra |
+----------+--------+---------------+---------+---------+----------------+---------+-------+
| matches  | ALL     | NULL           | NULL       |    NULL   | NULL
        | 1247508 |       |
| synonyms | ref     | c                  | c             |      23     |
matches.drugID |       4      |       |
| sentence  | eq_ref | PRIMARY     | PRIMARY |       4     |
matches.sentID
|       1      |       |
+----------+--------+---------------+---------+---------+----------------+---------+-------+
3 rows in set (0.00 sec)

If I do a "reverse" query of the first one, no index is used ( select *
from
synonyms left join matches on drugID=nameID;) what I don't understand: I
thought syn is indexed, so it will quickly find the ones beginning by 'a%'

and then do the join with matches but on a limited number of records only,

so it will be much faster.

I finally manage to have something quite working: (but not the final
result
my friend is hoping!!)

select * from matches left join synonyms on drugID=nameID left join
sentence
on sentID=id limit 0,100;

returns me the result in 0.01sec . The limit here is compulsory or the
query
is once again very slow.

My problem now is then I add the count distinct and the group by it's very

slow again...

I know that this mail is quite long , and I didn't manage to have a nice
presentation of the tables, but if somebody could give me advice on
this?And
explain me which indexes I should use and why?

Thanks a lot,
Melanie

_________________________________________________________________
It's fast, it's easy and it's free. Get MSN Messenger today!
http://www.msn.co.uk/messenger


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

_________________________________________________________________ It's fast, it's easy and it's free. Get MSN Messenger today! http://www.msn.co.uk/messenger


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: help on query/group by

Reply via email to