Hi Shawn,
Thank you very much, I'm impressed by the time you took to answer me, and the quality of the reply!!!
I forwarded the answer to my friend.
I'm wondering, I knew the mechanism of temporary tables, but as I've never used it I was trying the left join way.
Here is a summary of my questions:
- why using inner join here?is there any difference with using a left join?I thought using a left join would decrease the number of results.
- do you know why without group by my query was running very fast and become so slow with the group by?when it does a group by it's scanning the whole table or an other reason?
- I don't know if his version of mysql supports subqueries, but I was wondering if it is possible to replace the temporary tables by subqueries and keeping the same efficiency (my friend told me he would like to have only one sql query).
Once again thank you very much for your help, I will give temporary tables an other chance!!!
Melanie
From: [EMAIL PROTECTED] To: "mel list_php" <[EMAIL PROTECTED]> CC: mysql@lists.mysql.com Subject: Re: help on query/group by Date: Wed, 16 Mar 2005 13:52:44 -0500
I have a favorite technique for improving the results of queries , like this, which involve fairly large JOINed tables. It's a form of "divide-and-conquer" in that you pre-compute what you can then make the JOINS you need to finish up the results.
I agree that the "AND s2.syn LIKE '%'" in the ON clause of the second JOIN is worthless and can be eliminated however I recognize this as a common pattern for a two-term search and it may be harder to eliminate that clause than at first glance.
This is how I would speed things up, Your friend really has 3 types of searches possible: a) search by drug name only b) search by protein name only c) search by both drug name and protein name
Since the c) is the more complex situation, I will model it. It's almost trivial to clip out the unnecessary parts to make the other two queries. If I typed everything correctly, you should be able to cut and paste the whole thing into the MySQL client and have it execute.
####### begin########## CREATE TEMPORARY TABLE tmpSynDrug (key nameID) SELECT nameID, max(syn) as drugSyn FROM synonyms WHERE syn LIKE 'a%' GROUP BY nameID;
CREATE TEMPORARY TABLE tmpSynProt (key nameID) SELECT nameID, max(syn) as protSyn FROM synonyms WHERE syn LIKE 'a%' GROUP BY nameID;
CREATE TEMPORARY TABLE tmpMatch (key sentID) SELECT m.sentID, m.drugID, m.protID, tsd.drugSyn, tsp.protSyn FROM matches m INNER JOIN tmpSynDrugs tsd ON tsd.nameID = m.drugID INNER JOIN tmpSynProt tsp ON tsp.nameID = m.protID;
# what we should have now is a nice small table that meets most of the # original query criteria. Now to summarize by publication by # joining through the sentence table
SELECT tm.drugID, tm.protID, tm.drugSyn, tm.protSyn, COUNT(DISTINCT s.pmid) as publications FROM tmpMatch tm INNER JOIN sentence s ON s.id = tm.sentID GROUP BY 1,2,3,4 ;
# I used a shortcut in the GROUP BY, I referenced the columns # by their positions and not by their names
#Now that we have the data we wanted we can cleanup after ourselves:
DROP TABLE tmpMatch, tmpSynProt, tmpSynDrug;
######## end #############
By minimizing the number of records that needs to be JOINed at each stage of the query, we keep things moving along. This technique is very useful for queries whose JOIN products are somewhere in the hundreds of billions or records or more (which yours easily is). If you didn't want the names to be representative, but listed, you would change the first two queries to be like:
CREATE TEMPORARY TABLE tmpSynDrug (key nameID) SELECT nameID, syn as drugSyn FROM synonyms WHERE syn LIKE 'a%';
If you didn't need names at all I would just say:
CREATE TEMPORARY TABLE tmpSynDrug (key nameID) SELECT DISTINCT nameID FROM synonyms WHERE syn LIKE 'a%'
and modify the other queries to not look for the name columns.
HTH,
Shawn Green Database Administrator Unimin Corporation - Spruce Pine
====== Original message ======from "mel list_php" <[EMAIL PROTECTED]> 03/16/2005 12:45 PM ============================================================== Hi,
A friend of mine asked me to have a look at one of his query, and I'm stuck.... Here was his query: SELECT drugID, protID, COUNT(DISTINCT pmid), MAX(s1.syn) AS o1, MAX(s2.syn) AS o2 FROM matches INNER JOIN synonyms AS s1 ON drugID=s1.nameID AND s1.syn LIKE 'a%' INNER JOIN synonyms AS s2 ON protID=s2.nameID AND s2.syn LIKE '%' INNER JOIN sentence ON sentID=id GROUP BY drugID, protID ORDER BY o1, o2 LIMIT 601
and this is his goal:
"The idea is quite simple: The table called 'matches' contains triples
drugID, protID, sentID
indicating a co-occurence of a drug and a protein in a sentence. The user of course searches for either drug name or protein name or both. In the above query, the user wants everything for all drugs starting with 'a'.
The MAX() calls more or less arbitrarily choose one of the many names associated with drugID as a representative. With the COUNT() I want to find out how many different medline abstracts (not sentences) have a hit."
The matches table is 1,247,508 rows, sentence is 817,255 rows and synonyms is 225,497 rows.
First I think using inner join in that case is not helpful, because it is making a whole cartesian product on the tables, whereas a left join would limit the number of rows. The second line "INNER JOIN synonyms AS s2 ON protID=s2.nameID AND s2.syn LIKE '%'" is useless I think, because it just retrieves the not null values for protID.
I also added indexes on the table (i'm not very familiar with indexes, so that is probably my problem) - on matches: index on protID,drugID and sentID - on sentence: index on id (primary key) - on synonyms: index on nameID,syn
Here are the tables: mysql> desc matches; +--------+---------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +--------+---------+------+-----+---------+-------+ | protID | text | YES | MUL | NULL | | | drugID | text | YES | | NULL | | | sentID | int(11) | YES | MUL | NULL | | +--------+---------+------+-----+---------+-------+ 3 rows in set (0.00 sec)
mysql> desc sentence; +-------+------------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +-------+------------------+------+-----+---------+----------------+ | id | int(10) unsigned | | PRI | NULL | auto_increment | | text | text | YES | | NULL | | | pmid | int(11) | YES | | NULL | | +-------+------------------+------+-----+---------+----------------+ 3 rows in set (0.00 sec)
mysql> desc synonyms; +--------+------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +--------+------+------+-----+---------+-------+ | nameID | text | YES | MUL | NULL | | | syn | text | YES | | NULL | | +--------+------+------+-----+---------+-------+ 2 rows in set (0.00 sec)
I wanted to see how where used the indexes: mysql> explain select * from matches left join synonyms on drugID=nameID; +----------+------+---------------+------+---------+----------------+---------+-------+ | table | type | possible_keys | key | key_len | ref | rows | Extra | +----------+------+---------------+------+---------+----------------+---------+-------+ | matches | ALL | NULL | NULL| NULL | NULL | 1247508 | | | synonyms | ref | c | c | 23 | matches.drugID | 4 | | +----------+------+---------------+------+---------+----------------+---------+-------+ 2 rows in set (0.00 sec)
mysql> explain select * from matches left join synonyms on drugID=nameID left join sentence on sentID=id; +----------+--------+---------------+---------+---------+----------------+---------+-------+ | table | type | possible_keys | key | key_len | ref | rows | Extra | +----------+--------+---------------+---------+---------+----------------+---------+-------+ | matches | ALL | NULL | NULL | NULL | NULL | 1247508 | | | synonyms | ref | c | c | 23 | matches.drugID | 4 | | | sentence | eq_ref | PRIMARY | PRIMARY | 4 | matches.sentID | 1 | | +----------+--------+---------------+---------+---------+----------------+---------+-------+ 3 rows in set (0.00 sec)
If I do a "reverse" query of the first one, no index is used ( select * from synonyms left join matches on drugID=nameID;) what I don't understand: I thought syn is indexed, so it will quickly find the ones beginning by 'a%'
and then do the join with matches but on a limited number of records only,
so it will be much faster.
I finally manage to have something quite working: (but not the final result my friend is hoping!!)
select * from matches left join synonyms on drugID=nameID left join sentence on sentID=id limit 0,100;
returns me the result in 0.01sec . The limit here is compulsory or the query is once again very slow.
My problem now is then I add the count distinct and the group by it's very
slow again...
I know that this mail is quite long , and I didn't manage to have a nice presentation of the tables, but if somebody could give me advice on this?And explain me which indexes I should use and why?
Thanks a lot, Melanie
_________________________________________________________________ It's fast, it's easy and it's free. Get MSN Messenger today! http://www.msn.co.uk/messenger
-- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
_________________________________________________________________
It's fast, it's easy and it's free. Get MSN Messenger today! http://www.msn.co.uk/messenger
-- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]