My reply below: "mel list_php" <[EMAIL PROTECTED]> wrote on 03/21/2005 11:49:26 AM:
> Hi, > Sorry for the late reply, out for the week-end! > > Here is the information you asked for (I modified the columns' type as you > suggested) > mysql> show create table matches\G > *************************** 1. row *************************** > Table: matches > Create Table: CREATE TABLE `matches` ( > `protID` varchar(50) default NULL, > `drugID` int(11) default NULL, > `sentID` int(11) default NULL, > KEY `sentenceId` (`sentID`), > KEY `drugID` (`drugID`), > KEY `protID` (`protID`) > ) TYPE=MyISAM > 1 row in set (0.00 sec) > > mysql> show create table sentence\G > *************************** 1. row *************************** > Table: sentence > Create Table: CREATE TABLE `sentence` ( > `id` int(10) unsigned NOT NULL auto_increment, > `text` text, > `pmid` int(11) default NULL, > PRIMARY KEY (`id`) > ) TYPE=MyISAM > 1 row in set (0.00 sec) > > mysql> show create table synonyms\G > *************************** 1. row *************************** > Table: synonyms > Create Table: CREATE TABLE `synonyms` ( > `nameID` varchar(50) default NULL, > `syn` text, > UNIQUE KEY `c` (`nameID`(20),`syn`(20)) > ) TYPE=MyISAM > 1 row in set (0.01 sec) > > I think I used int as much as possible,here some data samples: > > INSERT INTO `matches` VALUES ('Q9UP51', 202531, 4); > INSERT INTO `matches` VALUES ('SVC2_HUMAN', 202037, 5); > INSERT INTO `matches` VALUES ('PF2R_SHEEP', 202096, 6); > INSERT INTO `matches` VALUES ('CAQS_RAT', 202037, 7); > > > INSERT INTO `sentence` VALUES (4, 'Of <NP>10 contraceptives </NP>tested , > <NP><span class=\'drug\'>Ortho-Gynol</span> </NP>was found to be the most > rapidly acting , followed by <NP>Cooper Creme </NP>, <a > href=\'http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[UNIPROT-acc: > (Q9UP51)]+-id+008+-view+UniprotDateView\'>Lactikol- > B</a>\n, <NP>Vagi-Serol </NP>, <NP>Marvosan </NP>, <NP>Clinicol </NP>, > <NP>Jelly-X </NP>, <NP>Bor-Oxyquin </NP>, <NP>Cellikol </NP>, and > <NP>Lanteen Blue Jelly </NP>in <NP>that order </NP>. ', 12305459); > > (the text is usually longer than 255 characters, so I think text is the only > choice, except longtext which can only be worst for indexation in my > opinion) > > INSERT INTO `synonyms` VALUES ('202037', 'testosterone'); > INSERT INTO `synonyms` VALUES ('Q91XV1', 'gonadotropin'); > INSERT INTO `synonyms` VALUES ('202037', 'estradiol'); > INSERT INTO `synonyms` VALUES ('Q91XV1', 'gonadotropin-suppressing'); > > I tried to run the query you gave me (with temporary tables) it is still too > long so I guess that my only solution now is to use indexes. > When modifying the columns' type from text to varchar, even if the text was > indexed as unique mysql complained about duplicates, and actually there > were few duplicates, I thought it was as you said because only the beginning > of the text field is indexed but I had 2 cases were the names were short > (less than 15 characters) and the same (even the spaces).Removing one and > the index on varchar was created. Any idea why? > > I will try to play with the indexes, if I understand well I'd better index > the three columns in once, because that will automatically index each of > them?or am I wrong? > > Anyway, thank you for all your advices, I'm really learning a lot of things > with that case! > Melanie > > > > > > >From: [EMAIL PROTECTED] > >To: "mel list_php" <[EMAIL PROTECTED]> > >CC: mysql@lists.mysql.com > >Subject: Re: help on query/group by > >Date: Fri, 18 Mar 2005 12:43:06 -0500 > > > >"mel list_php" <[EMAIL PROTECTED]> wrote on 03/18/2005 10:35:30 AM: > > > > > Hi again, > > > > > > Thanks for the explanation about the join and the group by. > > > > > > I wanted to test your query (almost a simple copy/paste :-)) ). > > > > > > The first 2 queries are ok, but the third one still is too long : > > > mysql> CREATE TEMPORARY TABLE tmpSynDrug( KEY ( nameID (20)) ) > >SELECT > > > nameID, max( syn ) AS drugSyn > > > -> FROM synonyms > > > -> WHERE syn > > > -> LIKE 'a%' > > > -> GROUP BY nameID; > > > Query OK, 9693 rows affected (1.07 sec) > > > Records: 9693 Duplicates: 0 Warnings: 0 > > > > > > mysql> > > > mysql> CREATE TEMPORARY TABLE tmpSynProt (key (nameID(20))) > > > -> SELECT nameID, max(syn) as protSyn > > > -> FROM synonyms > > > -> WHERE syn LIKE 'a%' > > > -> GROUP BY nameID; > > > Query OK, 9693 rows affected (1.03 sec) > > > Records: 9693 Duplicates: 0 Warnings: 0 > > > > > > mysql> > > > mysql> CREATE TEMPORARY TABLE tmpMatch( KEY ( sentID) ) SELECT > >m.sentID, > > > m.drugID, m.protID, tsd.drugSyn, tsp.protSyn > > > -> FROM matches m > > > -> INNER JOIN tmpSynDrug tsd ON tsd.nameID = m.drugID > > > -> INNER JOIN tmpSynProt tsp ON tsp.nameID = m.protID; > > > > > > > > > I've tried once this morning (GMT time), it ran for one hour nothing I > > > killed mysql (btw, how can I kill only one query when mysql "freezes"?I > > > aborted but then mysqladmin shutdown didn't work anymore..) > > > I thought it was maybe because I have other heavy stuff running, but I > >tried > > > a second time now and it's been running for 2 hours now, with almost > >nothing > > > else on the desktop. > > > So apparently the join between matches and the other tables is still too > > > > > heavy.... > > > Any idea? > ><snip> > > > >OK, I reviewed what you have posted so far and I found a performance > >killer. On the table "matches", the columns protID and drugID are > >declared as text. This is "bad" for searching as you can only index the > >first portion of any text column. Those columns should be declared as CHAR > > or VARCHAR or better yet, some integer value. If all 3 columns in the > >"matches" table are integers (INT or BIGINT, preferably UNSIGNED) then > >this becomes a fixed-width table and lookups become exceedingly fast. All > >indexes on those columns also become number-based and numeric comparisons > >occur *much* faster than string comparisons. I very rarely use non-numeric > >primary keys for just this reason. > > > >You mentioned there were indexes on the table and provided the output of > >DESC for the table but DESC does a very poor job of actually describing > >indexes. I prefer the output of SHOW CREATE TABLE xxxx\G as it gives me a > >complete table creation statement(Use /G and not ; to eliminate a lot of > >excess formatting in the output). Can you generate that for me, please? I > >practically guarantee that if we re-tool that table (including the > >indexes), our query times will drop like rocks. > > > >You can kill a single query through the commands SHOW [FULL] PROCESSLIST > >(to identify the # of the process you want to kill) and KILL # (using the > ># you just looked up). This usually drops the connection to the client > >running the query you killed, too (so be prepared to reconnect). > > > >http://dev.mysql.com/doc/mysql/en/show-processlist.html > >http://dev.mysql.com/doc/mysql/en/kill.html > > > >I just realized that we are only querying for the search conditions DRUG > >like 'a%' AND PROTIEN like 'a%'. If we wanted to do the OR of that (DRUG > >like ... OR Protien like ...) we could use a UNION query to generate > >tmpMatch > > > >CREATE TEMPORARY TABLE tmpMatch (key sentID) > >(SELECT m.sentID, m.drugID, m.protID, tsd.drugSyn, tsp.protSyn > >FROM matches m > >INNER JOIN tmpSynDrugs tsd > > ON tsd.nameID = m.drugID > >INNER JOIN tmpSynProt tsp > > ON tsp.nameID = m.protID) > >UNION > >(SELECT m.sentID, m.drugID, m.protID, tsd.drugSyn, tsp.protSyn > >FROM matches m > >INNER JOIN tmpSynDrugs tsd > > ON tsd.nameID = m.drugID > >INNER JOIN tmpSynProt tsp > > ON tsp.nameID = m.protID); > > > >Shawn Green > >Database Administrator > >Unimin Corporation - Spruce Pine > I would modify your design a bit more. I would setup a new Protein table and Drug table that looks like create table xxx ( id bigint unsigned not null auto_increment , shortname varchar(20) , fullname varchar(50) ) (Where xxx is either Protein or Drug)That way I could change matches to look like: CREATE TABLE matches ( protien_ID bigint unsigned not null , drug_id bigint unsigned not null , sent_id bigint unsigned not null , KEY (drug_id, prot_id, sent_id) , KEY (drug_id, sent_id) , KEY (prot_id, sent_id) , KEY (sent_id, drug_id, prot_id) ) I chose "bigint unsigned" so that you get the most possible records ( 18446744073709551615) in each table. Odds are that you will run out of disk space long before you reach that value. I created multiple indexes on the matches table so that you are indexed for retrieval speed, not for INSERT processing. The different combinations if KEYS represent many of the 3 types of queries that you will probably see most often: query on both drug and protein, query on either drug or protein, and query by sentence. I did not create a PRIMARY key as it could be possible that the same drug or protein is mentioned more than once in the same sentence so you may need to be able to store all individual references. If that is not the case, I would make the first KEY a PRIMARY KEY. I also noticed that you have a type mismatch between your sentence and your matches table. In sentence you declare id as INT UNSIGNED but in matches that column is only an INT. Even if you do nothing else, you need to give that column the same data type in both tables. I would also suggest you modify your synonyms table to generate a numeric PRIMARY KEY. Create Table: CREATE TABLE `synonyms` ( `id` bigint unsigned not null auto_increment `nameID` varchar(50) default NULL, `syn` text, PRIMARY KEY id UNIQUE KEY `c` (`nameID`(20),`syn`(20)) ) TYPE=MyISAM That way you can use a synonym.id (a number) wherever you need to reference a synonym and stop using the nameID. As I think more about it, using one table to represent BOTH drug synonyms and protein synonyms doesn't make good sense to me. I would probably split the synonyms into two separate tables, one for each data type (drug or protein). Sure some syn values will be the same between the two tables but it just makes better sense to me to do it as two tables, not one. I know I just mentioned a lot of changes. Do not think I am asking you do make them all. I am just trying to let you compare how someone else would approach this design to your own. If this gives you some ideas, that's great. I just don't want you to think that I am saying "your way is wrong, my way is right" as there are frequently many workable solutions to any problem and I KNOW I am not always right. We'll get it going... :-) Shawn Green Database Administrator Unimin Corporation - Spruce Pine