Re: help on query/group by

SGreen Mon, 21 Mar 2005 09:46:18 -0800

My reply below:

"mel list_php" <[EMAIL PROTECTED]> wrote on 03/21/2005 11:49:26 AM:


> Hi,
> Sorry for the late reply, out for the week-end!
> 
> Here is the information you asked for (I modified the columns' type as 
you 
> suggested)
> mysql> show create table matches\G
> *************************** 1. row ***************************
>        Table: matches
> Create Table: CREATE TABLE `matches` (
>   `protID` varchar(50) default NULL,
>   `drugID` int(11) default NULL,
>   `sentID` int(11) default NULL,
>   KEY `sentenceId` (`sentID`),
>   KEY `drugID` (`drugID`),
>   KEY `protID` (`protID`)
> ) TYPE=MyISAM
> 1 row in set (0.00 sec)
> 
> mysql> show create table sentence\G
> *************************** 1. row ***************************
>        Table: sentence
> Create Table: CREATE TABLE `sentence` (
>   `id` int(10) unsigned NOT NULL auto_increment,
>   `text` text,
>   `pmid` int(11) default NULL,
>   PRIMARY KEY  (`id`)
> ) TYPE=MyISAM
> 1 row in set (0.00 sec)
> 
> mysql> show create table synonyms\G
> *************************** 1. row ***************************
>        Table: synonyms
> Create Table: CREATE TABLE `synonyms` (
>   `nameID` varchar(50) default NULL,
>   `syn` text,
>   UNIQUE KEY `c` (`nameID`(20),`syn`(20))
> ) TYPE=MyISAM
> 1 row in set (0.01 sec)
> 
> I think I used int as much as possible,here some data samples:
> 
> INSERT INTO `matches` VALUES ('Q9UP51', 202531, 4);
> INSERT INTO `matches` VALUES ('SVC2_HUMAN', 202037, 5);
> INSERT INTO `matches` VALUES ('PF2R_SHEEP', 202096, 6);
> INSERT INTO `matches` VALUES ('CAQS_RAT', 202037, 7);
> 
> 
> INSERT INTO `sentence` VALUES (4, 'Of <NP>10 contraceptives </NP>tested 
, 
> <NP><span class=\'drug\'>Ortho-Gynol</span> </NP>was found to be the 
most 
> rapidly acting , followed by <NP>Cooper Creme </NP>, <a 
> href=\'http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[UNIPROT-acc:
> (Q9UP51)]+-id+008+-view+UniprotDateView\'>Lactikol- 
> B</a>\n, <NP>Vagi-Serol </NP>, <NP>Marvosan </NP>, <NP>Clinicol </NP>, 
> <NP>Jelly-X </NP>, <NP>Bor-Oxyquin </NP>, <NP>Cellikol </NP>, and 
> <NP>Lanteen Blue Jelly </NP>in <NP>that order </NP>. ', 12305459);
> 
> (the text is usually longer than 255 characters, so I think text is the 
only 
> choice, except longtext which can only be worst for indexation in my 
> opinion)
> 
> INSERT INTO `synonyms` VALUES ('202037', 'testosterone');
> INSERT INTO `synonyms` VALUES ('Q91XV1', 'gonadotropin');
> INSERT INTO `synonyms` VALUES ('202037', 'estradiol');
> INSERT INTO `synonyms` VALUES ('Q91XV1', 'gonadotropin-suppressing');
> 
> I tried to run the query you gave me (with temporary tables) it is still 
too 
> long so I guess that my only solution now is to use indexes.
> When modifying the columns' type from text to varchar, even if the text 
was 
> indexed as unique mysql  complained about duplicates, and actually there 

> were few duplicates, I thought it was as you said because only the 
beginning 
> of the text field is indexed but I had 2 cases were the names were short 

> (less than 15 characters) and the same (even the spaces).Removing one 
and 
> the index on varchar was created. Any idea why?
> 
> I will try to play with the indexes, if I understand well I'd better 
index 
> the three columns in once, because that will automatically index each of 

> them?or am I wrong?
> 
> Anyway, thank you for all your advices, I'm really learning a lot of 
things 
> with that case!
> Melanie
> 
> 
> 
> 
> 
> >From: [EMAIL PROTECTED]
> >To: "mel list_php" <[EMAIL PROTECTED]>
> >CC: mysql@lists.mysql.com
> >Subject: Re: help on query/group by
> >Date: Fri, 18 Mar 2005 12:43:06 -0500
> >
> >"mel list_php" <[EMAIL PROTECTED]> wrote on 03/18/2005 10:35:30 
AM:
> >
> > > Hi again,
> > >
> > > Thanks for the explanation about the join and the group by.
> > >
> > > I wanted to test your query (almost a simple copy/paste :-)) ).
> > >
> > > The first 2 queries are ok, but the third one still is too long :
> > > mysql> CREATE  TEMPORARY  TABLE tmpSynDrug(  KEY ( nameID (20))  )
> >SELECT
> > > nameID, max( syn )  AS drugSyn
> > >     -> FROM synonyms
> > >     -> WHERE syn
> > >     -> LIKE  'a%'
> > >     -> GROUP  BY nameID;
> > > Query OK, 9693 rows affected (1.07 sec)
> > > Records: 9693  Duplicates: 0  Warnings: 0
> > >
> > > mysql>
> > > mysql> CREATE TEMPORARY TABLE tmpSynProt (key (nameID(20)))
> > >     -> SELECT nameID, max(syn) as protSyn
> > >     -> FROM synonyms
> > >     -> WHERE syn LIKE 'a%'
> > >     -> GROUP BY nameID;
> > > Query OK, 9693 rows affected (1.03 sec)
> > > Records: 9693  Duplicates: 0  Warnings: 0
> > >
> > > mysql>
> > > mysql> CREATE  TEMPORARY  TABLE tmpMatch(  KEY ( sentID)  ) SELECT
> >m.sentID,
> > > m.drugID, m.protID, tsd.drugSyn, tsp.protSyn
> > >     -> FROM matches m
> > >     -> INNER  JOIN tmpSynDrug tsd ON tsd.nameID = m.drugID
> > >     -> INNER  JOIN tmpSynProt tsp ON tsp.nameID = m.protID;
> > >
> > >
> > > I've tried once this morning (GMT time), it ran for one hour nothing 
I
> > > killed mysql (btw, how can I kill only one query when mysql 
"freezes"?I
> > > aborted but then mysqladmin shutdown didn't work anymore..)
> > > I thought it was maybe because I have other heavy stuff running, but 
I
> >tried
> > > a second time now and it's been running for 2 hours now, with almost
> >nothing
> > > else on the desktop.
> > > So apparently the join between matches and the other tables is still 
too
> >
> > > heavy....
> > > Any idea?
> ><snip>
> >
> >OK, I reviewed what you have posted so far and I found a performance
> >killer.  On the table "matches", the columns protID and drugID are
> >declared as text. This is "bad" for searching as you can only index the
> >first portion of any text column. Those columns should be declared as 
CHAR
> >  or VARCHAR or better yet, some integer value. If all 3 columns in the
> >"matches" table are integers (INT or  BIGINT, preferably UNSIGNED) then
> >this becomes a fixed-width table and lookups become exceedingly fast. 
All
> >indexes on those columns also become number-based and numeric 
comparisons
> >occur *much* faster than string comparisons. I very rarely use 
non-numeric
> >primary keys for just this reason.
> >
> >You mentioned there were indexes on the table and provided the output 
of
> >DESC for the table  but DESC does a very poor job of actually 
describing
> >indexes. I prefer the output of SHOW CREATE TABLE xxxx\G  as it gives 
me a
> >complete table creation statement(Use /G and not ; to eliminate a lot 
of
> >excess formatting in the output). Can you generate that for me, please? 
I
> >practically guarantee that if we re-tool that table (including the
> >indexes), our query times will drop like rocks.
> >
> >You can kill a single query through the commands SHOW [FULL] 
PROCESSLIST
> >(to identify the # of the process you want to kill) and KILL # (using 
the
> ># you just looked up). This usually drops the connection to the client
> >running the query you killed, too (so be prepared to reconnect).
> >
> >http://dev.mysql.com/doc/mysql/en/show-processlist.html
> >http://dev.mysql.com/doc/mysql/en/kill.html
> >
> >I just realized that we are only querying for the search conditions 
DRUG
> >like 'a%' AND PROTIEN like 'a%'. If we wanted to do the OR of that 
(DRUG
> >like ... OR Protien like ...) we could use a UNION query to generate
> >tmpMatch
> >
> >CREATE TEMPORARY TABLE tmpMatch (key sentID)
> >(SELECT m.sentID, m.drugID, m.protID, tsd.drugSyn, tsp.protSyn
> >FROM matches m
> >INNER JOIN tmpSynDrugs tsd
> >         ON tsd.nameID = m.drugID
> >INNER JOIN tmpSynProt tsp
> >         ON tsp.nameID = m.protID)
> >UNION
> >(SELECT m.sentID, m.drugID, m.protID, tsd.drugSyn, tsp.protSyn
> >FROM matches m
> >INNER JOIN tmpSynDrugs tsd
> >         ON tsd.nameID = m.drugID
> >INNER JOIN tmpSynProt tsp
> >         ON tsp.nameID = m.protID);
> >
> >Shawn Green
> >Database Administrator
> >Unimin Corporation - Spruce Pine
> 

I would modify your design a bit more. I would setup a new Protein table 
and Drug table that looks like

create table xxx (
        id bigint unsigned not null auto_increment
        , shortname varchar(20)
        , fullname varchar(50)
)

(Where xxx is either Protein or Drug)That way I could change matches to 
look like:

CREATE TABLE matches (
        protien_ID bigint unsigned not null
        , drug_id bigint unsigned not null
        , sent_id bigint unsigned not null
        , KEY (drug_id, prot_id, sent_id)
        , KEY (drug_id, sent_id)
        , KEY (prot_id, sent_id)
        , KEY (sent_id, drug_id, prot_id)
)

I chose "bigint unsigned" so that you get the most possible records (
18446744073709551615) in each table. Odds are that you will run out of 
disk space long before you reach that value. I created multiple indexes on 
the matches table so that you are indexed for retrieval speed, not for 
INSERT processing.  The different combinations if KEYS represent many of 
the 3 types of queries that you will probably see most often: query on 
both drug and protein, query on either drug or protein, and query by 
sentence. I did not create a PRIMARY key as it could be possible that the 
same drug or protein is mentioned more than once in the same sentence so 
you may need to be able to store all individual references. If that is not 
the case, I would make the first KEY a PRIMARY KEY.

I also noticed that you have a type mismatch between your sentence and 
your matches table. In sentence you declare id as INT UNSIGNED but in 
matches that column is only an INT. Even if you do nothing else, you need 
to give that column the same data type in both tables.

I would also suggest you modify your synonyms table to generate a numeric 
PRIMARY KEY. 
 Create Table: CREATE TABLE `synonyms` (
  `id` bigint unsigned not null auto_increment
  `nameID` varchar(50) default NULL,
  `syn` text,
  PRIMARY KEY id
  UNIQUE KEY `c` (`nameID`(20),`syn`(20))
) TYPE=MyISAM

That way you can use a synonym.id (a number) wherever you need to 
reference a synonym and stop using the nameID. As I think more about it, 
using one table to represent BOTH drug synonyms and protein synonyms 
doesn't make good sense to me. I would probably split the synonyms into 
two separate tables, one for each data type (drug or protein). Sure some 
syn values will be the same between the two tables but it just makes 
better sense to me to do it as two tables, not one.

I know I just mentioned a lot of changes. Do not think I am asking you do 
make them all. I am just trying to let you compare how someone else would 
approach this design to your own. If this gives you some ideas, that's 
great. I just don't want you to think that I am saying "your way is wrong, 
my way is right" as there are frequently many workable solutions to any 
problem and I KNOW I am not always right.

We'll get it going... :-)

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

Re: help on query/group by

Reply via email to