Re: combining information from 4 tables into a fifth.

SGreen Tue, 10 Aug 2004 09:45:40 -0700

I don't know about everyone else but I would build scores like this:

Create an empty scores table. Make gene_name the Primary Key. Allow the 
other columns to be null or give them a default of 0.0. Whatever that 
value is will represent "no score available" (I suggest null as a score of 
0 may be a possible value)


Insert the contents of one of your 3 feeder table into scores (I picked 
g2d):

INSERT scores (gene_name, g2d)
SELECT gene_name, score
FROM g2d

What I would like to be able to do next is an "INSERT...SELECT ... ON 
DUPLICATE KEY UPDATE" statement for the other two tables.  However, that 
INSERT form is not available 
(http://dev.mysql.com/doc/mysql/en/INSERT.html) So what I wind up doing is 
creating a temporary table of IDs that I will need to update. I use that 
list to help me merge the data for the two remaining tables.

CREATE TEMPORARY TABLE tmpMatches
SELECT DISTINCT s.gene_name
FROM scores s
INNER JOIN dgp
        ON dgp.gene_name = s.gene_name;

ALTER TABLE tmpMatches ADD KEY(gene_name);

INSERT scores (gene_name, dgp)
SELECT gene_name, score
FROM dgp
LEFT JOIN tmpMatches tm
        ON tm.gene_name = dgp.gene_name
WHERE tm.gene_name is null;

UPDATE scores s
INNER JOIN dgp 
        ON dgp.gene_name = s.gene_name
SET s.dgp = dgp.score;

TRUNCATE TABLE tmpMatches;

INSERT tmpMatches (gene_name)
SELECT DISTINCT s.gene_name
FROM scores s
INNER JOIN pocus p
        ON p.gene_name = s.gene_name;

INSERT scores (gene_name, pocus)
SELECT gene_name, score
FROM pocus p
LEFT JOIN tmpMatches tm
        ON tm.gene_name = p.gene_name
WHERE tm.gene_name is null;

UPDATE scores s
INNER JOIN pocus p 
        ON p.gene_name = s.gene_name
SET s.pocus = p.score;

DROP TABLE tmpMatches;


What you should end up with is that scores should list each gene only once 
with the score values from the other tables (where they are available). 
Wherever you have null values will indicate no data from that table.

Notes on technique: 
        It is much faster for the query engine to find matches than 
non-matches (That's why I build a table of gene_names that already exist 
in both tables) . There should already be indexes on both tables so this 
will be a very fast comparison.

        The index on my temporary table considerably speeds up the LEFT 
JOIN in the INSERT in the next step. (reducing the cost of our 
non-matching query)

        The UPDATE is also joining tables on indexed columns so it should 
also execute very quickly.

        Using this same basic method to combine columns between similar 
tables, just like your situation, I have merged two 500,000 row tables *on 
a laptop* in less than 15 seconds

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

Jeroen Van Goey <[EMAIL PROTECTED]> wrote on 08/10/2004 11:44:29 
AM:

> Hey all,
> 
> I've three tables (g2d, dgp and pocus), with a gene-name, a score and
> some other information:
> 
> mysql> select * from dgp;
> +--------+---------------+--------------
> +------------------------------+------------+--------+
> | chr    | start         | stop         | gene_name             |
> score   | strand |
> +--------+---------------+--------------
> +------------------------------+------------+--------+
> | chr13 | 69810099 | 70239331 | ENSG00000165659 | 738.681 |  -      |
> | chr13 | 64676788 | 65600573 | ENSG00000184226 | 627.447 |  -      |
> | chr13 | 59781822 | 59787282 | ENSG00000150506 | 619.467 |  +     |
> ....
> 
> And I've a table with a lot of gene-name's. Now I want to combine
> these 4 tables into a fifth table, named score, wich list in the first
> column all the gene names, and in the three succesive columns the
> score for that gene from the g2d, dgp, and pocus tables,  if a score
> is found. Otherwise the default value should be 0.00
> Like this:
> 
> mysql> select * from score;
> +------------------------------+-----------------+------------------
> +----------------------+
> | gene_name               | score_g2d  | score_dgp   |  score_pocus  | 
> +------------------------------+-----------------+------------------
> +----------------------+
>  | ENSG00000165659 |  0.000         | 738.681       | 0.00 |
>  | ENSG00000184226 |  123,123     | 627.447       | 0.00               |
>  | ENSG00000999999 |  0.000         |  0.00           | 987,987 |
> ......
> 
> How should I construct my query? I've been trying for hours now, and
> can't get it rigth. Thanks in advance.
> 
> -- 
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
> To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]
>

Re: combining information from 4 tables into a fifth.

Reply via email to