Jeroen,

One minor caveat here. You are grouping by gene_name, but selecting non-aggregate values from the chr and strand columns. Since they aren't part of the list of grouped columns, you'll get effectively randomly chosen values (probably the first found per group) for each from among the possible values per gene_name. (This is why other systems won't allow this.)

That is fine so long as the chr and strand values are unique per gene_name. I expect there is one chromosome per gene, so that should be no problem for the chr column. If there is a unique value for strand per gene_name, then you'll have no problem there, either. If there isn't, then the strand column in the new table won't be much use.

Sorry if that was obvious to you. My wife's the geneticist in the family. I haven't absorbed enough to have an expectation about uniqueness of strand values, and I wasn't sure if you had absorbed enough SQL to have an expectation about the non-grouped columns.

Michael

Lachlan Mulcahy wrote:

Hi Jeroen,

Have you tried creating a new table by selecting the data that you want and
then removing the old table and replacing it with the new one.

Eg.

CREATE TABLE new_g2d (
        <the precise definition for your g2d table goes here i imagine something
like the following>
        chr char(10) NOT NULL DEFAULT '',
        start int NOT NULL DEFAULT 0,
        stop int NOT NULL DEFAULT 0,
        gene_name char(64) NOT NULL DEFAULT '',
        score decimal(6,3) NOT NULL DEFAULT 000.000,
        strand char(5) NOT NULL DEFAULT '-'
) SELECT
        chr,
        MIN(start),
        MAX(stop),
        gene_name,
        AVG(score),
        strand
FROM
        g2d
GROUP BY gene_name;

ALTER TABLE g2d RENAME TO g2d_old;

ALTER TABLE new_g2d RENAME TO g2d;


When you are satisfied that your new g2d table is what you want you can safely drop your g2d_old table.

HTH,
Lachlan


-----Original Message----- From: Jeroen Van Goey [mailto:[EMAIL PROTECTED] Sent: Wednesday, 11 August 2004 9:16 PM To: [EMAIL PROTECTED] Subject: removing duplicates and giving average of other field.


Hi again,

thanks for the quick reply on my previous question, the solution given
by Michael Stassen worked like a charm. But now I've again run into a
situation where I'm stuck with my limited knowledge of MySQL.

Given the table:
mysql> select * from g2d;
+-------+----------+----------+-----------------+---------+--------+
| chr   | start    | stop     | gene_name       | score   | strand |
+-------+----------+----------+-----------------+---------+--------+
| chr13 | 58214248 | 58214733 | ENSG00000139734 | 774.518 | -      |
| chr13 | 58214732 | 58215034 | ENSG00000139734 | 774.518 | -      |
| chr13 | 58215238 | 58215432 | ENSG00000139734 | 774.518 | -      |
| chr13 | 58215021 | 58215182 | ENSG00000139734 | 774.518 | -      |
| chr13 | 58214732 | 58215031 | ENSG00000139734 | 755.930 | -      |
| chr13 | 58214296 | 58214727 | ENSG00000139734 | 755.930 | -      |
| chr13 | 58215045 | 58215098 | ENSG00000139734 | 755.930 | -      |
| chr13 | 65481104 | 65481424 | ENSG00000184226 | 896.699 | -      |
| chr13 | 65481119 | 65481424 | ENSG00000184226 | 809.759 | -      |
| chr13 | 65481116 | 65481355 | ENSG00000184226 | 809.759 | -      |
| chr13 | 65481113 | 65481421 | ENSG00000184226 | 808.113 | -      |
| chr13 | 65481092 | 65481424 | ENSG00000184226 | 796.108 | -      |
| chr13 | 65481110 | 65481424 | ENSG00000184226 | 796.011 | -      |
| chr13 | 65481098 | 65481358 | ENSG00000184226 | 796.011 | -      |
....

I would like to reduce the table, by combining all the records with
the same gene_name. The score field should then give the average of
all the scores for that gene_name.

The table above would become then:

mysql> select * from g2d;
+-------+----------+----------+-----------------+---------+--------+
| chr   | start    | stop     | gene_name       | score   | strand |
+-------+----------+----------+-----------------+---------+--------+
| chr13 | 58214248 | 58215432 | ENSG00000139734 | 766.551 | -      |
| chr13 | 65481092 | 65481424 | ENSG00000184226 | 819.408 | -      |

where 766.551 = (((774.518 * 4) + (755.930 *3)) / 7)
 and 819.408 = ((896.699 + (809.759 *2) + 808.113 + 796.108 + 796.011) / 6)

If there's also an easy way to make the start field the lowest value
found for that gene_name, and the stop field the highest value, that
would be nice, but it isn't that important for my application, so
don't bother to much with it.

Thanks in advance, I appreciate your input very much.



-- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]



Reply via email to