Re: SQL code sought to remove duplicates, using most recent date

Michael Stassen Sun, 21 Nov 2004 09:58:27 -0800


[EMAIL PROTECTED] wrote:

Hi!
I have a problem that may only be solvable using a programming approach, but
I'm wondering if there is an SQL method in MySQL.
The problem is that I have a table that looks like this: CREATE TABLE `test` ( `uid` mediumint(6) NOT NULL auto_increment, `perid` varchar(10) default NULL, `prog` char(2) default NULL, `startdt` date default NULL, `stopdt` date default NULL, PRIMARY KEY (`uid`) ) TYPE=MyISAM AUTO_INCREMENT=1 ;

Please see a listing (below) of some records, for this is my problem. The perid identifies people (person id is perid). The next code says what program they are in (always a 2 digit code). Then there is a startdate and a stopdate. Some records (id# 471) are repeated because people are in the same program and the startdates are on the same day, and the stopdates are the only way to cull out the "repeats". When I say "repeats" I mean they have the same perid, program and startdate.

This doesn't make sense to me. Are you saying that in your data model it is appropriate for the same person to start the same program on the same day 2 or 3 times? Or do you mean that it isn't, but your application has allowed that to happen, and now you are trying to clean up? It appears that you are changing a person's stopdt by adding a new row, rather than by updating the existing row. Ususally, that's not what you want.

Going forward, after you've removed the duplicates, I'd suggest adding a unique constraint to prevent duplicates:

  ALTER TABLE test
  ADD UNIQUE per_prog_start (perid, prog, startdt);

Then change the stopdt by updating the existing row, rather than adding a new row.

I would very much like to get a
listing for each perid that shows their perid, program, startdate and then
the latest (only the latest) stopdate.  This would give unique records if I
were only to examine the perid, program code and startdate.

        1 222 JP  2004-01-26 2006-01-26
        2  38 SW  2004-01-21 2004-04-21
        3 471 BP  2004-01-09 2004-06-01
        4 471 BP  2004-01-09 2004-06-06
        5 471 JP  2004-06-01 2004-06-08
        6 471 JP  2004-06-08 2006-06-08
        7 471 JP  2004-06-01 2006-06-15
        8 471 JP  2004-11-10 2006-11-10
        9 440 OT  2004-02-19 2004-04-01
       10 479 JP  2003-11-14 2003-12-02

So is there some way to do this using an SQL query?


Yes.

  SELECT perid, prog, startdt, MAX(stopdt)
  FROM test
  GROUP BY perid, prog, startdt;

Note that you can't get the uid with this simple query, because it isn't one of the grouped columns. If you need the uid, as I expect you do, it becomes a little more complicated.

  CREATE TEMPORARY TABLE tmp
  ( `perid` varchar(10) default NULL,
    `prog` char(2) default NULL,
    `startdt` date default NULL,
    `stopdt` date default NULL
  );

  LOCK TABLES test READ;

  INSERT INTO tmp
    SELECT perid, prog, startdt, MAX(stopdt)
    FROM test
    GROUP BY perid, prog, startdt;

  SELECT test.*
  FROM test JOIN tmp USING (perid, prog, startdt, stopdt)
  ORDER BY uid;

  UNLOCK TABLES;
  DROP TABLE tmp;

This method is documented in the manual <http://dev.mysql.com/doc/mysql/en/example-Maximum-column-group-row.html>, along with a simpler version using subqueries if you have mysql 4.1.

Of course, that just selects the rows you want. If you want to remove the duplicates from test, you have a few options. If there aren't too many, you could

  SELECT perid, prog, startdt
  FROM test
  GROUP BY perid, prog, startdt
  HAVING COUNT(stopdt) > 1;

to find the groups with duplicates, then manually remove the bad rows.

To have mysql do it, we modify the temp table solution from above:

  CREATE TEMPORARY TABLE tmp
  ( `perid` varchar(10) default NULL,
    `prog` char(2) default NULL,
    `startdt` date default NULL,
    `stopdt` date default NULL
  );

  LOCK TABLES test WRITE;

  INSERT INTO tmp
    SELECT perid, prog, startdt, MAX(stopdt)
    FROM test
    GROUP BY perid, prog, startdt;

  DELETE test
  FROM test LEFT JOIN tmp USING (perid, prog, startdt, stopdt)
  WHERE tmp.perid IS NULL;

  UNLOCK TABLES;
  DROP TABLE tmp;

Thanks so much for looking at this!

Cheers!

-Alex


Michael

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: SQL code sought to remove duplicates, using most recent date

Reply via email to