Re: can't get the order I want after inserting new rows

Michael Stassen Tue, 11 Jan 2005 09:03:52 -0800

Marcus Claesson wrote:

Hi Michael!

Before the holiday you tried to help me with a problem that I had
over-simplified and made more confusing than it was. I'll now try again,
and also by giving the columns their right names.

My table consists of parsed outputs from a bioinformatics tool called
blast.

I've heard of blast (my wife has a PhD in Genetics). I'm going to pretend your table is named "blast" rather than "table" in what follows.

Every 'sbj_name' (which can be full sentences and probably unsuitable for '=' with other sbj_names)


which suggests a table design change I'll describe below.

has one 'sbj_count' and at
least one 'hsp_count'. The higher the 'score' the lower the 'sbj_count'.

So, If I understand correctly, sbj_count is not a count at all, but a rank. It orders the sbj_names according to their highest scores.

My perl program, which uses the table, needs a certain order of the
query output to work. It should also be capable of adding new data into
the table. After adding new data the query output should have the same
structure, thus (only) column sbj_count needs to be updated. For
example, this is my old data:


{I edited a little for readability}

sbj_name    sbj_count    hsp_count    score
--------    -------      -------      -----
a               1           1           900
a               1           2           500
b               2           1           800
c               3           1           700
c               3           2           600
c               3           3           500

and this is new:

sbj_name    sbj_count    hsp_count    score
--------    -------      -------      -----
d               1           1          1000
d               1           2           400
e               2           1           900
e               2           2           500
f               3           1           700
g               4           1           600

If I now would ask the same query (SELECT * from blast ORDER BY
sbj_count,hsp_count) that gave the correct output of the new data I
would get:

sbj_name    sbj_count    hsp_count    score
--------    -------      -------      -----
a               1           1           900
a               1           2           500
d               1           1          1000
d               1           2           400
b               2           1           800
e               2           1           900
e               2           2           500
c               3           1           700
c               3           2           600
c               3           3           500
f               3           1           700
g               4           1           600


Actually, no.  That query would give

 sbj_name | sbj_count | hsp_count | score |
----------+-----------+-----------+-------+
 a        |         1 |         1 |   900 |
 d        |         1 |         1 |  1000 |
 a        |         1 |         2 |   500 |
 d        |         1 |         2 |   400 |
...

To get the output you show, you would need to use

  SELECT * FROM blast ORDER BY sbj_count, sbj_name, hsp_count;

but I suppose that's moot, as it's not what you want.

Thus the new sbj_names share sbj_count with the old sbj_names.
Instead I want SELECT * from table ORDER BY sbj_count,hsp_count: sbj_name sbj_count hsp_count score -------- ------- ------- ----- d 1 1 1000 d 1 2 400 a 2 1 900 a 2 2 500 e 3 1 900 e 3 2 500 b 4 1 800 c 5 1 700 c 5 2 600 c 5 3 500 f 6 1 700 g 7 1 600
As you can see sbj_count has been updated based on the score value, but
has also given the same sbj_count value to all the same sbj_name, and
thus kept them together in the output.
Hope this one is a bit clearer.

I think so. You want to get the high score for each sbj_name, then order all the data according to those high scores. The sbj_count column is supposed to hold the sbj_name ranks based on the high scores.

Usually, you do not store values in a column which can be calculated from the values in other columns, as they are redundant. That is, we would normally not have a sbj_count column as it wastes space. Also, until we fix it, it contains erroneous values after insertions.

I believe this is fundamentally a two-step process. First we find the high scores and compute the ranks, saving the result in a temporary table:

 CREATE TEMPORARY TABLE ranks (rank INT NOT NULL AUTO_INCREMENT PRIMARY KEY)
   SELECT sbj_name, MAX(score) AS high_score FROM blast
   GROUP BY sbj_name ORDER BY high_score DESC;

This gives the following:

  SELECT * FROM ranks;
+------+----------+------------+
| rank | sbj_name | high_score |
+------+----------+------------+
|    1 | d        |       1000 |
|    2 | a        |        900 |
|    3 | e        |        900 |
|    4 | b        |        800 |
|    5 | c        |        700 |
|    6 | f        |        700 |
|    7 | g        |        600 |
+------+----------+------------+
7 rows in set (0.01 sec)

Then join the original table with the temporary table to get the order you want:

  SELECT b.sbj_name, r.rank, b.hsp_count, b.score
  FROM ranks r JOIN blast b ON r.sbj_name = b.sbj_name
  ORDER BY r.rank, b.hsp_count;
+----------+------+-----------+-------+
| sbj_name | rank | hsp_count | score |
+----------+------+-----------+-------+
| d        |    1 |         1 |  1000 |
| d        |    1 |         2 |   400 |
| a        |    2 |         1 |   900 |
| a        |    2 |         2 |   500 |
| e        |    3 |         1 |   900 |
| e        |    3 |         2 |   500 |
| b        |    4 |         1 |   800 |
| c        |    5 |         1 |   700 |
| c        |    5 |         2 |   600 |
| c        |    5 |         3 |   500 |
| f        |    6 |         1 |   700 |
| g        |    7 |         1 |   600 |
+----------+------+-----------+-------+
12 rows in set (0.00 sec)

Finally, we should clean up:

  DROP TABLE ranks;

Notice we did this without any reference to sbj_count! As I mentioned above, we don't really need it, as we can calculate its values on the fly. On the other hand, if you will be retrieving the data far more frequently than you will be changing it, it may be convenient (faster) to keep the calculated ranks in a column (sbj_count) rather than creating and joining a temporary table every time. In that case, create the temporary table as above, then

  UPDATE blast, ranks
  SET blast.sbj_count = ranks.rank
  WHERE blast.sbj_name=ranks.sbj_name;

But remember, that column will be inaccurate during the time between inserting new rows and running this update. You (your app) will be responsible for making sure this happens every time data is added. You will also have to decide whether this will present a potential problem for users (getting incorrect results with unlucky timing), in which case you may need to lock the table or use transactions to prevent this. That's one advantage of just doing this on the fly -- you avoid this issue completely.

Finally, I'd like to respond to your description of sbj_name as "...full sentences and probably unsuitable for '=' with other sbj_names". Note that the above solution relies on just such a comparison. Also, your table currently repeats that long information on multiple rows. Normally, you'd keep such data in a separate table:

CREATE TABLE sbj (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, name CHAR(1) ); {I used CHAR(1) for name because it fits the sample data. You would, of course, use whatever column type you currently have for sbj_name.}

Assuming you already have data you wish to keep, you can fill this table with the sbj_names you already have:

  INSERT INTO sbj (name) SELECT sbj_name FROM blast GROUP BY sbj_name;

Now we change blast to refeference sbj by id rather than name:

  ALTER TABLE blast ADD COLUMN sbj_id INT NOT NULL AFTER sbj_name;

Assuming existing data, put the correct ids in the new column:

  UPDATE blast, sbj
  SET blast.sbj_id = sbj.id
  WHERE blast.sbj_name = sbj.name;

Now get rid of the no-longer-needed sbj_name column:

  ALTER TABLE blast DROP COLUMN sbj_name;

Going forward, new names are inserted into table sbj, and their ids are used in the inserts into table blast. Then, we modify my solution above accordingly:

# create ranks table
 CREATE TEMPORARY TABLE ranks (rank INT NOT NULL AUTO_INCREMENT PRIMARY KEY)
   SELECT sbj_id, MAX(score) AS high_score FROM blast
   GROUP BY sbj_id ORDER BY high_score DESC;

# order results on the fly
  SELECT s.name, r.rank, b.hsp_count, b.score
  FROM ranks r
  JOIN blast b ON r.sbj_id = b.sbj_id
  JOIN sbj s ON b.sbj_id = s.id
  ORDER BY r.rank, b.hsp_count;

# OR
# update sbj_count for future use
  UPDATE blast, ranks
  SET blast.sbj_count = ranks.rank
  WHERE blast.sbj_id=ranks.sbj_id;

# clean up
  DROP TABLE ranks

If you update the ranks in sbj_count, then your select would be:

  SELECT s.name, b.sbj_count, b.hsp_count, b.score
  FROM blast b
  JOIN sbj s ON b.sbj_id = s.id
  ORDER BY b.sbj_count, b.hsp_count;

Many thanks!

I hope that helps. I did assume you have already handled correctly assigning hsp_count according to the score order within a particular sbj. Let me know if I've misunderstood.

Marcus


Michael

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: can't get the order I want after inserting new rows

Reply via email to