Dan Bolser <[EMAIL PROTECTED]> wrote on 04/11/2005 11:50:31 AM:

> On Mon, 11 Apr 2005, Dan Bolser wrote:
> 
> >
> >Requirement: 
> >
> >Given two columns of a table (Column1 and Column2) of length x, return 
two
> >columns (Column1_Scram and Column2_Scram) such that the distributions 
of
> >values in Column1 and Column2 are preserved in Column1_Scram and
> >Column2_Scram, but the pairs of values are randomized.
> >
> >
> >Solution suggested by Shawn Green: 
> >
> >Create a table with two columns, and populate this table with random 
pairs
> >of primary keys picked from the original table. Additionally, allow no
> >duplicate primary keys within either column. Select x rows from this
> >table, and link both primary keys (the primary key pair) back to the
> >original table to get the appropriate number of randomized pairs of
> >Column1 and Column2.
> >
> >He suggests doing the above like this (more or less):
> >
> >OriginalTable
> >PK   A   B
> >1   a   c
> >2   a   d
> >3   b   e
> >...
> >
> >CREATE TEMPORARY TABLE IntermediateTable (
> >  PK1  INT     NOT NULL,
> >  A    CHAR(1) NOT NULL,
> >  PK2  INT     NOT NULL,
> >  B    CHAR(1) NOT NULL,
> >  #
> >  UNIQUE INDEX (PK1,A),
> >  UNIQUE INDEX (PK2,B)
> >);
> >
> >INSERT IGNORE INTO IntermediateTable
> >SELECT 
> >  x.PK, x.A,
> >  y.PK, y.B
> >FROM 
> >  OriginalTable x,
> >  OriginalTable y
> >ORDER BY
> >  RAND();
> >
> >SELECT 
> >  x.A,
> >  y.B
> >FROM
> >  IntermediateTable 
> >INNER JOIN 
> >  OriginalTable x ON (PK1 = x.PK) INNER JOIN
> >  OriginalTable y ON (PK2 = y.PK)
> >LIMIT 
> >  the_length_of_OriginalTable;
> >
> >
> >The problem with this solution: 
> >
> >Its too slow on reasonable sized tables! 
> 
> 
> Their is also a problem with the way RAND() works...
> 
> SELECT
>   x.PK, x.A,
>   y.PK, y.B
> FROM
>   OriginalTable x,
>   OriginalTable y
> ORDER BY
>   RAND()
> LIMIT 
>   1;
> 
> This takes soooooo long to pick a random row. Some cleaver 'LIMIT'
> optimization could pick a results set almost instantly, instead of 
taking
> in excess of half an hour with ~50,000 rows.
> 
> 
> 

Let's try this. I will assume, because you used the PK hack, you have 
duplicate values in at least one of your sets. Let's cure the Rand() speed 
issue by adding a column to Original Table to hold a random number and 
eliminate the lookup problem. Since integer math is much faster than 
floating point math, we will set up this field as an integer field and 
fill it appropriately

ALTER TABLE OriginalTable ADD COLUMN RandomKey INT UNSIGNED;

UPDATE OriginalTable SET RandomKey = RAND()*2000000;

Let's also modify IntermediateTable like this:

DROP TABLE IntermediateTable;

CREATE TABLE FirstColumn
        id INT auto_increment
        , a char(1)
        , PRIMARY KEY (id) 
);

CREATE TABLE SecondColumn
        id INT auto_increment
        , b char(1)
        , PRIMARY KEY (id)
);

And populate the new tables:
INSERT FirstColumn (a)
SELECT a
FROM OriginalTable
ORDER BY PK1;

INSERT SecondColumn (b)
SELECT b
FROM OriginalTable
ORDER BY RandomKey;

Then get your randomized (A,B) pairs this way:

SELECT x.A, y.B
FROM FirstColumn x
INNER JOIN SecondColumn y
        on x.id = y.id;

This should be MUCH faster than 30 mins (I would guess on the order of 2 
or 3 at most). FirstColumn gets filled with data in original order, 
SecondColumn gets filled with data in random order (thanks to the random 
value). By creating new tables to cache those values we create two new 
contiguous auto_increment runs (this way you can analyze subsets of your 
original data and not need to worry about mismatching on the final INNER 
JOIN).

On the next pass, Re-run the UPDATE to assign new RAND() values to your 
data. Do not empty or refill FirstColum. Execute a "TRUNCATE TABLE 
SecondColumn;" then refill it (INSERT SecondColumn...) and repeat the 
final query. 

HTH!!

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

Reply via email to