Dan Bolser <[EMAIL PROTECTED]> wrote on 04/11/2005 11:50:31 AM: > On Mon, 11 Apr 2005, Dan Bolser wrote: > > > > >Requirement: > > > >Given two columns of a table (Column1 and Column2) of length x, return two > >columns (Column1_Scram and Column2_Scram) such that the distributions of > >values in Column1 and Column2 are preserved in Column1_Scram and > >Column2_Scram, but the pairs of values are randomized. > > > > > >Solution suggested by Shawn Green: > > > >Create a table with two columns, and populate this table with random pairs > >of primary keys picked from the original table. Additionally, allow no > >duplicate primary keys within either column. Select x rows from this > >table, and link both primary keys (the primary key pair) back to the > >original table to get the appropriate number of randomized pairs of > >Column1 and Column2. > > > >He suggests doing the above like this (more or less): > > > >OriginalTable > >PK A B > >1 a c > >2 a d > >3 b e > >... > > > >CREATE TEMPORARY TABLE IntermediateTable ( > > PK1 INT NOT NULL, > > A CHAR(1) NOT NULL, > > PK2 INT NOT NULL, > > B CHAR(1) NOT NULL, > > # > > UNIQUE INDEX (PK1,A), > > UNIQUE INDEX (PK2,B) > >); > > > >INSERT IGNORE INTO IntermediateTable > >SELECT > > x.PK, x.A, > > y.PK, y.B > >FROM > > OriginalTable x, > > OriginalTable y > >ORDER BY > > RAND(); > > > >SELECT > > x.A, > > y.B > >FROM > > IntermediateTable > >INNER JOIN > > OriginalTable x ON (PK1 = x.PK) INNER JOIN > > OriginalTable y ON (PK2 = y.PK) > >LIMIT > > the_length_of_OriginalTable; > > > > > >The problem with this solution: > > > >Its too slow on reasonable sized tables! > > > Their is also a problem with the way RAND() works... > > SELECT > x.PK, x.A, > y.PK, y.B > FROM > OriginalTable x, > OriginalTable y > ORDER BY > RAND() > LIMIT > 1; > > This takes soooooo long to pick a random row. Some cleaver 'LIMIT' > optimization could pick a results set almost instantly, instead of taking > in excess of half an hour with ~50,000 rows. > > >
Let's try this. I will assume, because you used the PK hack, you have duplicate values in at least one of your sets. Let's cure the Rand() speed issue by adding a column to Original Table to hold a random number and eliminate the lookup problem. Since integer math is much faster than floating point math, we will set up this field as an integer field and fill it appropriately ALTER TABLE OriginalTable ADD COLUMN RandomKey INT UNSIGNED; UPDATE OriginalTable SET RandomKey = RAND()*2000000; Let's also modify IntermediateTable like this: DROP TABLE IntermediateTable; CREATE TABLE FirstColumn id INT auto_increment , a char(1) , PRIMARY KEY (id) ); CREATE TABLE SecondColumn id INT auto_increment , b char(1) , PRIMARY KEY (id) ); And populate the new tables: INSERT FirstColumn (a) SELECT a FROM OriginalTable ORDER BY PK1; INSERT SecondColumn (b) SELECT b FROM OriginalTable ORDER BY RandomKey; Then get your randomized (A,B) pairs this way: SELECT x.A, y.B FROM FirstColumn x INNER JOIN SecondColumn y on x.id = y.id; This should be MUCH faster than 30 mins (I would guess on the order of 2 or 3 at most). FirstColumn gets filled with data in original order, SecondColumn gets filled with data in random order (thanks to the random value). By creating new tables to cache those values we create two new contiguous auto_increment runs (this way you can analyze subsets of your original data and not need to worry about mismatching on the final INNER JOIN). On the next pass, Re-run the UPDATE to assign new RAND() values to your data. Do not empty or refill FirstColum. Execute a "TRUNCATE TABLE SecondColumn;" then refill it (INSERT SecondColumn...) and repeat the final query. HTH!! Shawn Green Database Administrator Unimin Corporation - Spruce Pine