On Mon, 11 Apr 2005 [EMAIL PROTECTED] wrote: >Dan Bolser <[EMAIL PROTECTED]> wrote on 04/11/2005 11:50:31 AM: > >> On Mon, 11 Apr 2005, Dan Bolser wrote: >> >> > >> >Requirement: >> > >> >Given two columns of a table (Column1 and Column2) of length x, return >two >> >columns (Column1_Scram and Column2_Scram) such that the distributions >of >> >values in Column1 and Column2 are preserved in Column1_Scram and >> >Column2_Scram, but the pairs of values are randomized. >> > >> > >> >Solution suggested by Shawn Green: >> > >> >Create a table with two columns, and populate this table with random >pairs >> >of primary keys picked from the original table. Additionally, allow no >> >duplicate primary keys within either column. Select x rows from this >> >table, and link both primary keys (the primary key pair) back to the >> >original table to get the appropriate number of randomized pairs of >> >Column1 and Column2. >> > >> >He suggests doing the above like this (more or less): >> > >> >OriginalTable >> >PK A B >> >1 a c >> >2 a d >> >3 b e >> >... >> > >> >CREATE TEMPORARY TABLE IntermediateTable ( >> > PK1 INT NOT NULL, >> > A CHAR(1) NOT NULL, >> > PK2 INT NOT NULL, >> > B CHAR(1) NOT NULL, >> > # >> > UNIQUE INDEX (PK1,A), >> > UNIQUE INDEX (PK2,B) >> >); >> > >> >INSERT IGNORE INTO IntermediateTable >> >SELECT >> > x.PK, x.A, >> > y.PK, y.B >> >FROM >> > OriginalTable x, >> > OriginalTable y >> >ORDER BY >> > RAND(); >> > >> >SELECT >> > x.A, >> > y.B >> >FROM >> > IntermediateTable >> >INNER JOIN >> > OriginalTable x ON (PK1 = x.PK) INNER JOIN >> > OriginalTable y ON (PK2 = y.PK) >> >LIMIT >> > the_length_of_OriginalTable; >> > >> > >> >The problem with this solution: >> > >> >Its too slow on reasonable sized tables! >> >> >> Their is also a problem with the way RAND() works... >> >> SELECT >> x.PK, x.A, >> y.PK, y.B >> FROM >> OriginalTable x, >> OriginalTable y >> ORDER BY >> RAND() >> LIMIT >> 1; >> >> This takes soooooo long to pick a random row. Some cleaver 'LIMIT' >> optimization could pick a results set almost instantly, instead of >taking >> in excess of half an hour with ~50,000 rows. >> >> >> > >Let's try this. I will assume, because you used the PK hack, you have >duplicate values in at least one of your sets. Let's cure the Rand() speed >issue by adding a column to Original Table to hold a random number and >eliminate the lookup problem. Since integer math is much faster than >floating point math, we will set up this field as an integer field and >fill it appropriately > >ALTER TABLE OriginalTable ADD COLUMN RandomKey INT UNSIGNED; > >UPDATE OriginalTable SET RandomKey = RAND()*2000000; > >Let's also modify IntermediateTable like this: > >DROP TABLE IntermediateTable; > >CREATE TABLE FirstColumn > id INT auto_increment > , a char(1) > , PRIMARY KEY (id) >); > >CREATE TABLE SecondColumn > id INT auto_increment > , b char(1) > , PRIMARY KEY (id) >); > >And populate the new tables: >INSERT FirstColumn (a) >SELECT a >FROM OriginalTable >ORDER BY PK1; > >INSERT SecondColumn (b) >SELECT b >FROM OriginalTable >ORDER BY RandomKey; > >Then get your randomized (A,B) pairs this way: > >SELECT x.A, y.B >FROM FirstColumn x >INNER JOIN SecondColumn y > on x.id = y.id; > >This should be MUCH faster than 30 mins (I would guess on the order of 2 >or 3 at most). FirstColumn gets filled with data in original order, >SecondColumn gets filled with data in random order (thanks to the random >value). By creating new tables to cache those values we create two new >contiguous auto_increment runs (this way you can analyze subsets of your >original data and not need to worry about mismatching on the final INNER >JOIN). > >On the next pass, Re-run the UPDATE to assign new RAND() values to your >data. Do not empty or refill FirstColum. Execute a "TRUNCATE TABLE >SecondColumn;" then refill it (INSERT SecondColumn...) and repeat the >final query. > >HTH!!
I finally got round to trying this out - It works really well! I did the following... DROP TABLE IF EXISTS rand_a; CREATE TABLE rand_a ( PK MEDIUMINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT, A_id INT UNSIGNED NOT NULL, # INDEX (A_id) ); DROP TABLE IF EXISTS rand_b; CREATE TABLE rand_b ( PK MEDIUMINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT, B_id INT UNSIGNED NOT NULL, # INDEX (b_id) ); INSERT INTO rand_a (A_id) # SELECT A_id FROM x ORDER BY RAND(); INSERT INTO rand_b (B_id) # SELECT B_id FROM x ORDER BY RAND(); I then get the full list of random pairs (with the same marginal distribution as before) with... SELECT * FROM rand_a INNER JOIN rand_b USING (PK); I chekcked my marginals like this... SELECT COUNT(*), COUNT(DISTINCT A_id), COUNT(DISTINCT B_id) FROM nrints x; SELECT COUNT(*), COUNT(DISTINCT A_id), COUNT(DISTINCT B_id) FROM rand_a INNER JOIN rand_b USING (PK); And they were both identical - which is what I wanted! The whole thing takes a couple of seconds on the original 'pair' table which has 20,000 rows. Thank you! > >Shawn Green >Database Administrator >Unimin Corporation - Spruce Pine > -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]