Re: SCRAMBLE(A,B) (was UDF:Request).

Dan Bolser Tue, 24 May 2005 02:12:57 -0700

On Mon, 11 Apr 2005 [EMAIL PROTECTED] wrote:

>Dan Bolser <[EMAIL PROTECTED]> wrote on 04/11/2005 11:50:31 AM:
>
>> On Mon, 11 Apr 2005, Dan Bolser wrote:
>> 
>> >
>> >Requirement: 
>> >
>> >Given two columns of a table (Column1 and Column2) of length x, return 
>two
>> >columns (Column1_Scram and Column2_Scram) such that the distributions 
>of
>> >values in Column1 and Column2 are preserved in Column1_Scram and
>> >Column2_Scram, but the pairs of values are randomized.
>> >
>> >
>> >Solution suggested by Shawn Green: 
>> >
>> >Create a table with two columns, and populate this table with random 
>pairs
>> >of primary keys picked from the original table. Additionally, allow no
>> >duplicate primary keys within either column. Select x rows from this
>> >table, and link both primary keys (the primary key pair) back to the
>> >original table to get the appropriate number of randomized pairs of
>> >Column1 and Column2.
>> >
>> >He suggests doing the above like this (more or less):
>> >
>> >OriginalTable
>> >PK   A   B
>> >1   a   c
>> >2   a   d
>> >3   b   e
>> >...
>> >
>> >CREATE TEMPORARY TABLE IntermediateTable (
>> >  PK1  INT     NOT NULL,
>> >  A    CHAR(1) NOT NULL,
>> >  PK2  INT     NOT NULL,
>> >  B    CHAR(1) NOT NULL,
>> >  #
>> >  UNIQUE INDEX (PK1,A),
>> >  UNIQUE INDEX (PK2,B)
>> >);
>> >
>> >INSERT IGNORE INTO IntermediateTable
>> >SELECT 
>> >  x.PK, x.A,
>> >  y.PK, y.B
>> >FROM 
>> >  OriginalTable x,
>> >  OriginalTable y
>> >ORDER BY
>> >  RAND();
>> >
>> >SELECT 
>> >  x.A,
>> >  y.B
>> >FROM
>> >  IntermediateTable 
>> >INNER JOIN 
>> >  OriginalTable x ON (PK1 = x.PK) INNER JOIN
>> >  OriginalTable y ON (PK2 = y.PK)
>> >LIMIT 
>> >  the_length_of_OriginalTable;
>> >
>> >
>> >The problem with this solution: 
>> >
>> >Its too slow on reasonable sized tables! 
>> 
>> 
>> Their is also a problem with the way RAND() works...
>> 
>> SELECT
>>   x.PK, x.A,
>>   y.PK, y.B
>> FROM
>>   OriginalTable x,
>>   OriginalTable y
>> ORDER BY
>>   RAND()
>> LIMIT 
>>   1;
>> 
>> This takes soooooo long to pick a random row. Some cleaver 'LIMIT'
>> optimization could pick a results set almost instantly, instead of 
>taking
>> in excess of half an hour with ~50,000 rows.
>> 
>> 
>> 
>
>Let's try this. I will assume, because you used the PK hack, you have 
>duplicate values in at least one of your sets. Let's cure the Rand() speed 
>issue by adding a column to Original Table to hold a random number and 
>eliminate the lookup problem. Since integer math is much faster than 
>floating point math, we will set up this field as an integer field and 
>fill it appropriately
>
>ALTER TABLE OriginalTable ADD COLUMN RandomKey INT UNSIGNED;
>
>UPDATE OriginalTable SET RandomKey = RAND()*2000000;
>
>Let's also modify IntermediateTable like this:
>
>DROP TABLE IntermediateTable;
>
>CREATE TABLE FirstColumn
>        id INT auto_increment
>        , a char(1)
>        , PRIMARY KEY (id) 
>);
>
>CREATE TABLE SecondColumn
>        id INT auto_increment
>        , b char(1)
>        , PRIMARY KEY (id)
>);
>
>And populate the new tables:
>INSERT FirstColumn (a)
>SELECT a
>FROM OriginalTable
>ORDER BY PK1;
>
>INSERT SecondColumn (b)
>SELECT b
>FROM OriginalTable
>ORDER BY RandomKey;
>
>Then get your randomized (A,B) pairs this way:
>
>SELECT x.A, y.B
>FROM FirstColumn x
>INNER JOIN SecondColumn y
>        on x.id = y.id;
>
>This should be MUCH faster than 30 mins (I would guess on the order of 2 
>or 3 at most). FirstColumn gets filled with data in original order, 
>SecondColumn gets filled with data in random order (thanks to the random 
>value). By creating new tables to cache those values we create two new 
>contiguous auto_increment runs (this way you can analyze subsets of your 
>original data and not need to worry about mismatching on the final INNER 
>JOIN).
>
>On the next pass, Re-run the UPDATE to assign new RAND() values to your 
>data. Do not empty or refill FirstColum. Execute a "TRUNCATE TABLE 
>SecondColumn;" then refill it (INSERT SecondColumn...) and repeat the 
>final query. 
>
>HTH!!



I finally got round to trying this out - It works really well!

I did the following...

DROP   TABLE IF EXISTS rand_a;
CREATE TABLE           rand_a (
  PK            MEDIUMINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  A_id          INT       UNSIGNED NOT NULL,
  #
  INDEX (A_id)
);

DROP   TABLE IF EXISTS rand_b;
CREATE TABLE           rand_b (
  PK            MEDIUMINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  B_id          INT       UNSIGNED NOT NULL,
  #
  INDEX (b_id)
);

INSERT INTO rand_a (A_id)
#
SELECT A_id FROM x ORDER BY RAND();

INSERT INTO rand_b (B_id)
#
SELECT B_id FROM x ORDER BY RAND();


I then get the full list of random pairs (with the same marginal
distribution as before) with...

    SELECT *
      FROM rand_a 
INNER JOIN rand_b 
     USING (PK);


I chekcked my marginals like this...

    SELECT COUNT(*), COUNT(DISTINCT A_id), COUNT(DISTINCT B_id)
      FROM nrints x;

    SELECT COUNT(*), COUNT(DISTINCT A_id), COUNT(DISTINCT B_id)
      FROM rand_a 
INNER JOIN rand_b 
     USING (PK);

And they were both identical - which is what I wanted! The whole thing
takes a couple of seconds on the original 'pair' table which has 20,000
rows.

Thank you!

>
>Shawn Green
>Database Administrator
>Unimin Corporation - Spruce Pine
>


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: SCRAMBLE(A,B) (was UDF:Request).

Reply via email to