Requirement: 

Given two columns of a table (Column1 and Column2) of length x, return two
columns (Column1_Scram and Column2_Scram) such that the distributions of
values in Column1 and Column2 are preserved in Column1_Scram and
Column2_Scram, but the pairs of values are randomized.


Solution suggested by Shawn Green: 

Create a table with two columns, and populate this table with random pairs
of primary keys picked from the original table. Additionally, allow no
duplicate primary keys within either column. Select x rows from this
table, and link both primary keys (the primary key pair) back to the
original table to get the appropriate number of randomized pairs of
Column1 and Column2.

He suggests doing the above like this (more or less):

OriginalTable
PK      A       B
1       a       c
2       a       d
3       b       e
...

CREATE TEMPORARY TABLE IntermediateTable (
  PK1  INT     NOT NULL,
  A    CHAR(1) NOT NULL,
  PK2  INT     NOT NULL,
  B    CHAR(1) NOT NULL,
  #
  UNIQUE INDEX (PK1,A),
  UNIQUE INDEX (PK2,B)
);

INSERT IGNORE INTO IntermediateTable
SELECT 
  x.PK, x.A,
  y.PK, y.B
FROM 
  OriginalTable x,
  OriginalTable y
ORDER BY
  RAND();

SELECT 
  x.A,
  y.B
FROM
  IntermediateTable 
INNER JOIN 
  OriginalTable x ON (PK1 = x.PK) INNER JOIN
  OriginalTable y ON (PK2 = y.PK)
LIMIT 
  the_length_of_OriginalTable;


The problem with this solution: 

Its too slow on reasonable sized tables! I never get past the second step
with my data after 10 minutes!

I have 52699 rows in my 'OriginalTable' leading to ~2.7 billion checks
when inserting into the IntermediateTable... or rather 5.4 billion, as I
guess it has to check both rows for the UNIQUE key constraint on every
attempted insert. 

Ideally I would like to be able to do several thousand randomizations over
my data, and at 10 mins a pop that would take all week. (assuming the
query was about to finish when I killed it after 10 mins.)

Is their a faster way to do this randomization in SQL? Am I doing
something really dumb that was never intended by Shawn?

I can easily get the data I need with a quick step into perl, but it would
be really neat if I could do all this in MySQL.

I can imagine a general way to create 'random' joins (over scrambled
data) would have some interesting applications.

Dan.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to