Re: SCRAMBLE(A,B) (was UDF:Request).

2005-05-24 Thread Dan Bolser
On Mon, 11 Apr 2005 [EMAIL PROTECTED] wrote:

>Dan Bolser <[EMAIL PROTECTED]> wrote on 04/11/2005 11:50:31 AM:
>
>> On Mon, 11 Apr 2005, Dan Bolser wrote:
>> 
>> >
>> >Requirement: 
>> >
>> >Given two columns of a table (Column1 and Column2) of length x, return 
>two
>> >columns (Column1_Scram and Column2_Scram) such that the distributions 
>of
>> >values in Column1 and Column2 are preserved in Column1_Scram and
>> >Column2_Scram, but the pairs of values are randomized.
>> >
>> >
>> >Solution suggested by Shawn Green: 
>> >
>> >Create a table with two columns, and populate this table with random 
>pairs
>> >of primary keys picked from the original table. Additionally, allow no
>> >duplicate primary keys within either column. Select x rows from this
>> >table, and link both primary keys (the primary key pair) back to the
>> >original table to get the appropriate number of randomized pairs of
>> >Column1 and Column2.
>> >
>> >He suggests doing the above like this (more or less):
>> >
>> >OriginalTable
>> >PK   A   B
>> >1   a   c
>> >2   a   d
>> >3   b   e
>> >...
>> >
>> >CREATE TEMPORARY TABLE IntermediateTable (
>> >  PK1  INT NOT NULL,
>> >  ACHAR(1) NOT NULL,
>> >  PK2  INT NOT NULL,
>> >  BCHAR(1) NOT NULL,
>> >  #
>> >  UNIQUE INDEX (PK1,A),
>> >  UNIQUE INDEX (PK2,B)
>> >);
>> >
>> >INSERT IGNORE INTO IntermediateTable
>> >SELECT 
>> >  x.PK, x.A,
>> >  y.PK, y.B
>> >FROM 
>> >  OriginalTable x,
>> >  OriginalTable y
>> >ORDER BY
>> >  RAND();
>> >
>> >SELECT 
>> >  x.A,
>> >  y.B
>> >FROM
>> >  IntermediateTable 
>> >INNER JOIN 
>> >  OriginalTable x ON (PK1 = x.PK) INNER JOIN
>> >  OriginalTable y ON (PK2 = y.PK)
>> >LIMIT 
>> >  the_length_of_OriginalTable;
>> >
>> >
>> >The problem with this solution: 
>> >
>> >Its too slow on reasonable sized tables! 
>> 
>> 
>> Their is also a problem with the way RAND() works...
>> 
>> SELECT
>>   x.PK, x.A,
>>   y.PK, y.B
>> FROM
>>   OriginalTable x,
>>   OriginalTable y
>> ORDER BY
>>   RAND()
>> LIMIT 
>>   1;
>> 
>> This takes soo long to pick a random row. Some cleaver 'LIMIT'
>> optimization could pick a results set almost instantly, instead of 
>taking
>> in excess of half an hour with ~50,000 rows.
>> 
>> 
>> 
>
>Let's try this. I will assume, because you used the PK hack, you have 
>duplicate values in at least one of your sets. Let's cure the Rand() speed 
>issue by adding a column to Original Table to hold a random number and 
>eliminate the lookup problem. Since integer math is much faster than 
>floating point math, we will set up this field as an integer field and 
>fill it appropriately
>
>ALTER TABLE OriginalTable ADD COLUMN RandomKey INT UNSIGNED;
>
>UPDATE OriginalTable SET RandomKey = RAND()*200;
>
>Let's also modify IntermediateTable like this:
>
>DROP TABLE IntermediateTable;
>
>CREATE TABLE FirstColumn
>id INT auto_increment
>, a char(1)
>, PRIMARY KEY (id) 
>);
>
>CREATE TABLE SecondColumn
>id INT auto_increment
>, b char(1)
>, PRIMARY KEY (id)
>);
>
>And populate the new tables:
>INSERT FirstColumn (a)
>SELECT a
>FROM OriginalTable
>ORDER BY PK1;
>
>INSERT SecondColumn (b)
>SELECT b
>FROM OriginalTable
>ORDER BY RandomKey;
>
>Then get your randomized (A,B) pairs this way:
>
>SELECT x.A, y.B
>FROM FirstColumn x
>INNER JOIN SecondColumn y
>on x.id = y.id;
>
>This should be MUCH faster than 30 mins (I would guess on the order of 2 
>or 3 at most). FirstColumn gets filled with data in original order, 
>SecondColumn gets filled with data in random order (thanks to the random 
>value). By creating new tables to cache those values we create two new 
>contiguous auto_increment runs (this way you can analyze subsets of your 
>original data and not need to worry about mismatching on the final INNER 
>JOIN).
>
>On the next pass, Re-run the UPDATE to assign new RAND() values to your 
>data. Do not empty or refill FirstColum. Execute a "TRUNCATE TABLE 
>SecondColumn;" then refill it (INSERT SecondColumn...) and repeat the 
>final query. 
>
>HTH!!


I finally got round to trying this out - It works really well!

I did the following...

DROP   TABLE IF EXISTS rand_a;
CREATE TABLE   rand_a (
  PKMEDIUMINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  A_id  INT   UNSIGNED NOT NULL,
  #
  INDEX (A_id)
);

DROP   TABLE IF EXISTS rand_b;
CREATE TABLE   rand_b (
  PKMEDIUMINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
  B_id  INT   UNSIGNED NOT NULL,
  #
  INDEX (b_id)
);

INSERT INTO rand_a (A_id)
#
SELECT A_id FROM x ORDER BY RAND();

INSERT INTO rand_b (B_id)
#
SELECT B_id FROM x ORDER BY RAND();


I then get the full list of random pairs (with the same marginal
distribution as before) with...

SELECT *
  FROM rand_a 
INNER JOIN rand_b 
 USING (PK);


I chekcked my marginals like this...

SELECT COUNT(*), COUNT(DISTINCT A_id), COUNT(DISTINCT B_id)
  FROM nrints x;

SELECT COUNT(*), 

Re: SCRAMBLE(A,B) (was UDF:Request).

2005-04-11 Thread SGreen
Dan Bolser <[EMAIL PROTECTED]> wrote on 04/11/2005 11:50:31 AM:

> On Mon, 11 Apr 2005, Dan Bolser wrote:
> 
> >
> >Requirement: 
> >
> >Given two columns of a table (Column1 and Column2) of length x, return 
two
> >columns (Column1_Scram and Column2_Scram) such that the distributions 
of
> >values in Column1 and Column2 are preserved in Column1_Scram and
> >Column2_Scram, but the pairs of values are randomized.
> >
> >
> >Solution suggested by Shawn Green: 
> >
> >Create a table with two columns, and populate this table with random 
pairs
> >of primary keys picked from the original table. Additionally, allow no
> >duplicate primary keys within either column. Select x rows from this
> >table, and link both primary keys (the primary key pair) back to the
> >original table to get the appropriate number of randomized pairs of
> >Column1 and Column2.
> >
> >He suggests doing the above like this (more or less):
> >
> >OriginalTable
> >PK   A   B
> >1   a   c
> >2   a   d
> >3   b   e
> >...
> >
> >CREATE TEMPORARY TABLE IntermediateTable (
> >  PK1  INT NOT NULL,
> >  ACHAR(1) NOT NULL,
> >  PK2  INT NOT NULL,
> >  BCHAR(1) NOT NULL,
> >  #
> >  UNIQUE INDEX (PK1,A),
> >  UNIQUE INDEX (PK2,B)
> >);
> >
> >INSERT IGNORE INTO IntermediateTable
> >SELECT 
> >  x.PK, x.A,
> >  y.PK, y.B
> >FROM 
> >  OriginalTable x,
> >  OriginalTable y
> >ORDER BY
> >  RAND();
> >
> >SELECT 
> >  x.A,
> >  y.B
> >FROM
> >  IntermediateTable 
> >INNER JOIN 
> >  OriginalTable x ON (PK1 = x.PK) INNER JOIN
> >  OriginalTable y ON (PK2 = y.PK)
> >LIMIT 
> >  the_length_of_OriginalTable;
> >
> >
> >The problem with this solution: 
> >
> >Its too slow on reasonable sized tables! 
> 
> 
> Their is also a problem with the way RAND() works...
> 
> SELECT
>   x.PK, x.A,
>   y.PK, y.B
> FROM
>   OriginalTable x,
>   OriginalTable y
> ORDER BY
>   RAND()
> LIMIT 
>   1;
> 
> This takes soo long to pick a random row. Some cleaver 'LIMIT'
> optimization could pick a results set almost instantly, instead of 
taking
> in excess of half an hour with ~50,000 rows.
> 
> 
> 

Let's try this. I will assume, because you used the PK hack, you have 
duplicate values in at least one of your sets. Let's cure the Rand() speed 
issue by adding a column to Original Table to hold a random number and 
eliminate the lookup problem. Since integer math is much faster than 
floating point math, we will set up this field as an integer field and 
fill it appropriately

ALTER TABLE OriginalTable ADD COLUMN RandomKey INT UNSIGNED;

UPDATE OriginalTable SET RandomKey = RAND()*200;

Let's also modify IntermediateTable like this:

DROP TABLE IntermediateTable;

CREATE TABLE FirstColumn
id INT auto_increment
, a char(1)
, PRIMARY KEY (id) 
);

CREATE TABLE SecondColumn
id INT auto_increment
, b char(1)
, PRIMARY KEY (id)
);

And populate the new tables:
INSERT FirstColumn (a)
SELECT a
FROM OriginalTable
ORDER BY PK1;

INSERT SecondColumn (b)
SELECT b
FROM OriginalTable
ORDER BY RandomKey;

Then get your randomized (A,B) pairs this way:

SELECT x.A, y.B
FROM FirstColumn x
INNER JOIN SecondColumn y
on x.id = y.id;

This should be MUCH faster than 30 mins (I would guess on the order of 2 
or 3 at most). FirstColumn gets filled with data in original order, 
SecondColumn gets filled with data in random order (thanks to the random 
value). By creating new tables to cache those values we create two new 
contiguous auto_increment runs (this way you can analyze subsets of your 
original data and not need to worry about mismatching on the final INNER 
JOIN).

On the next pass, Re-run the UPDATE to assign new RAND() values to your 
data. Do not empty or refill FirstColum. Execute a "TRUNCATE TABLE 
SecondColumn;" then refill it (INSERT SecondColumn...) and repeat the 
final query. 

HTH!!

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine


Re: SCRAMBLE(A,B) (was UDF:Request).

2005-04-11 Thread Dan Bolser
On Mon, 11 Apr 2005, Dan Bolser wrote:

>
>Requirement: 
>
>Given two columns of a table (Column1 and Column2) of length x, return two
>columns (Column1_Scram and Column2_Scram) such that the distributions of
>values in Column1 and Column2 are preserved in Column1_Scram and
>Column2_Scram, but the pairs of values are randomized.
>
>
>Solution suggested by Shawn Green: 
>
>Create a table with two columns, and populate this table with random pairs
>of primary keys picked from the original table. Additionally, allow no
>duplicate primary keys within either column. Select x rows from this
>table, and link both primary keys (the primary key pair) back to the
>original table to get the appropriate number of randomized pairs of
>Column1 and Column2.
>
>He suggests doing the above like this (more or less):
>
>OriginalTable
>PK A   B
>1  a   c
>2  a   d
>3  b   e
>...
>
>CREATE TEMPORARY TABLE IntermediateTable (
>  PK1  INT NOT NULL,
>  ACHAR(1) NOT NULL,
>  PK2  INT NOT NULL,
>  BCHAR(1) NOT NULL,
>  #
>  UNIQUE INDEX (PK1,A),
>  UNIQUE INDEX (PK2,B)
>);
>
>INSERT IGNORE INTO IntermediateTable
>SELECT 
>  x.PK, x.A,
>  y.PK, y.B
>FROM 
>  OriginalTable x,
>  OriginalTable y
>ORDER BY
>  RAND();
>
>SELECT 
>  x.A,
>  y.B
>FROM
>  IntermediateTable 
>INNER JOIN 
>  OriginalTable x ON (PK1 = x.PK) INNER JOIN
>  OriginalTable y ON (PK2 = y.PK)
>LIMIT 
>  the_length_of_OriginalTable;
>
>
>The problem with this solution: 
>
>Its too slow on reasonable sized tables! 


Their is also a problem with the way RAND() works...

SELECT
  x.PK, x.A,
  y.PK, y.B
FROM
  OriginalTable x,
  OriginalTable y
ORDER BY
  RAND()
LIMIT 
  1;

This takes soo long to pick a random row. Some cleaver 'LIMIT'
optimization could pick a results set almost instantly, instead of taking
in excess of half an hour with ~50,000 rows.





-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



SCRAMBLE(A,B) (was UDF:Request).

2005-04-11 Thread Dan Bolser

Requirement: 

Given two columns of a table (Column1 and Column2) of length x, return two
columns (Column1_Scram and Column2_Scram) such that the distributions of
values in Column1 and Column2 are preserved in Column1_Scram and
Column2_Scram, but the pairs of values are randomized.


Solution suggested by Shawn Green: 

Create a table with two columns, and populate this table with random pairs
of primary keys picked from the original table. Additionally, allow no
duplicate primary keys within either column. Select x rows from this
table, and link both primary keys (the primary key pair) back to the
original table to get the appropriate number of randomized pairs of
Column1 and Column2.

He suggests doing the above like this (more or less):

OriginalTable
PK  A   B
1   a   c
2   a   d
3   b   e
...

CREATE TEMPORARY TABLE IntermediateTable (
  PK1  INT NOT NULL,
  ACHAR(1) NOT NULL,
  PK2  INT NOT NULL,
  BCHAR(1) NOT NULL,
  #
  UNIQUE INDEX (PK1,A),
  UNIQUE INDEX (PK2,B)
);

INSERT IGNORE INTO IntermediateTable
SELECT 
  x.PK, x.A,
  y.PK, y.B
FROM 
  OriginalTable x,
  OriginalTable y
ORDER BY
  RAND();

SELECT 
  x.A,
  y.B
FROM
  IntermediateTable 
INNER JOIN 
  OriginalTable x ON (PK1 = x.PK) INNER JOIN
  OriginalTable y ON (PK2 = y.PK)
LIMIT 
  the_length_of_OriginalTable;


The problem with this solution: 

Its too slow on reasonable sized tables! I never get past the second step
with my data after 10 minutes!

I have 52699 rows in my 'OriginalTable' leading to ~2.7 billion checks
when inserting into the IntermediateTable... or rather 5.4 billion, as I
guess it has to check both rows for the UNIQUE key constraint on every
attempted insert. 

Ideally I would like to be able to do several thousand randomizations over
my data, and at 10 mins a pop that would take all week. (assuming the
query was about to finish when I killed it after 10 mins.)

Is their a faster way to do this randomization in SQL? Am I doing
something really dumb that was never intended by Shawn?

I can easily get the data I need with a quick step into perl, but it would
be really neat if I could do all this in MySQL.

I can imagine a general way to create 'random' joins (over scrambled
data) would have some interesting applications.

Dan.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]