[Boston.pm] combinations

2003-08-04 Thread David Byrne
I am fairly new to Perl and haven't approached a script
this complex or computation this intensive.  So I
would certainly appreciate any advice.
 
I have successfully created a hash of arrays
equivalent to a 122 x 6152 matrix that I want to run
in 'pairwise combinations' and execute the 'sum of the
difference squares' for each combination.
 
In other words:
rows:  y1...y122
columns:   x1...x6152
 
so...
comb(y1,y2): 
{( y1[x1] - y2[x1] ) ^2 + ( y1[x2] - y2[x2] ) ^2 + ...
+ ( y1[x122] - y2[x122] ) ^2};
 
comb(y1,y3): 
{( y1[x1] - y3[x1] ) ^2 + ( y1[x2] - y3[x2] ) ^2 + ...
+ ( y1[x122] - y3[x122] ) ^2};
.
.
.
comb(y1,y6152)
comb(y2,y3)
.
.
comb(y2,y6152)
comb(y3,y4)
.
.
etc.
 
This is going to be very large.  According to the
combinations formula (nCk, n=6152, k=2), the output
will be a hash (with, for example, 'y1y2' key and
'd^2' value) of about 19 million records.  
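
As a quick check of the "about 19 million" figure, here is a minimal sketch of the nCk arithmetic for k=2. The helper name pair_count is made up for illustration and is not part of the original script.

```perl
use strict;
use warnings;

# Number of unordered pairs that can be drawn from $n items: n * (n - 1) / 2.
# pair_count is a hypothetical helper name used only for this illustration.
sub pair_count {
    my ($n) = @_;
    return $n * ($n - 1) / 2;
}

printf "%d pairs\n", pair_count(6152);   # prints "18920476 pairs"
```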

I think my next step is to create a combinations
formula, but I'm having problems doing so.
 
Thank you in advance,
David

___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] damian talk to boston.pm in sept.

2003-08-04 Thread Mikey Smelto


Who is this Smelto guy anyway?

On Tue, 29 Jul 2003, Uri Guttman wrote:

   i will get the url of his free talks and we will do the usual round of
   voting for your favorites.

*Please* can we *finally* have the Perligata talk? :)

   so vote early and often for your favorite. if a new set of talks is
   listed, use that instead.

Early & often?

  % cat ~/bin/perligata
  #!/bin/sh
  echo yet another vote for perligata | \
    mail -s 'damian talk vote' [EMAIL PROTECTED]
  % crontab -l | grep perligata
  00,15,30,45  *  *  *  *  /Users/cdevers/bin/perligata
...on second thought... Smelto should run this.

*ahem*

--
Chris Devers [EMAIL PROTECTED]
http://devers.homeip.net:8080/
POM, n. \pronounced P-O-M or pom (esp. Australian)\ [Acronym for Phase
  Of the Moon.]
Chiefly, as POM-dependent, flaky, unreliable. See also PHASE.
-- from _The Computer Contradictionary_, Stan Kelly-Bootle, 1995


Re: [Boston.pm] combinations

2003-08-04 Thread Kenneth Graves
   Date: Mon, 4 Aug 2003 13:53:52 -0700 (PDT)
   From: David Byrne [EMAIL PROTECTED]

   I am fairly new to Perl and haven't approached a script
   this complex or computation this intensive.  So I
   would certainly appreciate any advice.

   I have successfully created a hash of arrays
   equivalent to a 122 x 6152 matrix that I want to run
   in 'pairwise combinations' and execute the 'sum of the
   difference squares' for each combination.

   In other words:
   rows:  y1...y122
   columns:   x1...x6152

This is a single large matrix?  Sparse or dense?
If sparse, a hash of hashes is probably the memory efficient way to store it:
$matrix{y32}{x53} = value for row 32, column 53;
If dense, you could use an array of arrays:
$matrix[32][53] = value for row 32, column 53;
Or you could investigate PDL (Piddle, Perl Data Language).
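
For the sparse case, a minimal sketch of the sum of squared differences over two hash-of-hashes rows might look like the following. It assumes a missing column counts as 0, and the name sparse_ssd and the sample values are made up for illustration.

```perl
use strict;
use warnings;

# Sum of squared differences between two sparse rows stored as hash refs.
# Assumption: a column missing from a row counts as 0.
# sparse_ssd is a hypothetical name used only for this illustration.
sub sparse_ssd {
    my ($row_a, $row_b) = @_;
    # Union of the column names present in either row.
    my %cols = map { $_ => 1 } (keys %$row_a, keys %$row_b);
    my $sum = 0;
    for my $col (keys %cols) {
        my $va = exists $row_a->{$col} ? $row_a->{$col} : 0;
        my $vb = exists $row_b->{$col} ? $row_b->{$col} : 0;
        $sum += ($va - $vb) ** 2;
    }
    return $sum;
}

# Made-up sample data in the $matrix{row}{column} layout.
my %matrix = (
    y1 => { x1 => 3, x53 => 1 },
    y2 => { x1 => 1, x7  => 2 },
);
print sparse_ssd($matrix{y1}, $matrix{y2}), "\n";   # 4 + 1 + 4 = 9
```

Only columns that appear in at least one of the two rows are visited, which is where the saving over a dense loop comes from.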

   so...
   comb(y1,y2): 
   {( y1[x1] - y2[x1] ) ^2 + ( y1[x2] - y2[x2] ) ^2 + ...
   + ( y1[x122] - y2[x122] ) ^2};

You've reversed x and y compared to above.

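# array of arrays version
# rows are indexed 1..6152 and columns 1..122; index 0 is left unused
my @comb;
for my $i (1..6152) {
    for my $j ($i+1 .. 6152) {
        # sum of squared differences between rows $i and $j
        $comb[$i][$j] = 0;
        $comb[$i][$j] += ($matrix[$i][$_] - $matrix[$j][$_]) ** 2
            for (1..122);
    }
}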

   This is going to be very large.  According to the
   combinations formula (nCk, n=6152, k=2), the output
   will be a hash (with, for example, 'y1y2' key and
   'd^2' value) of about 19 million records.  

Yes.  PDL is more memory efficient.  Or just run it on a machine that
has lots of RAM+swap.  Or use various techniques to move most of the
storage out of memory into files or a database.

(Simplest example: instead of creating a $comb AoA above, just create
a $comb scalar each round, then write it out:
print "comb of rows $i and $j is $comb\n";
)
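
Spelled out, the write-as-you-go idea could look roughly like this. It is a sketch only: the function name write_pairwise_ssd, the 0-based indexing, and the tiny sample matrix are assumptions for illustration.

```perl
use strict;
use warnings;

# Write one line per row pair instead of keeping every result in memory.
# $rows is a reference to an array of equal-length array refs (0-based here).
# write_pairwise_ssd is a hypothetical name used only for this illustration.
sub write_pairwise_ssd {
    my ($rows, $out) = @_;
    my $written = 0;
    for my $i (0 .. $#$rows) {
        for my $j ($i + 1 .. $#$rows) {
            # One scalar per pair, discarded after it is printed.
            my $comb = 0;
            $comb += ($rows->[$i][$_] - $rows->[$j][$_]) ** 2
                for 0 .. $#{ $rows->[$i] };
            print {$out} "comb of rows $i and $j is $comb\n";
            $written++;
        }
    }
    return $written;
}

# Made-up 3 x 3 sample; the real data would be 6152 rows of 122 values.
my @matrix = ([1, 2, 3], [2, 4, 6], [0, 0, 0]);
write_pairwise_ssd(\@matrix, \*STDOUT);
```

Passing a filehandle opened on a file instead of \*STDOUT sends the results to disk, so memory use stays at the size of the input matrix.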

--kag


[Boston.pm] Postal address De-duping

2003-08-04 Thread Joel Gwynn
Hey, all.  We do lots of (snail) mailings, and we're looking for a fast,
customizable de-duping solution.  We're currently taking a look at
doubletake from http://peoplesmith.com/, which is not too expensive, but
I was thinking there might be some perl stuff out there, given perl's
text-processing powers.




Re: [Boston.pm] Postal address De-duping

2003-08-04 Thread Jon Orwant
On Monday, August 4, 2003, at 05:12  PM, Joel Gwynn wrote:

   Hey, all.  We do lots of (snail) mailings, and we're looking for a fast,
   customizable de-duping solution.  We're currently taking a look at
   doubletake from http://peoplesmith.com/, which is not too expensive, but
   I was thinking there might be some perl stuff out there, given perl's
   text-processing powers.

There's a wee script I wrote for TPJ a while back that scrapes the U.S.
Postal Service's address canonicalizer.  The script is on tpj.com; look
under Archives for the article called "Five Quick Hacks".  The
canonicalizer (well, they call it a "zip code locator" or something
like that) will transform variants on the same address into the One
True Address that the USPS recognizes, so de-duping then becomes a
matter of simple string matching.

Won't help you for foreign addresses, obviously.
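
Once every address has a canonical form, the string-matching step can be sketched as below. The canonicalize function here is a crude, hypothetical stand-in (upper-casing and stripping punctuation), not the USPS lookup or the TPJ script, and the sample addresses are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real canonicalizer such as the USPS lookup:
# upper-cases, drops periods and commas, and collapses whitespace.
sub canonicalize {
    my ($address) = @_;
    my $canon = uc $address;
    $canon =~ s/[.,]//g;     # drop periods and commas
    $canon =~ s/\s+/ /g;     # collapse runs of whitespace
    $canon =~ s/^ | $//g;    # trim a leading or trailing space
    return $canon;
}

# Keep the first address seen for each canonical form.
sub dedupe {
    my @addresses = @_;
    my %seen;
    return grep { !$seen{ canonicalize($_) }++ } @addresses;
}

# Made-up sample addresses; the first two differ only in formatting.
my @unique = dedupe(
    '12 Main St., Boston, MA',
    '12  main st boston ma',
    '99 Elm Ave, Salem, MA',
);
print "$_\n" for @unique;
```

The %seen hash does the de-duping; how well it works depends entirely on how good the canonical form is.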

-Jon
