[jira] [Commented] (MAHOUT-846) Improve Scalability of Gaussian Cluster For Wide Vectors

2011-10-19 Thread Hudson (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131309#comment-13131309 ] Hudson commented on MAHOUT-846: --- Integrated in Mahout-Quality #1105 (See [https://builds.ap

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Ted Dunning
The distribution of the dot product of two randomly chosen, uniformly distributed unit vectors is roughly normally distributed with a standard deviation that declines with increasing dimension roughly with your observed sqrt scaling factor. In fact, it is just this scaling property that makes the
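
The scaling described in this message can be checked by sampling random unit vectors and measuring the spread of their dot products. The sketch below is illustrative only: the function names, sample count, and seed are assumptions, and directions are drawn uniformly on the unit sphere by normalizing Gaussian vectors.

```python
import math
import random
import statistics

def random_unit_vector(n, rng):
    """Draw a direction uniformly on the unit sphere in n dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot_product_std(n, samples=2000, seed=0):
    """Standard deviation of u.v over sampled pairs of random unit vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        u = random_unit_vector(n, rng)
        v = random_unit_vector(n, rng)
        dots.append(sum(a * b for a, b in zip(u, v)))
    return statistics.pstdev(dots)

# Print the measured spread next to 1/sqrt(n) for comparison.
for n in (10, 100):
    print(n, round(dot_product_std(n), 4), round(1 / math.sqrt(n), 4))
```

Under these assumptions the measured spread should track 1/sqrt(n) closely, which is the sqrt scaling the message refers to.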

Re: [jira] [Commented] (MAHOUT-845) Make cluster top terms code more reusable

2011-10-19 Thread Lance Norskog
What are some other cases where this would be useful? Lance On Wed, Oct 19, 2011 at 11:03 AM, Frank Scholten (Commented) (JIRA) < j...@apache.org> wrote: > >[ > https://issues.apache.org/jira/browse/MAHOUT-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedComme

[jira] [Commented] (MAHOUT-846) Improve Scalability of Gaussian Cluster For Wide Vectors

2011-10-19 Thread Hudson (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131194#comment-13131194 ] Hudson commented on MAHOUT-846: --- Integrated in Mahout-Quality #1104 (See [https://builds.ap

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Sean Owen
What's numColumns -- is that the total number of possible dimensions? On Wed, Oct 19, 2011 at 10:10 PM, Sebastian Schelter wrote: > Seems to be the wrong way around indeed. I don't think the > normalization can be used in the distributed implementation anymore as > the number of overlapping dimen

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Sebastian Schelter
Seems to be the wrong way around indeed. I don't think the normalization can be used in the distributed implementation anymore as the number of overlapping dimensions is not known anymore (this information is lost because we only have the dot product between the vectors and their squares at hand

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Sean Owen
Right, that's not quite the issue. It's that some comparisons are made in 2-space, some in 10-space, etc. It would be nice to have some idea that a distance in 2-space is "about as meaningfully far" as some other distance in 10-space. I'm trying to find the order of that correcting factor and it se

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Ted Dunning
None of this actually applies because real data are not uniformly distributed (not even close). Do the sampling on your own data and pick a good guess from that. On Wed, Oct 19, 2011 at 11:40 AM, Sean Owen wrote: > Ah, I'm looking for the distance between points within, rather than > on, the hy

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Sean Owen
Sebastian: I had a look at the distributed Euclidean similarity and it computes similarity as ... 1 - 1 / (1+d). This is the wrong way around right? Higher distance moves the value to 1. For consistency, I'm looking to stick with a 1/(1+d) expression for now (unless someone tells me that's just t
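
To see why 1 - 1 / (1+d) runs the wrong way for a similarity, it helps to evaluate both expressions at a few distances. This is a minimal sketch with hypothetical function names; it is not the code from the distributed implementation.

```python
def inverted_similarity(d):
    """The expression quoted in the message: 1 - 1/(1 + d)."""
    return 1.0 - 1.0 / (1.0 + d)

def similarity(d):
    """The 1/(1 + d) form: equals 1 at distance 0 and shrinks as d grows."""
    return 1.0 / (1.0 + d)

# Print both values side by side for a near, middling, and far pair.
for d in (0.0, 1.0, 9.0):
    print(d, inverted_similarity(d), similarity(d))
```

The first expression grows toward 1 as the distance grows, so identical vectors would score 0; the second behaves as a similarity should.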

[jira] [Commented] (MAHOUT-845) Make cluster top terms code more reusable

2011-10-19 Thread Frank Scholten (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130827#comment-13130827 ] Frank Scholten commented on MAHOUT-845: --- +1 Adding that method to Vector. This way w

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Justin Cranshaw
Oops, that's not quite right. I guess you're looking for the expected distance between random continuous points in [0,1]^n not {0,1}^n. Sqrt(n/2) should be an upper bound in any case. If you figure out the expected squared distance of any single coordinate you can use the analysis in my last
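
The per-coordinate step mentioned in this message can be worked out directly. As a sketch, assume each X_i and Y_i is independent and uniform on [0,1]; that is a modelling assumption, not something the thread fixes.

```latex
\mathbb{E}\left[(X_i - Y_i)^2\right]
  = \mathbb{E}[X_i^2] - 2\,\mathbb{E}[X_i]\,\mathbb{E}[Y_i] + \mathbb{E}[Y_i^2]
  = \tfrac{1}{3} - \tfrac{1}{2} + \tfrac{1}{3}
  = \tfrac{1}{6},
\qquad
\mathbb{E}\left[d(X,Y)^2\right]
  = \sum_{i=1}^{n} \mathbb{E}\left[(X_i - Y_i)^2\right]
  = \frac{n}{6},
\qquad
\mathbb{E}\left[d(X,Y)\right]
  \le \sqrt{\mathbb{E}\left[d(X,Y)^2\right]}
  = \sqrt{n/6}.
```

The last step is Jensen's inequality, so under this assumption sqrt(n/6) bounds the expected distance from above and is itself below sqrt(n/2).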

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Sean Owen
Ah, I'm looking for the distance between points within, rather than on, the hypercube. (Think of it as random rating vectors, in the range 0..1, across all movies. They're not binary ratings but ratings from 0 to 1.) On Wed, Oct 19, 2011 at 6:30 PM, Justin Cranshaw wrote: > I think the analytic a

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Justin Cranshaw
I think the analytic answer should be sqrt(n/2). So let's suppose X and Y are random points in the n dimensional hypercube {0,1}^n. Let Z_i be an indicator variable that is 1 if X_i != Y_i and 0 otherwise. Then d(X,Y)^2 =sum (X_i - Y_i)^2 = sum( Z_i). Then the expected squared distance is E

[jira] [Updated] (MAHOUT-846) Improve Scalability of Gaussian Cluster For Wide Vectors

2011-10-19 Thread Jeff Eastman (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Eastman updated MAHOUT-846: Description: The pdf() implementation in GaussianCluster is pretty lame. It is computing a running

[jira] [Created] (MAHOUT-846) Improve Scalability of Gaussian Cluster For Wide Vectors

2011-10-19 Thread Jeff Eastman (Created) (JIRA)
Improve Scalability of Gaussian Cluster For Wide Vectors Key: MAHOUT-846 URL: https://issues.apache.org/jira/browse/MAHOUT-846 Project: Mahout Issue Type: Improvement Affects Versi

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Federico Castanedo
what about this: http://www.wisdom.weizmann.ac.il/~oded/p_aver-metric.html HTW 2011/10/19 Sean Owen > (And when I do the simulation correctly, I get a better answer: sqrt(n/6) ) > > On Wed, Oct 19, 2011 at 5:21 PM, Sean Owen wrote: > > Hmm. Not knowing the analytics answer I just wrote a simu

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Sean Owen
(And when I do the simulation correctly, I get a better answer: sqrt(n/6) ) On Wed, Oct 19, 2011 at 5:21 PM, Sean Owen wrote: > Hmm. Not knowing the analytics answer I just wrote a simulation. > sqrt(n / 3) looks like a shockingly good fit for the average distance > between two randomly chosen po
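
A simulation along the lines described in this message might look like the sketch below. It is not the original code: the function name, sample count, and seed are assumptions, and points are drawn uniformly from [0,1]^n.

```python
import math
import random

def mean_distance(n, samples=5000, seed=0):
    """Monte Carlo estimate of the mean Euclidean distance in [0,1]^n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        # Squared distance between two independent uniform points.
        squared = sum((rng.random() - rng.random()) ** 2 for _ in range(n))
        total += math.sqrt(squared)
    return total / samples

# Print the estimate next to sqrt(n/6) for a few dimensions.
for n in (2, 10, 100):
    print(n, round(mean_distance(n), 4), round(math.sqrt(n / 6), 4))
```

The estimate should sit slightly below sqrt(n/6), with the gap most visible for small n and shrinking in relative terms as n grows.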

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Sean Owen
Hmm. Not knowing the analytic answer I just wrote a simulation. sqrt(n / 3) looks like a shockingly good fit for the average distance between two randomly chosen points in the n-dimensional hypercube. Accident? error? known result? Seems clear that something like sqrt(n) would be a better factor

[jira] [Commented] (MAHOUT-845) Make cluster top terms code more reusable

2011-10-19 Thread Jeff Eastman (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130725#comment-13130725 ] Jeff Eastman commented on MAHOUT-845: - +1 We could easily add a static method to Abstr

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Justin Cranshaw
I've most often seen something like exp(-d(x,y)) for converting distance to similarity. Unlike 1/(1+d) this has exponential decay in distance, which is usually more desirable. There is a similar kludge to what you describe, where people use exp(-d/h) for some bandwidth h. I'm not sure there's
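
The two conversions compared in this message can be put side by side. This is a minimal sketch with hypothetical function names; the default bandwidth h = 1.0 is an arbitrary assumption.

```python
import math

def reciprocal_similarity(d):
    """1/(1 + d): decays only polynomially with distance."""
    return 1.0 / (1.0 + d)

def exponential_similarity(d, h=1.0):
    """exp(-d/h): exponential decay, with bandwidth h setting the scale."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    return math.exp(-d / h)

# Print both similarities at increasing distances.
for d in (0.0, 1.0, 5.0, 10.0):
    print(d, reciprocal_similarity(d), exponential_similarity(d))
```

Both functions return 1 at distance 0; the exponential form drops off much faster for large d, and a larger h stretches the range of distances over which it remains distinguishable from 0.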

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Sean Owen
There's already a cosine distance measure implementation available; this concerns the right-est way to implement a Euclidean distance-based measure. On Wed, Oct 19, 2011 at 4:12 PM, Christian Prokopp wrote: > Do you have a particular reason for not going with cosine? > > On 19 October 2011 15:51,

Re: Average distance between two points in unit hypercube?

2011-10-19 Thread Christian Prokopp
Do you have a particular reason for not going with cosine? On 19 October 2011 15:51, Sean Owen wrote: > Interesting question came up recently about using the Euclidean > distance d between two vectors as a notion of their similarity. > > You can use 1 / (1 + d), which mostly works, except that i

Average distance between two points in unit hypercube?

2011-10-19 Thread Sean Owen
Interesting question came up recently about using the Euclidean distance d between two vectors as a notion of their similarity. You can use 1 / (1 + d), which mostly works, except that it 'penalizes' larger vectors, who have more dimensions along which to differ. This is bad when those vectors are
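
The penalty on wider vectors is easy to reproduce. The sketch below is illustrative only: the function names are hypothetical, and dividing the distance by sqrt(n) is one assumed way to compensate, not a conclusion taken from this thread.

```python
import math

def euclidean_distance(x, y):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def raw_similarity(x, y):
    """1/(1 + d) with no correction for dimensionality."""
    return 1.0 / (1.0 + euclidean_distance(x, y))

def scaled_similarity(x, y):
    """Hypothetical variant: divide the distance by sqrt(n) first."""
    n = len(x)
    return 1.0 / (1.0 + euclidean_distance(x, y) / math.sqrt(n))

# Same per-coordinate difference (0.5) in 2 and in 200 dimensions.
for n in (2, 200):
    x, y = [0.0] * n, [0.5] * n
    print(n, raw_similarity(x, y), scaled_similarity(x, y))
```

With the same per-coordinate difference, the uncorrected similarity falls as the dimension grows, while the scaled variant stays constant in this constructed case.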

[jira] [Created] (MAHOUT-845) Make cluster top terms code more reusable

2011-10-19 Thread Frank Scholten (Created) (JIRA)
Make cluster top terms code more reusable - Key: MAHOUT-845 URL: https://issues.apache.org/jira/browse/MAHOUT-845 Project: Mahout Issue Type: Improvement Components: Clustering Re