[
https://issues.apache.org/jira/browse/MAHOUT-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131309#comment-13131309
]
Hudson commented on MAHOUT-846:
---
Integrated in Mahout-Quality #1105 (See
[https://builds.ap
The distribution of the dot product of two randomly chosen, uniformly
distributed unit vectors is roughly normal, with a standard
deviation that declines with increasing dimension, roughly matching your
observed sqrt scaling factor.
In fact, it is just this scaling property that makes the
What are some other cases where this would be useful?
Lance
On Wed, Oct 19, 2011 at 11:03 AM, Frank Scholten (Commented) (JIRA) <
j...@apache.org> wrote:
>
>[
> https://issues.apache.org/jira/browse/MAHOUT-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedComme
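The dot-product claim above is easy to check numerically. A minimal sketch (NumPy; uniform unit vectors sampled as normalized Gaussians, which is a standard construction, not Mahout code):

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_vectors(n_dim, n_samples):
    """Random vectors uniform on the unit sphere (normalized Gaussians)."""
    v = rng.standard_normal((n_samples, n_dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

for n in (10, 100, 1000):
    a = unit_vectors(n, 20000)
    b = unit_vectors(n, 20000)
    dots = np.einsum('ij,ij->i', a, b)  # pairwise dot products
    # The dot products have mean ~0 and standard deviation ~1/sqrt(n),
    # i.e. the spread shrinks with the sqrt of the dimension.
    print(n, dots.std(), 1 / np.sqrt(n))
```

The observed standard deviation tracks 1/sqrt(n) closely, which is the sqrt scaling factor being discussed.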
[
https://issues.apache.org/jira/browse/MAHOUT-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131194#comment-13131194
]
Hudson commented on MAHOUT-846:
---
Integrated in Mahout-Quality #1104 (See
[https://builds.ap
What's numColumns -- is that the total number of possible dimensions?
On Wed, Oct 19, 2011 at 10:10 PM, Sebastian Schelter wrote:
> Seems to be the wrong way around indeed. I don't think the
> normalization can be used in the distributed implementation anymore as
> the number of overlapping dimen
Seems to be the wrong way around indeed. I don't think the
normalization can be used in the distributed implementation anymore, as
the number of overlapping dimensions is not known anymore (this
information is lost because we only have the dot product between the
vectors and their squares at hand
Right, that's not quite the issue. It's that some comparisons are made
in 2-space, some in 10-space, etc. It would be nice to have some idea
that a distance in 2-space is "about as meaningfully far" as some
other distance in 10-space. I'm trying to find the order of that
correcting factor and it se
None of this actually applies because real data are not uniformly
distributed (not even close). Do the sampling on your own data and pick a
good guess from that.
On Wed, Oct 19, 2011 at 11:40 AM, Sean Owen wrote:
> Ah, I'm looking for the distance between points within, rather than
> on, the hy
Sebastian: I had a look at the distributed Euclidean similarity and it
computes similarity as ...
1 - 1 / (1+d). This is the wrong way around, right? Higher distance
moves the value to 1.
For consistency, I'm looking to stick with a 1/(1+d) expression for
now (unless someone tells me that's just t
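The flipped behavior is easy to see side by side. A small sketch (plain Python, not the Mahout implementation) comparing 1/(1+d) with the reversed 1 - 1/(1+d):

```python
def similarity_right(d):
    """1/(1+d): identical vectors (d=0) map to 1, distant vectors toward 0."""
    return 1.0 / (1.0 + d)

def similarity_flipped(d):
    """1 - 1/(1+d): the reversed form, which rewards larger distances."""
    return 1.0 - 1.0 / (1.0 + d)

for d in (0.0, 1.0, 10.0):
    print(d, similarity_right(d), similarity_flipped(d))
# d=0  -> right: 1.0,   flipped: 0.0
# d=10 -> right: ~0.09, flipped: ~0.91
```

With the flipped form, two identical vectors score 0 and very distant ones approach 1, the opposite of what a similarity should do.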
[
https://issues.apache.org/jira/browse/MAHOUT-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130827#comment-13130827
]
Frank Scholten commented on MAHOUT-845:
---
+1 Adding that method to Vector. This way w
Oops, that's not quite right. I guess you're looking for the expected distance
between random continuous points in [0,1]^n not {0,1}^n. Sqrt(n/2) should be
an upper bound in any case. If you figure out the expected squared distance of
any single coordinate you can use the analysis in my last
Ah, I'm looking for the distance between points within, rather than
on, the hypercube. (Think of it as random rating vectors, in the range
0..1, across all movies. They're not binary ratings but ratings from 0
to 1.)
On Wed, Oct 19, 2011 at 6:30 PM, Justin Cranshaw wrote:
> I think the analytic a
I think the analytic answer should be sqrt(n/2).
So let's suppose X and Y are random points in the n dimensional hypercube
{0,1}^n. Let Z_i be an indicator variable that is 1 if X_i != Y_i and 0
otherwise. Then d(X,Y)^2 = sum (X_i - Y_i)^2 = sum(Z_i). Then the expected
squared distance is E
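The indicator-variable argument can be checked by sampling. Since E[Z_i] = 1/2, the expected squared distance is n/2 and the typical distance is about sqrt(n/2). A minimal simulation (NumPy, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials = 200, 20000
x = rng.integers(0, 2, size=(trials, n))   # random corners of {0,1}^n
y = rng.integers(0, 2, size=(trials, n))
d2 = ((x - y) ** 2).sum(axis=1)            # squared distance = sum of Z_i
# E[Z_i] = 1/2, so E[d^2] = n/2 and the typical distance is ~sqrt(n/2).
print(d2.mean(), n / 2)
print(np.sqrt(d2).mean(), np.sqrt(n / 2))
```

For large n the distance concentrates tightly around sqrt(n/2), so the mean distance and sqrt of the mean squared distance nearly coincide.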
[
https://issues.apache.org/jira/browse/MAHOUT-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeff Eastman updated MAHOUT-846:
Description:
The pdf() implementation in GaussianCluster is pretty lame. It is computing a
running
Improve Scalability of Gaussian Cluster For Wide Vectors
Key: MAHOUT-846
URL: https://issues.apache.org/jira/browse/MAHOUT-846
Project: Mahout
Issue Type: Improvement
Affects Versi
what about this:
http://www.wisdom.weizmann.ac.il/~oded/p_aver-metric.html
HTW
2011/10/19 Sean Owen
> (And when I do the simulation correctly, I get a better answer: sqrt(n/6) )
>
> On Wed, Oct 19, 2011 at 5:21 PM, Sean Owen wrote:
> > Hmm. Not knowing the analytic answer I just wrote a simu
(And when I do the simulation correctly, I get a better answer: sqrt(n/6) )
On Wed, Oct 19, 2011 at 5:21 PM, Sean Owen wrote:
> Hmm. Not knowing the analytic answer I just wrote a simulation.
> sqrt(n / 3) looks like a shockingly good fit for the average distance
> between two randomly chosen po
Hmm. Not knowing the analytic answer I just wrote a simulation.
sqrt(n / 3) looks like a shockingly good fit for the average distance
between two randomly chosen points in the n-dimensional hypercube.
Accident? Error? Known result? Seems clear that something like sqrt(n)
would be a better factor
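For points inside [0,1]^n (rather than on the corners), each coordinate contributes E[(X - Y)^2] = 1/6 for X, Y uniform on [0,1], so the expected squared distance is n/6 and the typical distance is about sqrt(n/6) — the corrected figure later in the thread. A quick simulation (NumPy, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials = 100, 20000
x = rng.random((trials, n))      # uniform points inside [0,1]^n
y = rng.random((trials, n))
dist = np.linalg.norm(x - y, axis=1)
# Per coordinate E[(X - Y)^2] = 1/6, so E[d^2] = n/6 and the
# distance concentrates around sqrt(n/6).
print(dist.mean(), np.sqrt(n / 6))
```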
[
https://issues.apache.org/jira/browse/MAHOUT-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130725#comment-13130725
]
Jeff Eastman commented on MAHOUT-845:
---
+1 We could easily add a static method to Abstr
I've most often seen something like exp(-d(x,y)) for converting distance to
similarity. Unlike 1/(1+d) this has exponential decay in distance, which is
usually more desirable. There is a similar kludge to what you describe, where
people use exp(-d/h) for some bandwidth h. I'm not sure there's
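The two decay profiles mentioned here are easy to compare directly. A small sketch (plain Python; h=1.0 is just an illustrative default, in practice the bandwidth would be tuned to your data):

```python
import math

def sim_rational(d):
    """1/(1+d): polynomial decay in distance."""
    return 1.0 / (1.0 + d)

def sim_exponential(d, h=1.0):
    """exp(-d/h): exponential decay; h is a bandwidth you would
    tune to your data (h=1 here is only illustrative)."""
    return math.exp(-d / h)

for d in (0.0, 1.0, 5.0, 10.0):
    print(d, sim_rational(d), sim_exponential(d))
# Both map d=0 -> 1, but exp(-d) falls off much faster:
# at d=10, 1/(1+d) ~ 0.09 while exp(-10) ~ 4.5e-5.
```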
There's already a cosine distance measure implementation available;
this concerns the right-est way to implement a Euclidean
distance-based measure.
On Wed, Oct 19, 2011 at 4:12 PM, Christian Prokopp
wrote:
> Do you have a particular reason for not going with cosine?
>
> On 19 October 2011 15:51,
Do you have a particular reason for not going with cosine?
On 19 October 2011 15:51, Sean Owen wrote:
> Interesting question came up recently about using the Euclidean
> distance d between two vectors as a notion of their similarity.
>
> You can use 1 / (1 + d), which mostly works, except that i
Interesting question came up recently about using the Euclidean
distance d between two vectors as a notion of their similarity.
You can use 1 / (1 + d), which mostly works, except that it
'penalizes' larger vectors, who have more dimensions along which to
differ. This is bad when those vectors are
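The penalty on larger vectors is visible in a quick experiment: with more overlapping dimensions, the typical Euclidean distance grows, so 1/(1+d) shrinks even when neither pair is intuitively "less similar". A sketch (NumPy, uniform random vectors, not Mahout code):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_similarity(n_dim, trials=20000):
    """Average 1/(1+d) similarity for random pairs in [0,1]^n_dim."""
    x = rng.random((trials, n_dim))
    y = rng.random((trials, n_dim))
    d = np.linalg.norm(x - y, axis=1)
    return (1.0 / (1.0 + d)).mean()

# More dimensions along which to differ -> larger typical distance
# -> systematically lower 1/(1+d) similarity.
for n in (2, 10, 100):
    print(n, mean_similarity(n))
```

This is the motivation for a dimension-dependent correcting factor discussed in the rest of the thread.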
Make cluster top terms code more reusable
-
Key: MAHOUT-845
URL: https://issues.apache.org/jira/browse/MAHOUT-845
Project: Mahout
Issue Type: Improvement
Components: Clustering
Re