[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744973#action_12744973
 ] 

Shashikant Kore commented on MAHOUT-121:


OrderedIntDoubleMapping, the primitive map used in SparseVector, is not 
fast enough. Compared to Trove, it is an order of magnitude slower. 

How can we improve the performance? 
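
For context on the gap, here is a hedged sketch of the open-addressing, primitive-keyed approach a Trove-style map takes (illustrative only; this is not Trove's code, and it omits resizing and deletion). The point is that gets avoid boxing and binary search entirely: an expected O(1) probe sequence over primitive arrays.

```java
// A minimal primitive int -> double hash map using linear probing.
// Assumptions (not from the issue): keys are non-negative, capacity is a
// power of two, and the table never fills up (no resize logic).
final class IntDoubleHashMap {
  private static final int EMPTY = -1; // sentinel for an unused slot
  private final int[] keys;
  private final double[] vals;
  private final int mask;

  IntDoubleHashMap(int capacityPow2) {
    keys = new int[capacityPow2];
    vals = new double[capacityPow2];
    java.util.Arrays.fill(keys, EMPTY);
    mask = capacityPow2 - 1;
  }

  void put(int key, double value) {
    int slot = key & mask;
    // linear probing: walk forward until we find the key or an empty slot
    while (keys[slot] != EMPTY && keys[slot] != key) {
      slot = (slot + 1) & mask;
    }
    keys[slot] = key;
    vals[slot] = value;
  }

  double get(int key) {
    int slot = key & mask;
    while (keys[slot] != EMPTY) {
      if (keys[slot] == key) {
        return vals[slot];
      }
      slot = (slot + 1) & mask;
    }
    return 0.0; // absent index of a sparse vector reads as zero
  }
}
```

No objects are allocated per get or put, which is exactly what the boxed `HashMap<Integer, Double>` alternative cannot avoid.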

 Speed up distance calculations for sparse vectors
 -

 Key: MAHOUT-121
 URL: https://issues.apache.org/jira/browse/MAHOUT-121
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k, 
 MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch, 
 MAHOUT-121-new-distance-optimization.patch, mahout-121.patch, 
 MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, 
 MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch


 From my mail to the Mahout mailing list.
 I am working on clustering a dataset which has thousands of sparse vectors. 
 The complete dataset has a few tens of thousands of feature items, but each 
 vector has only a couple of hundred feature items. For this, there is an 
 optimization in distance calculation, a link to which I found in the 
 archives of the Mahout mailing list.
 http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
 I tried out this optimization. The test setup had 2000 document vectors 
 with a few hundred items each. I ran canopy generation with Euclidean 
 distance and t1, t2 values of 250 and 200.
  
 Current Canopy Generation: 28 min 15 sec.
 Canopy Generation with distance optimization: 1 min 38 sec.
 I know from experience that using Integer and Double objects instead of 
 primitives is computationally expensive. I changed the sparse vector 
 implementation to use primitive collections from Trove [
 http://trove4j.sourceforge.net/ ].
 Distance optimization with Trove: 59 sec
 Current canopy generation with Trove: 21 min 55 sec
 To sum up, these two optimizations reduced cluster generation time by 97%.
 Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
  
 Licensing of Trove seems to be an issue which needs to be addressed.
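
The distance optimization referenced above can be sketched as follows. This is a minimal illustration under the assumption that sparse vectors are stored as sorted parallel index/value arrays (the names and layout are hypothetical, not Mahout's actual classes): since d(a,b)^2 = |a|^2 + |b|^2 - 2*(a.b), precomputing each vector's squared norm once means every distance call only walks the few hundred non-zero entries.

```java
// Sketch of sparse squared Euclidean distance via precomputed norms.
// Vectors are (sorted index array, value array) pairs; absent indices are 0.
final class SparseDistance {
  // Merge-walk two sorted index lists; only shared indices contribute.
  static double dot(int[] ia, double[] va, int[] ib, double[] vb) {
    double sum = 0.0;
    int i = 0, j = 0;
    while (i < ia.length && j < ib.length) {
      if (ia[i] == ib[j]) {
        sum += va[i++] * vb[j++];
      } else if (ia[i] < ib[j]) {
        i++;
      } else {
        j++;
      }
    }
    return sum;
  }

  // Computed once per vector and cached by the caller.
  static double normSquared(double[] v) {
    double sum = 0.0;
    for (double x : v) {
      sum += x * x;
    }
    return sum;
  }

  // |a|^2 + |b|^2 - 2*(a.b), touching only non-zero entries.
  static double distanceSquared(int[] ia, double[] va, double normA2,
                                int[] ib, double[] vb, double normB2) {
    return normA2 + normB2 - 2.0 * dot(ia, va, ib, vb);
  }
}
```

With centroids' norms cached across a whole canopy or k-means pass, the per-pair cost drops from the full dimensionality to the document's non-zero count, which matches the order-of-magnitude speedup reported above.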

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744980#action_12744980
 ] 

Sean Owen commented on MAHOUT-121:
--

These two implementations use different algorithms. This implementation is the 
analog of TreeSet in Java Collections, whereas you are talking about an analog 
of HashMap.

A hash is likely going to be faster for gets. I imagine it's slower when 
required to produce the keys/values in order. Which did you benchmark? Let's 
make sure we are asking the right question first.

Of course, if we find that in fact we far more often need fast gets than fast 
iteration, we do want a hash-based implementation instead. Is that the 
conclusion? (And could we make a new issue to continue that!)

We can't use Trove of course. The parts of colt we need appear to be 
Apache-compatible. That seems like a fine thing to try next.

If it's not suitable, we can easily roll our own primitive-based hash. We 
already have a long-Object map implementation.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744991#action_12744991
 ] 

Shashikant Kore commented on MAHOUT-121:



We need a mix of iteration and map get/set, but the map operations are 
dominant. Hence I tested with get/set operations. Getting an iterator in the 
case of Colt is less than 2x slower. 

1. Vector addition: 
Get iterator on one vector.  
Iterate on the indices to get values from that vector.   
For the same indices, get values from current vector.
Set the result in the current vector. 

In the current implementation, get is an O(log n) operation, and set involves 
an array copy, which makes repeated insertion quadratic (I think).

2. SparseVector getDistanceSquared()
Operational complexity is similar to that of vector addition, except there is 
no set operation.
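
The cost profile described above can be seen in a minimal sketch of a sorted parallel-array map (illustrative; not Mahout's actual OrderedIntDoubleMapping code): get is a binary search, O(log n), and setting a previously absent index must copy both arrays, so building a vector by repeated insertion is O(n) per set and quadratic overall.

```java
// Hypothetical sorted parallel-array sparse map, kept deliberately minimal.
final class SortedIntDoubleMap {
  private int[] indices = new int[0];
  private double[] values = new double[0];

  double get(int index) {
    // O(log n) binary search over the sorted index array
    int pos = java.util.Arrays.binarySearch(indices, index);
    return pos >= 0 ? values[pos] : 0.0;
  }

  void set(int index, double value) {
    int pos = java.util.Arrays.binarySearch(indices, index);
    if (pos >= 0) {
      values[pos] = value; // existing index: cheap overwrite
      return;
    }
    // new index: allocate and copy both arrays to splice it in, O(n)
    int ins = -(pos + 1); // insertion point from binarySearch's contract
    int[] ni = new int[indices.length + 1];
    double[] nv = new double[values.length + 1];
    System.arraycopy(indices, 0, ni, 0, ins);
    System.arraycopy(values, 0, nv, 0, ins);
    ni[ins] = index;
    nv[ins] = value;
    System.arraycopy(indices, ins, ni, ins + 1, indices.length - ins);
    System.arraycopy(values, ins, nv, ins + 1, values.length - ins);
    indices = ni;
    values = nv;
  }
}
```

This layout is excellent for in-order iteration (the merge-walk in distance calculations) but pays for it on random get/set, which is why the hash-based alternative wins the benchmark above.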






[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744995#action_12744995
 ] 

Shashikant Kore commented on MAHOUT-121:


Trying it out...





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745015#action_12745015
 ] 

Grant Ingersoll commented on MAHOUT-121:


Let's open a new issue for this one, as this issue is really long and 
overloaded at this point. 

Solr uses Colt via a download mechanism, so we could probably use it via the 
POM, but I'm not sure if we can distribute it.  

FWIW, in my profiling, I think find() was the major operation.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Rob Eden (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745082#action_12745082
 ] 

Rob Eden commented on MAHOUT-121:
-

Hi guys. I'm the lead developer for the Trove project. Shashi mentioned the 
problem you're having with Trove's license. I appreciate the interest and would 
like to accommodate your usage. I'm going to speak to the original developer 
about dual-licensing.

My specific question is, would the MPL be an acceptable license for usage or 
does it have to be APL? (Not implying that I necessarily have a problem with 
APL... just checking options.)




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745097#action_12745097
 ] 

Grant Ingersoll commented on MAHOUT-121:


Hi Rob,

Thanks!  Here's the ASF's stance on licenses:  
http://apache.org/legal/resolved.html.   I think for our use case, MPL would be 
fine, since we are just using it as a library and not extending/deriving from 
it, even if ASL would be better.

-Grant




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-19 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745101#action_12745101
 ] 

Benson Margulies commented on MAHOUT-121:
-

Trove is LGPL. If it were published to Maven, which it currently is not in any 
reliable way, I think Mahout could have an optional dependency. As per the 
referenced page, LGPL is precluded as a full dependency.





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12741068#action_12741068
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. The floating point comparison (lengthSquared != -1) is not correct. It 
should be (lengthSquared > 0), if at all we want to remove the Double object.

Actually, this should be >= 0, since we are returning lengthSquared in that 
case (before, we were returning it if it was not null).
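
The caching pattern under discussion can be sketched like this (a hypothetical stand-in, not the actual SparseVector code): the boxed Double field with a null check is replaced by a primitive double and a negative sentinel. Since a squared length can legitimately be 0.0 but can never be negative, the "already computed" guard needs to be `lengthSquared >= 0`, not an exact comparison against the sentinel.

```java
// Lazily computed, primitively cached squared length.
final class CachedLength {
  private final double[] values;
  private double lengthSquared = -1.0; // sentinel: not yet computed

  CachedLength(double[] values) {
    this.values = values;
  }

  double getLengthSquared() {
    if (lengthSquared >= 0.0) {
      return lengthSquared; // cached; also correct for the 0.0 case
    }
    double sum = 0.0;
    for (double v : values) {
      sum += v * v;
    }
    lengthSquared = sum;
    return sum;
  }
}
```

With `!= -1` the zero-length vector would be recomputed on every call only if the comparison happened to fail, and any recomputation path that stores 0.0 would still re-enter the slow path; `>= 0` avoids both problems.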




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740854#action_12740854
 ] 

Sean Owen commented on MAHOUT-121:
--

Nope, you have not avoided an object allocation in either case. You are simply 
moving around declarations of object references. (Does it help to note that 
objects are never allocated on the stack as in C++?) Therefore, I think the 
prior arguments apply. This very slightly hurts performance and readability and 
should be reversed.
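
Sean's point can be demonstrated directly with a small hypothetical example (not from any patch) that counts constructions: the allocation happens at the `new` expression, not at the declaration site, so hoisting the declaration out of the loop changes nothing.

```java
// Counting allocations to show declaration placement is irrelevant.
final class AllocationDemo {
  static int constructed = 0;

  static final class Point {
    Point() {
      constructed++;
    }
  }

  static int declaredInside(int n) {
    constructed = 0;
    for (int i = 0; i < n; i++) {
      Point p = new Point(); // one allocation per iteration
    }
    return constructed;
  }

  static int declaredOutside(int n) {
    constructed = 0;
    Point p; // hoisted declaration: still one allocation per iteration
    for (int i = 0; i < n; i++) {
      p = new Point();
    }
    return constructed;
  }
}
```

Both variants construct exactly n objects; the only difference is the scope of the reference variable.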




Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-08 Thread Jeff Eastman

Grant Ingersoll (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740573#action_12740573 ] 


Grant Ingersoll commented on MAHOUT-121:


bq. Please, feel free to contradict me here - That was the whole point: 
getStd() is NEVER called.

Ah, I see now.  We were indeed calculating it in computeCentroid lots of 
times, but it is only ever used for DisplayKMeans.  You are correct.


As for loop unrolling, etc., it is usually best to let the compiler take care 
of that.  As for strings, you should never, ever use String concatenation in a 
loop.  It is a horrible performance drain; StringBuilder is definitely the way 
to go.  Especially be on the lookout for String concats in logging statements 
that aren't guarded by if (log.isDebugEnabled()).

I will fix the lengthSquared comparison and commit.

Thanks!
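
The logging advice above can be illustrated with a stripped-down hypothetical sketch (no real logging framework; a boolean parameter stands in for log.isDebugEnabled()): the expensive string is only ever built when the guard passes.

```java
// Demonstrates why debug-log string building should be guarded.
final class LoggingGuard {
  static int buildCount = 0;

  // Stands in for an expensive message; counts how often it actually runs.
  static String expensiveDescription(Object o) {
    buildCount++;
    return "state=" + o; // concat here compiles to a StringBuilder anyway
  }

  // debugEnabled stands in for log.isDebugEnabled()
  static void logDebug(boolean debugEnabled, Object o) {
    if (debugEnabled) { // guard: skip message construction entirely
      System.out.println("DEBUG " + expensiveDescription(o));
    }
  }
}
```

Without the guard, an unconditional `log.debug("state=" + o)` call would build the string on every invocation even when debug output is discarded, which in an inner clustering loop adds up fast.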

  

I added the getStd() method for use in the display examples to show a 
size metric for the clusters consistent with the Dirichlet models. The 
std calculation should, at least, be refactored out of computeCentroid 
so it is only done if needed. If calculating the pointSquaredTotal is a 
source of unacceptable performance overhead, then by all means remove it 
and I will figure out how to display something reasonable about cluster 
size.







[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740519#action_12740519
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. Declaring variables out of the loop looks like an optimization, but I 
suppose the compiler will do that automatically

This appears to be moving it right out of the computation altogether, though, 
as getStd (we need a better name for that!) isn't even in the computation 
anymore, AFAICT.

 Speed up distance calculations for sparse vectors
 -

 Key: MAHOUT-121
 URL: https://issues.apache.org/jira/browse/MAHOUT-121
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k, 
 MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch, 
 MAHOUT-121-new-distance-optimization.patch, mahout-121.patch, 
 MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, 
 MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch


 From my mail to the Mahout mailing list.
 I am working on clustering a dataset which has thousands of sparse vectors. 
 The complete dataset has a few tens of thousands of feature items, but each 
 vector has only a couple of hundred feature items. For this, there is an 
 optimization in distance calculation, a link to which I found in the archives 
 of the Mahout mailing list.
 http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
 I tried out this optimization.  The test setup had 2000 document vectors 
 with a few hundred items each.  I ran canopy generation with Euclidean distance 
 and t1, t2 values of 250 and 200.
  
 Current Canopy Generation: 28 min 15 sec.
 Canopy Generation with distance optimization: 1 min 38 sec.
 I know from experience that using Integer and Double objects instead of 
 primitives is computationally expensive. I changed the sparse vector 
 implementation to use primitive collections from Trove [
 http://trove4j.sourceforge.net/ ].
 Distance optimization with Trove: 59 sec
 Current canopy generation with Trove: 21 min 55 sec
 To sum up, these two optimizations reduced cluster generation time by 97%.
 Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
  
 Licensing of Trove seems to be an issue which needs to be addressed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
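The distance optimization from the LingPipe post linked in the issue description can be sketched as follows. This is a minimal illustration only, not Mahout's actual SparseVector API: the map-based representation and method names here are assumptions. The idea is that ||x - c||^2 = ||x||^2 - 2(x . c) + ||c||^2, so if ||c||^2 is precomputed once per centroid, each distance call costs one sparse dot product plus two cached norms, instead of a pass over the full dense dimensionality.

```java
import java.util.Map;

public class SparseDistanceSketch {
  static double normSquared(Map<Integer, Double> v) {
    double s = 0.0;
    for (double x : v.values()) {
      s += x * x;
    }
    return s;
  }

  static double dot(Map<Integer, Double> sparse, Map<Integer, Double> other) {
    // Iterate only the nonzeros of the first (sparser) argument.
    double s = 0.0;
    for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
      s += e.getValue() * other.getOrDefault(e.getKey(), 0.0);
    }
    return s;
  }

  public static void main(String[] args) {
    Map<Integer, Double> point = Map.of(0, 3.0, 2, 4.0);
    Map<Integer, Double> centroid = Map.of(0, 1.0, 1, 2.0);

    double cached = normSquared(centroid); // precomputed once per centroid
    double dsq = normSquared(point) - 2.0 * dot(point, centroid) + cached;
    System.out.println(dsq); // (3-1)^2 + (0-2)^2 + (4-0)^2 = 24.0
  }
}
```

Because clustering compares many points against the same few centroids, caching the centroid norms is where the reported 28-minute-to-98-second speedup comes from.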



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740527#action_12740527
 ] 

Sean Owen commented on MAHOUT-121:
--

On the point of variable declaration -- the location of the declaration has no 
impact on performance. A variable is not somehow 'declared' multiple times if it 
sits within a loop. Pulling it out of the loop could even be very slightly 
slower, if it forces you to assign a dummy value at initialization which is 
never used. There is one more consideration when referencing objects: a 
reference that exists, say, only inside a loop goes out of scope when the loop 
finishes, and its referent is immediately GC-able. If it is declared outside 
the loop, the reference exists until the larger containing block finishes -- 
unless you set it to null, but that's more maintenance. That's a waste if the 
referent is large and not actually used anymore. This has actually bitten me 
in the past.

So in general, in Java, declare variables as deep and late as possible. 

Agree with Shashikant that while using double instead of Double is nice, using 
-1 (really, use -1.0 -- double literals belong with double values) as a signal 
value probably isn't right here. Just test < 0.0.
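The sentinel pattern under discussion can be sketched as below. The class and field names here are hypothetical, not Mahout's actual code: cache lengthSquared lazily in a primitive double, and test `< 0.0` rather than an exact floating-point equality against -1.

```java
public class SparseVectorSketch {
  private final double[] values;
  private double lengthSquared = -1.0; // negative means "not computed yet"

  public SparseVectorSketch(double[] values) {
    this.values = values;
  }

  public double getLengthSquared() {
    if (lengthSquared < 0.0) { // sentinel check, not equality against -1
      double sum = 0.0;
      for (double v : values) {
        sum += v * v;
      }
      lengthSquared = sum;     // computed once, reused afterwards
    }
    return lengthSquared;
  }

  public static void main(String[] args) {
    SparseVectorSketch v = new SparseVectorSketch(new double[] {3.0, 4.0});
    System.out.println(v.getLengthSquared()); // 3*3 + 4*4 = 25.0
  }
}
```

A `< 0.0` test is safe because a true squared length is never negative, so no legitimate cached value can be mistaken for the sentinel.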




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740568#action_12740568
 ] 

Nicolás Fantone commented on MAHOUT-121:


Thanks for the feedback, people. There are some things I'd like to point out:

{quote}
??I don't get the getStd() method now. You moved the functionality from 
computeCentroid, but now getStd is never called.??
{quote}

Please, feel free to contradict me here - that was the whole point: getStd() is 
NEVER called. I searched through the entire mahout-core project and no call was 
found. So, why add functionality where it does not belong and, moreover, is not 
needed? If there's a computeCentroid() method, then let it compute a 
centroid... not estimate some deviation (which isn't actually used, anyway) 
hundreds of times iteratively. If there's a real need for that particular 
value, then create a method that calculates it. As there was no such method, 
I thought it was appropriate to use getStd().

{quote}
??I also am not following the commenting out of the check to see if the two 
vectors are the same size??
{quote}

Feel free to uncomment those lines and leave them as they once were. I realize 
that a size check is a needed and healthy thing, but then again, if the purpose 
of the optimized version of distance() is performance gain, perhaps we should 
consider dropping it (there are a couple of methods in the project that 
already do that - check getDistanceSquared() in SparseVector).

{quote}
??Declaring variables out of the loop looks like an optimization, but I 
suppose, compiler will do that automatically. Even if it doesn't, the 
performance penalty is not so significant to sacrifice the code readability 
(which is arguable.)??
{quote}

I have had this discussion dozens of times with my colleagues at the University 
and partners at work. Though I can accept that the root of the issue at hand is 
kind of a myth (whether declaration inside a loop impacts performance 
negatively or not), instantiation of thousands of unnecessary objects IS costly 
in Java - not to mention garbage-accumulative. At the end of the day, avoiding 
this kind of code is, more often than not, a healthy thing to do - even if it 
means a decrease in readability.

{quote}
??I ran a simple test with million loops over float multiplication  runs in 
10ms. Moving the declarations outside the loop didn't change the results.??
{quote}

floats are a primitive type. Want a simple test? Use String, which also happens 
to be immutable. Something as straightforward as,

{{String s = new String("sample-string");}}
{{for (int i = 0; i < 25; i++) { s += anotherString; }}}

will take several minutes to complete thanks to the cost of instantiating a new 
String each iteration with a new value. Now, replace String with StringBuilder, 
+= with append, and voilà: no more instantiations, same result, a few seconds.
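The String-vs-StringBuilder contrast can be made concrete with a small self-contained sketch (the iteration count here is chosen just to keep it quick to run; the asymptotics are what matter):

```java
public class ConcatDemo {
  public static void main(String[] args) {
    int n = 20_000;

    // String is immutable: each += allocates a brand-new String, copying
    // everything accumulated so far -- O(n^2) work and n temporary objects.
    String s = "";
    for (int i = 0; i < n; i++) {
      s += "x";
    }

    // StringBuilder mutates one internal buffer: no per-iteration String
    // allocation, O(n) total work.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < n; i++) {
      sb.append("x");
    }

    System.out.println(s.length() == sb.toString().length()); // true
  }
}
```

Both loops produce the same result; the difference is the number of intermediate objects created along the way.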

{quote}
??The floating point comparison (lengthSquared != -1) is not correct. It 
should be (lengthSquared < 0), if at all we want to remove the Double object.??
{quote}

Agree. You are completely right.


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740573#action_12740573
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. Please, feel free to contradict me here - That was the whole point: 
getStd() is NEVER called.

Ah, I see now.  We were indeed calculating it in computeCentroid lots of 
times, but it is only ever used for DisplayKMeans.  You are correct.


As for loop unrolling, etc., it is usually best to let the compiler take care 
of that.  As for strings, you should never, ever use String for repeated 
concatenation.  It is a horrible performance drain.  StringBuilder is 
definitely the way to go.  Especially be on the lookout for String concats in 
logging statements that aren't guarded by if (log.isDebugEnabled())
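The guard pattern mentioned above can be sketched with java.util.logging, whose equivalent of isDebugEnabled() is isLoggable(Level.FINE) (the message contents here are made up for illustration):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardedLogging {
  private static final Logger log = Logger.getLogger("demo");

  public static void main(String[] args) {
    double distance = 42.0;

    // Without the guard, the String concatenation below would run (and
    // allocate) even when FINE logging is disabled. With the guard, the
    // level check skips the concatenation entirely.
    if (log.isLoggable(Level.FINE)) {
      log.fine("distance = " + distance);
    }

    System.out.println("done");
  }
}
```

In a hot loop such as distance computation, an unguarded debug concat pays the full String-building cost on every iteration even when the message is discarded.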

I will fix the lengthSquared comparison and commit.

Thanks!




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740576#action_12740576
 ] 

Grant Ingersoll commented on MAHOUT-121:


What about the commenting out of:
{code}
if (centroid.size() != v.size()) {
{code}

It seems like if we are going to do this in this distance measure, we should do 
it in the other distance measures.  We could push this out a layer via 
documentation and require the caller to do the check.  Otherwise, do we have 
some guarantee that the centroid and the vector have the same cardinality in 
all cases of calls to this method?  I think we do explicitly for the Cluster 
calls, but that isn't the same.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740577#action_12740577
 ] 

Sean Owen commented on MAHOUT-121:
--

Still picking on this loops / declaration issue:

Instantiating unneeded objects is indeed, obviously, a waste. But the changes 
in the last patch (which I think we're discussing?) do not save 
instantiations -- nor do they really add any. I don't follow this point then.

The point illustrated by the String loop example has nothing to do with how 
variables are declared, and everything to do with the difference between String 
and StringBuilder. It doesn't seem to address the point previously raised.

Here is a difference that doesn't matter:

{code}
String s = null;
for (int i = 0; i < 25; i++) { s = "anotherString: " + i; }

for (int i = 0; i < 25; i++) { String s = "anotherString: " + i; }
{code}

In fact the first is ever so slightly worse since it sets s to null, but the 
value is unused. But it is worse for another reason: s continues to point to 
"anotherString: 24" after the loop terminates, which is also pointless.

Hence I would undo that part of the patch unless there is another purpose to it 
I missed.





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740580#action_12740580
 ] 

Sean Owen commented on MAHOUT-121:
--

And finally, to Grant's point about compilers and loop unrolling (and I may be 
stating things people already know):

- This isn't an example of unrolling, is it?
- javac will never perform optimizations like this, or really of any kind, 
given it is bound to output certain bytecode for a given input program. So I 
tend to code efficiently from the get-go in Java.
- (But ProGuard can, I love ProGuard.)
- But the JIT compiler might at runtime. The theory, and I like it, is that 
this half of the compilation is better done by a process on the platform where 
the target code will run. For instance, it can inline some method calls even 
from non-final classes, even though Java method invocation is always 
polymorphic, if it detects that there don't happen to be any subclasses 
loaded in the JVM at runtime. That would not be possible at compile time.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740587#action_12740587
 ] 

Nicolás Fantone commented on MAHOUT-121:


{quote}
The point illustrated by the String loop example has nothing to do with how 
variables declared, and everything to do with the difference between String and 
StringBuilder. It doesn't seem to address the point previously raised.
{quote}

Not quite right. The difference between String and StringBuilder IS EXACTLY the 
difference between instantiating thousands of objects and re-using just one, 
which is, I believe, the matter at hand here.

{quote}
In fact the first is ever so slightly worse since it sets s to null, but the 
value is unused. But it is worse for another reason: s continues to point to 
anotherString: 24 after the loop terminates, which is also pointless.
{quote}

If you create new Strings in a loop, then you'll have as many objects as 
iterations pointing to "anotherString: 0", "anotherString: 1", ..., 
"anotherString: 121410", and so on, waiting to be garbage-collected - which may 
not even happen in the short term. Even more pointless, following your logic.

{quote}
Hence I would undo that part of the patch unless there is another purpose to it 
I missed.
{quote}

Perhaps someone could run a profiler with and without the latest patch? I tend 
to think the gain in execution speed would not be significant if any at all, as 
some of you have stated. However, unless code readability is a priority, I see 
no harm in changing something that can only help performance.

{quote}
This isn't an example of unrolling is it?
{quote}

That's right. It is not. It is about the cost of instantiation vs. reusability 
of short-lived objects.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740589#action_12740589
 ] 

Sean Owen commented on MAHOUT-121:
--

Are we talking about the same patch? Sections like this?

{code}
+double distance = 0.0;
+Vector clusterCenter = null;
 for (Cluster cluster : clusters) {
-  double distance = measure.distance(point, cluster.getCenter());
-  if (nearestCluster == null || distance < nearestDistance) {
+  clusterCenter = cluster.getCenter();
+  distance = measure.distance(clusterCenter.getLengthSquared(), clusterCenter, point);
+  if (distance < nearestDistance) {
     nearestCluster = cluster;
     nearestDistance = distance;
   }
{code}

The question here is indeed whether to declare distance and clusterCenter 
inside or outside the loop.

While I completely agree with your statements, they do not seem to pertain to 
this.

I think we're not talking about the same thing, since you pick up on the point 
of creating Strings in a loop -- which both my examples did. That is not a 
difference and not germane to the point I have in mind.

I am further making the point that there is not a tradeoff between readability 
and speed here -- putting the declaration outside the loop actually makes both 
worse. I am quite sure this should be reversed.
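The declare-inside-the-loop form being argued for here can be sketched as a self-contained example. Cluster, the distance function, and the field names below are simplified stand-ins for Mahout's types, not the real API:

```java
import java.util.List;

public class NearestClusterSketch {
  // Minimal stand-in so the sketch compiles on its own.
  static final class Cluster {
    final double center;
    Cluster(double center) { this.center = center; }
  }

  static double distance(double a, double b) {
    return Math.abs(a - b);
  }

  static Cluster findNearest(List<Cluster> clusters, double point) {
    Cluster nearestCluster = null;
    double nearestDistance = Double.MAX_VALUE;
    for (Cluster cluster : clusters) {
      // Declared inside the loop: narrowest scope, no dummy initial value,
      // and the reference is GC-able as soon as the iteration ends.
      double d = distance(point, cluster.center);
      if (d < nearestDistance) {
        nearestCluster = cluster;
        nearestDistance = d;
      }
    }
    return nearestCluster;
  }

  public static void main(String[] args) {
    List<Cluster> clusters = List.of(new Cluster(0.0), new Cluster(10.0));
    System.out.println(findNearest(clusters, 8.0).center); // 10.0
  }
}
```

Note that no object is instantiated per iteration in either form; moving the declarations outside the loop widens scope without saving any allocation.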




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740646#action_12740646
 ] 

Nicolás Fantone commented on MAHOUT-121:


Jira is hardly the place to discuss these (almost moral) issues, as they tend 
to be a matter of personal choice most of the time. But I feel the urge to 
reply: some of you have really helped me out here at work with my project. 

{quote}
Are we talking about the same patch? sections like this? 
{quote}

Yes, we are. And while we are taking a look at that quoted piece of code, I 
shall mention that the declaration of distance and clusterCenter outside the 
loop isn't the only thing being changed in the patch. There's also the 
elimination of an unnecessary boolean evaluation in the if statement. 

{quote}
I think we're not talking about the same thing, since you pick up on the point 
of creating Strings in a loop - which both my examples did. 
{quote}

We may not. Perhaps my String example wasn't the happiest of them all - but it 
certainly does show how costly it is in Java just to instantiate and 
initialize an object. Furthermore, I thought I gave a counterargument to what 
you were implying with those examples.

{quote}
I am further making the point that there is not a tradeoff between readability 
and speed here - putting the declaration outside the loop actually makes both 
worse.
{quote}

I can tell none of my propositions were convincing enough. Maybe someone 
else's? 
http://weblogs.java.net/blog/ddevore/archive/2006/08/declare_variabl_1.html




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740654#action_12740654
 ] 

Sean Owen commented on MAHOUT-121:
--

My comments -- which narrowly concern where to declare variables -- are 
relevant to the patch. It is not a matter of choice or style, or morality or 
whatever.

Your point about Strings is undeniable, but is irrelevant to this question. See 
my counter-example for an example which would be relevant to the point I am 
trying to make.

This blog post corroborates exactly what I am saying. Declaring outside the 
loop only incurs an extra initialization. It also makes the point that it is 
not the same thing to *allocate an Object* outside the loop, versus inside. Of 
course -- it's the difference between allocating one object and many. But, that 
is not what is happening in your patch.

I think, then, all evidence suggests the variable declaration changes should be 
reverted. I am making no other claims. For instance, removing the unnecessary 
boolean condition remains a positive change I am sure.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740666#action_12740666
 ] 

Grant Ingersoll commented on MAHOUT-121:


The only thing I'm missing at this point is the question from earlier about 
the removal of the size comparison; then I am ready to commit.
{quote}
What about the commenting out of:
{code}
if (centroid.size() != v.size()) {
{code}
It seems like if we are going to do this in this distance measure, we should do 
it in the other distance measure. We could push this out a layer via 
documentation and require the caller to do the check, if required. Otherwise, 
do we have some guarantee that the centroid and the vector are the same 
cardinality in all cases of calls to this method? I think we do explicitly for 
the Cluster calls, but that isn't the same.
{quote}
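For illustration, the "push it out a layer" option could look something like the following caller-side check (a hypothetical helper, not Mahout API):

```java
public class CardinalityCheck {
  // Hypothetical helper: callers verify cardinality before invoking the
  // optimized distance method, instead of the method checking internally.
  static void requireSameCardinality(int centroidSize, int vectorSize) {
    if (centroidSize != vectorSize) {
      throw new IllegalArgumentException(
          "cardinality mismatch: " + centroidSize + " != " + vectorSize);
    }
  }

  public static void main(String[] args) {
    requireSameCardinality(100, 100); // same cardinality: no exception
    try {
      requireSameCardinality(100, 99);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // prints "cardinality mismatch: 100 != 99"
    }
  }
}
```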






[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740681#action_12740681
 ] 

Nicolás Fantone commented on MAHOUT-121:


Sean, we are definitely not following each other. Probably due to my lack of 
communication skills.

{quote}
Your point about Strings is undeniable, but is irrelevant to this question. See 
my counter-example for an example which would be relevant to the point I am 
trying to make.
{quote}

How could it be irrelevant when it's the exact same point I tried to make in 
my patch? There's a Vector being instantiated and allocated in every loop 
iteration, in every task, in every reduce job, on every node of the cluster. 
And it is not necessary. Every time. The very same thing goes for my String 
example... except with Strings.

{quote}
Declaring outside the loop only incurs an extra initialization.
{quote}

Extra? There's only ONE initialization. If the declaration is done inside the 
loop, thousands of initializations are going to be done. That's thousands minus 
one extra initializations.

Grant, maybe you should leave the size comparison for now. It won't impact 
speed noticeably and, as of now, KMeans is only using the optimized distance 
calculation for both computing convergence and emitting points. Is there 
anywhere else a size check is done between input vectors? I believe there isn't.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740684#action_12740684
 ] 

Sean Owen commented on MAHOUT-121:
--

Where in the patch do you move an object allocation from inside the loop to 
outside? I do not see this. If I did, indeed, you would have a point. That's 
why we may not be talking about the same thing. All I see is moving 
declarations.

Yes, the one extra initialization is trivial. The fact that a reference lives 
on outside the loop unnecessarily is also tiny, but not as tiny. But they are 
both negatives. I think the consensus was that it is also very slightly less 
readable. So we have a couple of small negatives -- how can that add up to 
support for the change?




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740694#action_12740694
 ] 

Nicolás Fantone commented on MAHOUT-121:


{quote}
Where in the patch do you move an object allocation from inside the loop to 
outside? I do not see this. 
{quote}

Here:
{code:title=getDistanceSquared() in SparseVector.java - Original 
|borderStyle=solid}
double result = 0.0;
Iterator<Vector.Element> iter = iterateNonZero();
while (iter.hasNext()) {
  Vector.Element elt = iter.next();
  double centroidValue = v.getQuick(elt.index());
  double delta = elt.get() - centroidValue;
  result += (delta * delta) - (centroidValue * centroidValue);
}
return result;
{code}

{code:title=getDistanceSquared() in SparseVector.java - Patched 
|borderStyle=solid}
  ... 
double result = 0.0;
Iterator<Vector.Element> iter = iterateNonZero();
Vector.Element elt = null;
double centroidValue = 0.0;
double delta = 0.0;
while (iter.hasNext()) {
  elt = iter.next();
  centroidValue = v.getQuick(elt.index());
  delta = elt.get() - centroidValue;
  result += (delta * delta) - (centroidValue * centroidValue);
}
...
{code}

And here:

{code:title=emitPointToNearestCluster() in Cluster.java - Patched 
|borderStyle=solid}
 ...
double distance = 0.0;
Vector clusterCenter = null;
 for (Cluster cluster : clusters) {
  clusterCenter = cluster.getCenter();
  distance = measure.distance(clusterCenter.getLengthSquared(), 
clusterCenter, point);
  if (distance < nearestDistance) {
 nearestCluster = cluster;
 nearestDistance = distance;
   }
...
{code}

{code:title=emitPointToNearestCluster() in Cluster.java - From Grant patch 
|borderStyle=solid}
 Cluster nearestCluster = null;
 double nearestDistance = Double.MAX_VALUE;
 for (Cluster cluster : clusters) {
  Vector clusterCenter = cluster.getCenter();
  double distance = measure.distance(clusterCenter.getLengthSquared(), 
clusterCenter, point);
   if (nearestCluster == null || distance < nearestDistance) {
 nearestCluster = cluster;
 nearestDistance = distance;
{code}

And in outputPointWithClusterInfo(), which is actually the same bit of code as 
this last one.
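For context, the getDistanceSquared() snippets above implement the sparse-distance identity from the LingPipe post: ||x - c||^2 = ||c||^2 + sum over the non-zero x_i of ((x_i - c_i)^2 - c_i^2), so only the non-zero entries of the sparse vector are visited. A self-contained sketch of the idea (dense centroid, map-backed sparse point; these are illustrative types, not the Mahout classes):

```java
import java.util.Map;
import java.util.TreeMap;

public class SparseDistanceSketch {
  // Dense centroid, sparse point represented as index -> non-zero value.
  static double distanceSquared(double[] centroid, Map<Integer, Double> point) {
    // ||c||^2 would be cached per centroid in the real optimization.
    double centroidLengthSquared = 0.0;
    for (double c : centroid) {
      centroidLengthSquared += c * c;
    }
    // Correction term: iterate only the non-zero entries of the point.
    double result = 0.0;
    for (Map.Entry<Integer, Double> e : point.entrySet()) {
      double centroidValue = centroid[e.getKey()];
      double delta = e.getValue() - centroidValue;
      result += (delta * delta) - (centroidValue * centroidValue);
    }
    return centroidLengthSquared + result;
  }

  public static void main(String[] args) {
    double[] centroid = {1.0, 2.0, 0.0};
    Map<Integer, Double> point = new TreeMap<>();
    point.put(0, 3.0); // point = (3, 0, 0)
    // ||x - c||^2 = (3-1)^2 + (0-2)^2 + 0 = 8
    System.out.println(distanceSquared(centroid, point)); // prints 8.0
  }
}
```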




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740814#action_12740814
 ] 

Grant Ingersoll commented on MAHOUT-121:


I committed an in-between version of MAHOUT-121-new-distance-optimization and 
MAHOUT-121-cluster-distance.patch. The former didn't pass tests due to not 
checking to see if the nearestCluster was null.  I did, however, move that 
check for null to the second part of the || clause, as it is the less frequent 
case and is likely to be short-circuited by the distance check.

Committed revision 802283.
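The reordering described above can be sketched as follows (a hypothetical extract, not the committed code): with the common case first, the null check is evaluated only when the distance comparison fails.

```java
public class ShortCircuitSketch {
  // Common case (a strictly closer cluster) first; the rare nearestCluster
  // null check is short-circuited whenever the distance test succeeds.
  static boolean isNewNearest(double distance, double nearestDistance,
                              Object nearestCluster) {
    return distance < nearestDistance || nearestCluster == null;
  }

  public static void main(String[] args) {
    // First iteration: nearestCluster is null, so the point is always taken.
    System.out.println(isNewNearest(3.0, 2.0, null)); // prints "true"
    // Later iterations: only a strictly smaller distance wins.
    System.out.println(isNewNearest(3.0, 2.0, new Object())); // prints "false"
  }
}
```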




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-06 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740382#action_12740382
 ] 

Grant Ingersoll commented on MAHOUT-121:


Nicolas,

I don't get the getStd() method now.  You moved the functionality from 
computeCentroid, but now getStd is never called, except in DisplayKMeans, which 
isn't part of the main execution path.  I also am not following the commenting 
out of the check to see if the two vectors are the same size, but then again, 
my eyes are glazing over in need of sleep.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-06 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12740420#action_12740420
 ] 

Shashikant Kore commented on MAHOUT-121:


Declaring variables outside the loop looks like an optimization, but I suppose 
the compiler will do that automatically. Even if it doesn't, the performance 
penalty is not significant enough to sacrifice code readability (which is 
arguable). I ran a simple test: a million loop iterations of float 
multiplication run in about 10 ms. Moving the declarations outside the loop 
didn't change the results.
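A rough sketch of such a microbenchmark (illustrative only: it ignores JIT warm-up, and the timing is machine-dependent):

```java
public class LoopBenchSketch {
  // Multiply a float n times with the loop variable declared inside the loop;
  // a variant with the declaration hoisted compiles to effectively the same code.
  static float multiplyLoop(int n) {
    float product = 1.0f;
    for (int i = 0; i < n; i++) {
      float f = 1.0000001f; // declared each iteration
      product *= f;
    }
    return product;
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    float result = multiplyLoop(1_000_000);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("result=" + result + " in " + elapsedMs + " ms");
  }
}
```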

The floating point comparison (lengthSquared != -1) is not correct. It should 
be (lengthSquared < 0), if we want to remove the Double object at all. 

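The suggested comparison amounts to caching the squared length in a primitive field with a negative sentinel. A hypothetical sketch (not the SparseVector code; it assumes a squared length is never negative):

```java
public class LengthCache {
  private final double[] values;
  private double lengthSquared = -1.0; // sentinel: not yet computed

  public LengthCache(double[] values) {
    this.values = values;
  }

  public double getLengthSquared() {
    if (lengthSquared < 0.0) { // safer than != -1.0 for floating point
      double sum = 0.0;
      for (double v : values) {
        sum += v * v;
      }
      lengthSquared = sum; // computed once, reused on later calls
    }
    return lengthSquared;
  }

  public static void main(String[] args) {
    LengthCache cache = new LengthCache(new double[]{3.0, 4.0});
    System.out.println(cache.getLengthSquared()); // prints 25.0
  }
}
```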




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12739712#action_12739712
 ] 

Grant Ingersoll commented on MAHOUT-121:


Nicolas, any word on your update?  Otherwise I will just apply my patch




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-01 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737875#action_12737875
 ] 

Grant Ingersoll commented on MAHOUT-121:


Hi Nicolas,

I'm getting an error applying your patch:
{quote}
 patch -p 0 -i ../patches/MAHOUT-121-distance-optimization.patch --dry-run
patching file 
core/src/main/java/org/apache/mahout/clustering/kmeans/Cluster.java
patch:  malformed patch at line 46: @@ -323,6 +327,10 @@
{quote}

I create my patches by going to the top directory of mahout and doing:
svn diff > ../mypatch.patch




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-08-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737956#action_12737956
 ] 

Nicolás Fantone commented on MAHOUT-121:


Um... weird. That's the way I created the patch, too. I'm going to revise it on 
Monday at work and upload a new one. (Maybe it's just an empty line or a 
newline that I erased from the original .java and that isn't translating 
correctly into the patch?)




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-07-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736627#action_12736627
 ] 

Nicolás Fantone commented on MAHOUT-121:


Hi all - I've just signed up to Jira!

Before committing this patch, could any of you take a look at my latest mail in 
the Mahout mailing list (it's a bit too long to copy/paste here)? With some 
enlightenment, I could create a patch from my work and we may be able to make 
bigger improvements for KMeans.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-07-28 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736036#action_12736036
 ] 

Grant Ingersoll commented on MAHOUT-121:


Looks like we missed some opportunities for using our new distance 
capabilities.  Will try to get a patch out soon.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-07-28 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736460#action_12736460
 ] 

Shashikant Kore commented on MAHOUT-121:


Grant,

The latest patch brings back the changes from my first patch. This can go in.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-26 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724548#action_12724548
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. I see, EuclideanDistanceMeasure still returns the sqrt value and other 
calculations are moved to SquaredEuclideanDistanceMeasure. If someone is 
interested in only relative measure, using SquaredEuclideanDistanceMeasure is a 
faster solution. The distance value returned by EuclideanDistanceMeasure still 
continues to be the same. Am I missing something here?

Nope, you're not missing anything.  I added the SquaredEDM class yesterday per 
this discussion.




Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-26 Thread Ted Dunning
No.  You didn't miss anything.

The point of EDM is to be strictly correct relative to the definition, to
not confuse people and yet still provide a pointer to SEDM.  The point of
SEDM is for speed.  This is all just as you said.

On Fri, Jun 26, 2009 at 6:29 AM, Shashikant Kore (JIRA) j...@apache.org wrote:

 Didn't get your question. I see, EuclideanDistanceMeasure still returns the
 sqrt value and other calculations are moved to
 SquaredEuclideanDistanceMeasure. If someone is interested in only relative
 measure, using SquaredEuclideanDistanceMeasure is a faster solution.  The
 distance value returned by EuclideanDistanceMeasure still continues to be
 the same. Am I missing something here?
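The split described here — a squared measure that does the real work, with the true Euclidean measure layered on top — might look roughly like this. These are simplified stand-ins for Mahout's DistanceMeasure classes, not the actual implementations.

```java
public class DistanceSketch {

  interface DistanceMeasure {
    double distance(double[] a, double[] b);
  }

  // SEDM: fast, no square root. Suitable when only the relative
  // ordering of distances matters (e.g. nearest-centroid assignment).
  static class SquaredEuclideanDistanceMeasure implements DistanceMeasure {
    @Override
    public double distance(double[] a, double[] b) {
      double sum = 0.0;
      for (int i = 0; i < a.length; i++) {
        double d = a[i] - b[i];
        sum += d * d;
      }
      return sum;
    }
  }

  // EDM: strictly correct relative to the definition. It merely wraps
  // the squared measure and applies sqrt at the end.
  static class EuclideanDistanceMeasure extends SquaredEuclideanDistanceMeasure {
    @Override
    public double distance(double[] a, double[] b) {
      return Math.sqrt(super.distance(a, b));
    }
  }

  public static void main(String[] args) {
    double[] a = {0.0, 0.0};
    double[] b = {3.0, 4.0};
    System.out.println(new SquaredEuclideanDistanceMeasure().distance(a, b)); // 25.0
    System.out.println(new EuclideanDistanceMeasure().distance(a, b));        // 5.0
  }
}
```

Since sqrt is monotonic, any "find the closest" comparison gives the same answer with either measure; only absolute distance values differ.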



Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-26 Thread Ted Dunning
Not irrelevant at all.  It is a very good point.

It means that the strict triangle optimization cannot be applied with the
no-square-root optimization.

There is a stricter bound that you can use with squared distances (r_12^2 -
3 r_x1^2 as opposed to r_12 - r_x1) but using that spoils the generality of
the k-means algorithm.  We could also push a method into some distances
using an additional interface something like HasTriangleBound and allow the
distance to compute the bound, but this is getting pretty far out in the
weeds.

On Fri, Jun 26, 2009 at 6:35 AM, Sean Owen (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724520#action_12724520]

 Sean Owen commented on MAHOUT-121:
 --

 This may be irrelevant -- haven't thought it through -- since someone
 mentioned using the triangle inequality to optimize some stuff earlier, I
 wonder if it is a problem that a squared-distance measure no longer
 satisfies this inequality? That is, it is not true that the square of one
 side is less than the sum of squares of the other two sides.
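The point above is easy to check numerically: true Euclidean distance obeys the triangle inequality, but squared distance does not. A tiny 1-D sketch with three collinear points 0, 1, 2 makes the violation concrete.

```java
public class TriangleCheck {
  // 1-D Euclidean distance and its square.
  static double d(double a, double b)  { return Math.abs(a - b); }
  static double d2(double a, double b) { double x = a - b; return x * x; }

  public static void main(String[] args) {
    // Euclidean: d(0,2) = 2 <= d(0,1) + d(1,2) = 1 + 1  (holds, with equality)
    System.out.println(d(0, 2) <= d(0, 1) + d(1, 2));     // true
    // Squared:   d2(0,2) = 4 >  d2(0,1) + d2(1,2) = 2   (inequality fails)
    System.out.println(d2(0, 2) <= d2(0, 1) + d2(1, 2));  // false
  }
}
```

So any pruning that relies on the triangle inequality must either work with true distances or use a bound adapted to squared distances, as the follow-up discusses.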




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723980#action_12723980
 ] 

Shashikant Kore commented on MAHOUT-121:


Grant, 

I was trying to verify the patch, but for some reason my hadoop setup refuses 
to run, failing with an "Error reading task output" message. I couldn't get it 
working despite a lot of effort, and will try again. If you could post the 
performance numbers before and after the distance calculation improvement, it 
would help to verify that they are in line with expectations.





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724075#action_12724075
 ] 

Grant Ingersoll commented on MAHOUT-121:


I haven't verified this, but intuitively, it doesn't seem all that 
unreasonable.  Presumably your vectors are a lot more sparse now.

BTW, I'll add the min #docs, % as options to the Driver.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724228#action_12724228
 ] 

Grant Ingersoll commented on MAHOUT-121:


BTW, any objection to me removing the sqrt calculation from 
EuclideanDistanceMeasure?




Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Ted Dunning
That is one option.

Another is to have a separate distance measure that doesn't have the square
root.

On Thu, Jun 25, 2009 at 12:40 PM, Grant Ingersoll (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724228#action_12724228]

 Grant Ingersoll commented on MAHOUT-121:
 

 BTW, any objection to me removing the sqrt calculation from
 EuclideanDistanceMeasure?




Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Ted Dunning
That is what I meant.

As for name, how about SquaredEuclideanDistance?

On Thu, Jun 25, 2009 at 1:10 PM, Grant Ingersoll gsing...@apache.org wrote:

 I suppose we could have a PseudoEuclideanDM class that doesn't do the
 square root from which the Euclidean merely wraps and gives back the sqrt.
  Actually, I like that solution, except the name part of it


Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Grant Ingersoll


On Jun 25, 2009, at 4:10 PM, Grant Ingersoll wrote:

On Jun 25, 2009, at 3:41 PM, Ted Dunning wrote:

That is one option.

Another is to have a separate distance measure that doesn't have the square
root.

Yeah, but that implies that we have one interface for all the other
DistanceMeasures and one for Euclidean, or, I suppose we could have a
PseudoEuclideanDM class that doesn't do the square root, which the Euclidean
merely wraps and gives back the sqrt.  Actually, I like that solution, except
the name part of it...

I'm going to go with the base class approach.  Seems like that should work.
Committing shortly.


Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Jeff Eastman

+1

Ted Dunning wrote:

That is what I meant.

As for name, how about SquaredEuclideanDistance?

On Thu, Jun 25, 2009 at 1:10 PM, Grant Ingersoll gsing...@apache.org wrote:

I suppose we could have a PseudoEuclideanDM class that doesn't do the
square root from which the Euclidean merely wraps and gives back the sqrt.
Actually, I like that solution, except the name part of it




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724278#action_12724278
 ] 

Grant Ingersoll commented on MAHOUT-121:


Committed revision 788504.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-24 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723732#action_12723732
 ] 

Grant Ingersoll commented on MAHOUT-121:


Note also, on that snapshot that I uploaded, I was running w/ MAHOUT-139.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721646#action_12721646
 ] 

Sean Owen commented on MAHOUT-121:
--

Since I am not hearing objections, and cognizant that people are waiting on 
this, I am going to commit. If there are issues we can roll back or tweak from 
there.




Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Ted Dunning
Shashikant,

How does k-means treat your data?

On Tue, Jun 16, 2009 at 10:12 PM, Shashikant Kore (JIRA) j...@apache.org wrote:

 This is voodoo for me. The dataset I am working with has a window of
 0.05 in which the result changes from 0 canopies to 3,000 canopies.



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720627#action_12720627
 ] 

Grant Ingersoll commented on MAHOUT-121:


After the patch, it was pretty fast, but like I said, I haven't evaluated 
correctness just yet.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720693#action_12720693
 ] 

Grant Ingersoll commented on MAHOUT-121:


FYI, Sean, can you generate the patch from within the trunk directory? That 
will make applying it easier for everyone else.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720694#action_12720694
 ] 

Grant Ingersoll commented on MAHOUT-121:


I'm getting a lot of failures when applying this patch.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Jeff Eastman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720764#action_12720764
 ] 

Jeff Eastman commented on MAHOUT-121:
-

A bit premature perhaps. There is still an unresolved cardinality() reference 
in 
/Mahout/utils/src/test/java/org/apache/mahout/utils/vectors/lucene/LuceneIterableTest.java
and the VectorTest testSparseVectorTimesX fails intermittently. AFAICT, none 
of the patches include /utils, and its compile fails after the unit tests and 
packaging have succeeded.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-17 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720810#action_12720810
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. I see, i did not even know this utils/ directory was added. I'll get it 
into my client and update it accordingly. 

Just this morning...  Can you tell I've got some time for Mahout lately?  ;-)

bq. To avoid confusion... I suppose let me keep the 'official' version of the 
patch for now since I need to make some fixes. It is still substantially 
identical to the last couple that have been posted, so, reviewing those is 
valid for anyone who cares.

All yours, but just so you know, this is a really hot topic for me and I've got 
a deadline associated with it, so I can help if need be.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-16 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720123#action_12720123
 ] 

Shashikant Kore commented on MAHOUT-121:


Apologies for posting incorrect results in previous comments. I had applied 
Sean's patch to code which already had the distance-calculation optimization.

When I applied this patch to the trunk, it took 2 hr 20 min to execute. I 
couldn't complete the run of the unpatched trunk code as there was trouble 
with my machine, but it still wasn't finished after 4 hours.

I will re-run and post the correct results. 





[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720159#action_12720159
 ] 

Sean Owen commented on MAHOUT-121:
--

While you are at it, what are you running to load test? That way I can focus 
on optimizing construction of the vectors in that path to achieve a hopefully 
better result.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-16 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720246#action_12720246
 ] 

Shashikant Kore commented on MAHOUT-121:


I have document vectors created from some internal text.  I create canopies 
from those vectors.  Centroid calculation is a suspect area. 




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-16 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720498#action_12720498
 ] 

Shashikant Kore commented on MAHOUT-121:


Grant, 

I am trying out the wikipedia vectors with the parameters you gave. It is 
running slowly: only 8% map completion after 18 minutes. Can you please post 
the performance before and after applying the OrderedIntDoubleMapping patch?

[OT] Also, I was wondering how you came up with the t1 and t2 values of 1.3 
and 1.0. This is voodoo for me. The dataset I am working with has a window of 
0.05 in which the result changes from 0 canopies to 3,000 canopies.
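For context on why t1 and t2 are so sensitive: in canopy clustering, points within t2 of a chosen center are removed from the candidate pool, while the looser t1 (> t2) governs which points get assigned to each canopy. A toy sketch of the center-selection loop (not Mahout's implementation; t1-based membership assignment is omitted):

```java
import java.util.ArrayList;
import java.util.List;

// Toy canopy center selection. Each pass removes every candidate within
// t2 of the new center, so a small change in t2 relative to the typical
// inter-point distance can swing the canopy count dramatically.
public class CanopySketch {

  static List<double[]> centers(List<double[]> points, double t2) {
    List<double[]> candidates = new ArrayList<>(points);
    List<double[]> centers = new ArrayList<>();
    while (!candidates.isEmpty()) {
      double[] center = candidates.remove(0);  // pick the next candidate
      centers.add(center);
      candidates.removeIf(p -> dist(p, center) < t2);  // prune nearby points
    }
    return centers;
  }

  static double dist(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}
```

If t2 is below nearly all pairwise distances, almost every point becomes a center; once t2 crosses the typical distance, the count collapses, which matches the narrow 0.05 window described above.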






[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-15 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719462#action_12719462
 ] 

Shashikant Kore commented on MAHOUT-121:


+1 to Grant's suggestion that we split this into two issues.

Sean, doubling the vector size looks aggressive. In a recent experiment, I ran 
into trouble due to such aggressive expansion. We could also optimize memory 
usage and SparseVector initialization for input document vectors by exposing 
some low-level APIs. When a document vector is read, we already know the 
number of elements in it, so it need not go through the iterative set() 
method; that avoids shifting array elements on every insertion. FastIntDouble 
could be initialized from two parallel arrays created by parsing the 
sparse-vector-formatted string.

I will try out your patch and report performance. 
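The bulk-initialization idea could look roughly like this (FastIntDouble is the name used in this thread, but the constructor and the "index:value" parser shown here are hypothetical):

```java
// Sketch of bulk initialization for an ordered int -> double mapping.
// Building from pre-sorted parallel arrays avoids the repeated element
// shifting that one-at-a-time set() calls would incur.
public class FastIntDouble {
  private final int[] indices;
  private final double[] values;

  // Caller guarantees indices are sorted ascending and both arrays have
  // the same length; no copying or shifting is needed.
  public FastIntDouble(int[] indices, double[] values) {
    if (indices.length != values.length) {
      throw new IllegalArgumentException("length mismatch");
    }
    this.indices = indices;
    this.values = values;
  }

  // Parse an "index:value index:value ..." formatted string directly
  // into the two parallel arrays (assumes indices appear in order).
  public static FastIntDouble parse(String formatted) {
    String[] pairs = formatted.trim().split("\\s+");
    int[] idx = new int[pairs.length];
    double[] val = new double[pairs.length];
    for (int i = 0; i < pairs.length; i++) {
      String[] parts = pairs[i].split(":");
      idx[i] = Integer.parseInt(parts[0]);
      val[i] = Double.parseDouble(parts[1]);
    }
    return new FastIntDouble(idx, val);
  }

  public double get(int index) {
    int pos = java.util.Arrays.binarySearch(indices, index);
    return pos >= 0 ? values[pos] : 0.0;  // absent entries are zero
  }
}
```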




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-15 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719546#action_12719546
 ] 

Shashikant Kore commented on MAHOUT-121:


Sean, 

Your patch has definitely improved the performance. For my test data set, it 
took 8 min 34 sec to generate canopies; the trunk code has been running for 40 
minutes and the Mapper is still only 85% complete. I increased the default 
size to 1024 and it took 5 min 21 sec.

But this is still higher than Trove's 3 min 34 sec. The array copy on 
addition of a new element to a vector seems to be the culprit; consider 
centroid calculation, where a centroid has thousands of feature items. This 
could be solved only with a hash implementation like Trove, which has 
(amortized) O(1) insertion and lookup time.
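The array-copy cost being described can be seen in a minimal sketch of an ordered parallel-array mapping (illustrative, not Mahout's actual OrderedIntDoubleMapping, which also grows capacity in chunks):

```java
import java.util.Arrays;

// Every insert of a new key into a sorted parallel-array mapping must
// shift the tail of both arrays, making each insert O(n). A primitive
// hash map (e.g. Trove's TIntDoubleHashMap) makes this amortized O(1),
// which matters for centroids accumulating thousands of terms.
public class OrderedInsert {
  int[] indices = new int[0];
  double[] values = new double[0];

  void set(int index, double value) {
    int pos = Arrays.binarySearch(indices, index);
    if (pos >= 0) {
      values[pos] = value;  // existing key: overwrite, no shifting
      return;
    }
    int insertAt = -(pos + 1);  // binarySearch encodes the insertion point
    int[] newIdx = new int[indices.length + 1];
    double[] newVal = new double[values.length + 1];
    System.arraycopy(indices, 0, newIdx, 0, insertAt);
    System.arraycopy(values, 0, newVal, 0, insertAt);
    newIdx[insertAt] = index;
    newVal[insertAt] = value;
    // The tail copy below is the O(n) cost paid for every new element.
    System.arraycopy(indices, insertAt, newIdx, insertAt + 1, indices.length - insertAt);
    System.arraycopy(values, insertAt, newVal, insertAt + 1, values.length - insertAt);
    indices = newIdx;
    values = newVal;
  }
}
```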






[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719632#action_12719632
 ] 

Sean Owen commented on MAHOUT-121:
--

That's good. I expect that, normally, vector updates are relatively 
infrequent; it should not take many such edits to build a vector during 
construction, for instance. What code does the work of creating the vectors, 
so I can add and integrate a faster mechanism there?

We could switch to a hashtable implementation later if that proves necessary.




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719094#action_12719094
 ] 

Sean Owen commented on MAHOUT-121:
--

Good point on the overflow, even if it would only happen after the vector had 
over a billion non-zero elements, taking 12GB+ of memory! Arrays.binarySearch() 
has this issue, so it's perhaps best to just keep a fixed version of binary 
search.

Yes, there needs to be a way to iterate over the elements directly and 
quickly. I'm thinking it best to just return the arrays, if we're going for 
speed.

Would it be useful to take a shot at rewriting SparseVector to use this?
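The overflow in question is the classic midpoint computation: `int mid = (low + high) / 2` goes negative once low + high exceeds Integer.MAX_VALUE. A fixed binary search, sketched here, computes the midpoint with an unsigned right shift instead:

```java
// Overflow-safe binary search over a sorted int array. The return-value
// encoding (-(insertionPoint) - 1 for misses) mirrors Arrays.binarySearch.
public class SafeBinarySearch {

  static int search(int[] sorted, int key) {
    int low = 0;
    int high = sorted.length - 1;
    while (low <= high) {
      // (low + high) >>> 1 stays correct even when low + high wraps
      // negative, because the unsigned shift halves the 32-bit sum.
      int mid = (low + high) >>> 1;
      if (sorted[mid] < key) {
        low = mid + 1;
      } else if (sorted[mid] > key) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return -(low + 1);  // key not found: encode the insertion point
  }
}
```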




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719117#action_12719117
 ] 

Grant Ingersoll commented on MAHOUT-121:


bq. Would it be useful to take a shot at rewriting SparseVector to use this?

You could do that, or an alternate implementation.  Is there any case where one 
wouldn't want this?  Also, I wouldn't mind a little better name than 
FastIntDouble.  ;-)




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719118#action_12719118
 ] 

Grant Ingersoll commented on MAHOUT-121:


Also, seems like we could split out the original two issues that Shashikant 
brought up, right?




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718966#action_12718966
 ] 

Sean Owen commented on MAHOUT-121:
--

Here's a start at that last suggestion. Anyone care to see it finished? I can 
hack it out.

public final class FastIntDouble {

  private static final double DEFAULT_VALUE = 0.0;

  private int[] indices;
  private double[] values;
  private int size;

  public FastIntDouble(int capacity) {
    indices = new int[capacity];
    values = new double[capacity];
    size = 0;
  }

  private void growTo(int newCapacity) {
    if (newCapacity > indices.length) {
      int[] newIndices = new int[newCapacity];
      System.arraycopy(indices, 0, newIndices, 0, size);
      indices = newIndices;
      double[] newValues = new double[newCapacity];
      System.arraycopy(values, 0, newValues, 0, size);
      values = newValues;
    }
  }

  private int find(int index) {
    int low = 0;
    int high = size - 1;
    while (low <= high) {
      // unsigned shift avoids int overflow of (low + high) for huge arrays
      int mid = (low + high) >>> 1;
      int midVal = indices[mid];
      if (midVal < index) {
        low = mid + 1;
      } else if (midVal > index) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return -(low + 1);
  }

  public double get(int index) {
    int offset = find(index);
    return offset >= 0 ? values[offset] : DEFAULT_VALUE;
  }

  public void set(int index, double value) {
    int offset = find(index);
    if (offset >= 0) {
      if (value == DEFAULT_VALUE) {
        // remove the entry at offset by shifting the tail left
        System.arraycopy(indices, offset + 1, indices, offset, size - offset - 1);
        System.arraycopy(values, offset + 1, values, offset, size - offset - 1);
        size--;
      } else {
        values[offset] = value;
      }
    } else {
      if (value != DEFAULT_VALUE) {
        if (size >= indices.length) {
          growTo(Math.max(1, size << 1)); // guard against zero initial capacity
        }
        int at = -offset - 1;
        if (size > at) {
          System.arraycopy(indices, at, indices, at + 1, size - at);
          System.arraycopy(values, at, values, at + 1, size - at);
        }
        indices[at] = index;
        values[at] = value;
        size++;
      }
    }
  }

}




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-12 Thread Stephen Green (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719016#action_12719016
 ] 

Stephen Green commented on MAHOUT-121:
--

Don't forget that there is java.util.Arrays.binarySearch (and it looks like 
they've improved it since the last time I looked).

Also, don't forget Josh Bloch's point about binary search: for sufficiently 
large arrays, low + high might overflow an int, which will make your binary 
search unhappy.

Also, also, if you're going to do a lot of finds in a row, you might want to 
provide an iterator-returning method that yields the positions in the array 
one by one, since iterating through the array once will be faster than many 
binary searches.
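
That merge-style iteration can be sketched as a dot product over two 
index-sorted sparse vectors. This is only an illustration of the idea, not 
code from any attached patch; the class and method names are made up:

```java
// Merge-style dot product over two index-sorted sparse vectors, as an
// alternative to one binary search per element. Illustrative sketch only.
public final class SparseDot {

  // indicesA/indicesB must be sorted ascending; parallel to valuesA/valuesB.
  public static double dot(int[] indicesA, double[] valuesA,
                           int[] indicesB, double[] valuesB) {
    double sum = 0.0;
    int i = 0;
    int j = 0;
    // Advance the pointer with the smaller index; multiply on a match.
    while (i < indicesA.length && j < indicesB.length) {
      if (indicesA[i] < indicesB[j]) {
        i++;
      } else if (indicesA[i] > indicesB[j]) {
        j++;
      } else {
        sum += valuesA[i++] * valuesB[j++];
      }
    }
    return sum;
  }
}
```

This runs in O(nA + nB) per pair of vectors, versus O(nA log nB) for repeated 
binary searches.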





Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-11 Thread Ted Dunning
I just asked Luc M.   Should have an answer shortly.

On Wed, Jun 10, 2009 at 5:30 AM, Grant Ingersoll gsing...@apache.orgwrote:


 On Jun 9, 2009, at 5:46 PM, Benson Margulies (JIRA) wrote:


 -

 I could make you a fast sparse vector, but I thought you wanted to wait
 for MTJ?


 I guess the question becomes when is that going to happen?  I don't
 follow Commons Math, so does anyone else know?



[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-10 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718026#action_12718026
 ] 

Shashikant Kore commented on MAHOUT-121:


That's right, Grant. Some simple tests showed that autoboxing can take 20x CPU 
and 5x memory compared to operations on primitives.

For sparse vectors, even Trove leaves some room for improvement. The input 
document vectors are read-only for the clustering code, in which case a sparse 
vector could simply be two parallel arrays. Only the sparse vectors used to 
generate centroids need to be expandable.
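
The read-only case described above could look something like the following. 
This is a hypothetical sketch of the two-array idea, not code from the patch; 
the class name is invented for illustration:

```java
import java.util.Arrays;

// Hypothetical read-only sparse vector as two parallel arrays: document
// vectors never change during clustering, so no resizing logic is needed.
public final class ImmutableSparseVector {

  private final int[] indices;   // sorted ascending
  private final double[] values; // parallel to indices

  public ImmutableSparseVector(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  public double get(int index) {
    // Arrays.binarySearch returns a negative insertion point on a miss.
    int offset = Arrays.binarySearch(indices, index);
    return offset >= 0 ? values[offset] : 0.0;
  }
}
```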






Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-10 Thread Grant Ingersoll


On Jun 9, 2009, at 5:46 PM, Benson Margulies (JIRA) wrote:



-

I could make you a fast sparse vector, but I thought you wanted to  
wait for MTJ?


I guess the question becomes when is that going to happen?  I don't  
follow Commons Math, so does anyone else know?


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-09 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717832#action_12717832
 ] 

Benson Margulies commented on MAHOUT-121:
-

I could make you a fast sparse vector, but I thought you wanted to wait for MTJ?




[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-05-22 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712193#action_12712193
 ] 

Ted Dunning commented on MAHOUT-121:


 Licensing of Trove seems to be an issue which needs to be addressed.

The licensing of Trove is a non-starter.

Typically, however, you can limit the use to a few cases like Map&lt;Integer, 
Double&gt; and then simply re-implement, or find an Apache-compatible 
implementation.

There is no magic involved in an open hash map in this kind of case.  Even just 
using two arrays with binary search could be acceptable.
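
The two-array layout also combines well with the distance optimization from 
the LingPipe post referenced in the issue: precompute each centroid's squared 
norm once, then each distance needs only a single pass over the document's 
nonzeros. A rough sketch under those assumptions (names are illustrative, not 
from any patch):

```java
// Squared Euclidean distance between a sparse document and a dense centroid,
// using |x - c|^2 = |x|^2 - 2 x.c + |c|^2 with |c|^2 precomputed per centroid.
public final class SparseEuclidean {

  public static double squaredDistance(int[] docIndices, double[] docValues,
                                       double[] centroid, double centroidNormSq) {
    double dot = 0.0;
    double docNormSq = 0.0;
    // One pass over the document's nonzero entries only.
    for (int i = 0; i < docIndices.length; i++) {
      double v = docValues[i];
      docNormSq += v * v;
      dot += v * centroid[docIndices[i]];
    }
    return docNormSq - 2.0 * dot + centroidNormSq;
  }
}
```

The cost per distance becomes proportional to the document's nonzero count 
rather than the full dimensionality, which matches the kind of speedup 
reported in the issue description.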
