[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776921#action_12776921
 ] 

Sean Owen commented on MAHOUT-103:
--

Re-post an updated patch and I'll be happy to give my comments on it. The more
the merrier. If it's basically sound I'd like to mention it in the forthcoming
book which I'm writing now.

I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop the
rating. The framework can also do this automatically in the DataModel, if you
like.

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.
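
As a rough illustration of the idea in the description above, the following Java sketch counts how often two items appear in the same user's history and ranks the neighbours of an item by that count. The names (`CooccurrenceSketch`, `countPairs`, `nearest`) are hypothetical and are not taken from the attached patch; this is a small in-memory sketch, not a map/reduce implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory sketch; names are illustrative, not from jira-103.patch.
final class CooccurrenceSketch {

  // For each item, count how many users have it together with each other item.
  static Map<String, Map<String, Integer>> countPairs(List<Set<String>> itemsPerUser) {
    Map<String, Map<String, Integer>> counts = new HashMap<>();
    for (Set<String> items : itemsPerUser) {
      for (String a : items) {
        for (String b : items) {
          if (!a.equals(b)) {
            counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
          }
        }
      }
    }
    return counts;
  }

  // Return up to n items that co-occur most often with the given item.
  static List<String> nearest(Map<String, Map<String, Integer>> counts, String item, int n) {
    Map<String, Integer> row = counts.getOrDefault(item, new HashMap<>());
    List<String> neighbours = new ArrayList<>(row.keySet());
    // Sort by descending count, breaking ties by name for a stable order.
    neighbours.sort((x, y) -> {
      int byCount = Integer.compare(row.get(y), row.get(x));
      return byCount != 0 ? byCount : x.compareTo(y);
    });
    return new ArrayList<>(neighbours.subList(0, Math.min(n, neighbours.size())));
  }

  public static void main(String[] args) {
    List<Set<String>> users = new ArrayList<>();
    users.add(new HashSet<>(Arrays.asList("A", "B", "C")));
    users.add(new HashSet<>(Arrays.asList("A", "B")));
    users.add(new HashSet<>(Arrays.asList("B", "C")));
    Map<String, Map<String, Integer>> counts = countPairs(users);
    System.out.println(nearest(counts, "A", 2));
  }
}
```

The same counting works for user neighbourhoods by swapping the roles of users and items.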

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776924#action_12776924
 ] 

Sean Owen commented on MAHOUT-165:
--

IntDoubleHash right? We could look at that, but I thought the status here was 
that Colt worked just fine and fast. Perhaps I miss something but I don't see a 
remaining issue with using (part of) Colt.

I somehow strongly suspect we will benefit from not reinventing a wheel here, 
and whatever we need can be done with Colt, plus perhaps some contributed 
changes, plus a custom implementation here and there.

+1 for Whatever Is Needed To Use Colt?

 Using better primitives hash for sparse vector for performance gains
 

 Key: MAHOUT-165
 URL: https://issues.apache.org/jira/browse/MAHOUT-165
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Fix For: 0.3

 Attachments: colt.jar, mahout-165-trove.patch, 
 MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch


 In SparseVector, we need a primitive hash map for indices and values. The
 present implementation of this hash map is not as efficient as some of the
 other implementations in non-Apache projects.
 In an experiment, I found that, for get/set operations, the primitive hash of
 Colt performs an order of magnitude better than OrderedIntDoubleMapping.
 For iteration it is 2x slower, though.
 Using Colt in SparseVector improved the performance of canopy generation. For
 an experimental dataset, the current implementation takes 50 minutes. Using
 Colt reduces this to 19-20 minutes, roughly a 60% reduction in running
 time.
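
To make the get/set versus iteration trade-off more concrete, here is a minimal open-addressing int-to-double map in Java. It is only a sketch under stated assumptions: the class name `IntDoubleHashSketch`, the power-of-two capacity, linear probing and the 0.5 load factor are illustrative choices, and this is neither Colt's implementation nor Mahout's.

```java
import java.util.Arrays;

// Illustrative only: this is neither Colt's hash map nor Mahout code.
final class IntDoubleHashSketch {
  private static final int FREE = -1; // marks an empty slot; real keys are >= 0
  private int[] keys;
  private double[] values;
  private int size;

  IntDoubleHashSketch(int capacity) {
    if (capacity < 2 || Integer.bitCount(capacity) != 1) {
      throw new IllegalArgumentException("capacity must be a power of two >= 2");
    }
    keys = new int[capacity];
    values = new double[capacity];
    Arrays.fill(keys, FREE);
  }

  int size() {
    return size;
  }

  // Linear probing: start at the hashed slot, walk until the key or a free slot is found.
  private int slotFor(int key) {
    int mask = keys.length - 1;
    int i = (key ^ (key >>> 16)) & mask;
    while (keys[i] != FREE && keys[i] != key) {
      i = (i + 1) & mask;
    }
    return i;
  }

  void put(int key, double value) {
    if (key < 0) {
      throw new IllegalArgumentException("index must be non-negative");
    }
    if ((size + 1) * 2 > keys.length) {
      grow(); // keep the load factor at or below 0.5 so probing always terminates
    }
    int i = slotFor(key);
    if (keys[i] == FREE) {
      keys[i] = key;
      size++;
    }
    values[i] = value;
  }

  // Missing indices read as 0.0, as in a sparse vector.
  double get(int key) {
    if (key < 0) {
      return 0.0;
    }
    int i = slotFor(key);
    return keys[i] == FREE ? 0.0 : values[i];
  }

  // Iteration has to scan every slot, including the empty ones.
  double sum() {
    double total = 0.0;
    for (int i = 0; i < keys.length; i++) {
      if (keys[i] != FREE) {
        total += values[i];
      }
    }
    return total;
  }

  private void grow() {
    int[] oldKeys = keys;
    double[] oldValues = values;
    keys = new int[oldKeys.length * 2];
    values = new double[oldValues.length * 2];
    Arrays.fill(keys, FREE);
    size = 0;
    for (int j = 0; j < oldKeys.length; j++) {
      if (oldKeys[j] != FREE) {
        put(oldKeys[j], oldValues[j]);
      }
    }
  }

  public static void main(String[] args) {
    IntDoubleHashSketch map = new IntDoubleHashSketch(4);
    map.put(3, 1.5);
    map.put(1000, 2.5);
    System.out.println(map.get(3) + " " + map.get(7) + " " + map.sum());
  }
}
```

Lookups touch only a few slots, while `sum()` walks the whole table. That is one plausible reason a hash can win on get/set yet lose to a packed, ordered array on iteration, though the measurements quoted above are the reporter's and are not reproduced here.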




Re: 0.2 status

2009-11-12 Thread deneche abdelhakim
please use Decision Forests instead of Random Forests



On Thu, Nov 12, 2009 at 9:01 AM, Robin Anil robin.a...@gmail.com wrote:
 Please edit/add stuff.

 Robin


 ==

 Apache Mahout 0.2 has been released and is now available for public
 download. Apache Mahout is a subproject of Apache Lucene with the goal
 of delivering scalable machine learning algorithm implementations
 under the Apache license.
 link
 Mahout is a machine learning library meant to scale to the size of
 data we manage today. Built on top of the powerful map/reduce paradigm
 of Apache Hadoop project, Mahout lets you run popular machine learning
 methods like clustering, collaborative filtering, classification over
 Terabytes of data over thousands of computers.

 The complete changelist can be found here:
 http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278

 New Mahout 0.2 features include

 - Major performance enhancements in Collaborative Filtering,
 Classification and Clustering
 - New: Latent Dirichlet Allocation(LDA) implementation for topic modelling
 - New: Frequent Itemset Mining for mining top-k patterns from a list
 of transactions
 - New: Random Forests implementation for Decision Tree classification
 (In Memory & Partial Data)
 - New: HBase storage support for Naive Bayes model building and classification
 - New: Generation of vectors from Text documents for use with Mahout 
 Algorithms
 - Performance improvements in various Vector implementations
 - Tons of bug fixes and code cleanup



 On Thu, Nov 12, 2009 at 9:06 AM, Grant Ingersoll gsing...@apache.org wrote:

 Anyone care to write up a release announcement?  Here's Solr's: 
 http://lucene.grantingersoll.com/2009/11/10/apache-solr-1-4-0-offically-released/

 I've cleaned up the build quite a bit and am now testing preparing the 
 artifacts w/ the much simpler build (no more installing third party libs, 
 they are all up under o.a.mahout in the Maven repo).  I'd like to have 
 everything ready to go once the artifacts are put up for a vote.

 Thanks,
 Grant



Re: 0.2 status

2009-11-12 Thread Isabel Drost

Adding and revising a little:

Apache Mahout 0.2 has been released and is now available for public
download at http://www.apache.org/dyn/closer.cgi/lucene/mahout

Up to date maven artifacts can be found in the Apache repository at
https://repository.apache.org/content/repositories/releases/org/apache/mahout/


Apache Mahout is a subproject of Apache Lucene with the goal
of delivering scalable machine learning algorithm implementations
under the Apache license. http://www.apache.org/licenses/LICENSE-2.0

 Mahout is a machine learning library meant to scale to the size of
 data we manage today. Built on top of the powerful map/reduce
 paradigm of Apache Hadoop project, Mahout lets you run popular
 machine learning methods like clustering, collaborative filtering,
 classification over Terabytes of data over thousands of computers.

 - We may want to emphasize that using Mahout makes sense also for
 those people that do not have clusters with thousands of nodes?

Mahout is a machine learning library meant to scale: Scale in terms of
community to support anyone interested in using machine learning. Scale
in terms of business by providing the library under a commercially
friendly, free software license. Scale in terms of computation to the
size of data we manage today.

Built on top of the powerful map/reduce paradigm of the Apache Hadoop
project, Mahout lets you solve popular machine learning problem
settings like clustering, collaborative filtering and classification
over Terabytes of data over thousands of computers.

Implemented with scalability in mind, the latest release brings many
performance optimizations so that even in a single node setup the
library performs well.

 - As mentioned earlier by Grant, we do need performance benchmarks at
 least for the next release to prove that.


The complete changelist can be found here:
http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278

New Mahout 0.2 features include
 
- Major performance enhancements in Collaborative Filtering,
Classification and Clustering
- New: Latent Dirichlet Allocation (LDA) implementation for topic
modelling
- New: Frequent Itemset Mining for mining top-k patterns from a list
of transactions
- New: Decision Forests implementation for Decision Tree classification
(In Memory & Partial Data)
- New: HBase storage support for Naive Bayes model building and
classification
- New: Generation of vectors from Text documents for use with Mahout
Algorithms
- Performance improvements in various Vector implementations
- Tons of bug fixes and code cleanup

Getting started: New to Mahout? 

1) Download Mahout at http://www.apache.org/dyn/closer.cgi/lucene/mahout
2) Check out the Quick start:
http://cwiki.apache.org/MAHOUT/quickstart.html 

3) Read the Mahout Wiki: http://cwiki.apache.org/MAHOUT
4) Join the community by subscribing to mahout-u...@lucene.apache.org
5) Give back: http://www.apache.org/foundation/getinvolved.html
6) Consider adding yourself to the powered by Wiki page:
http://cwiki.apache.org/MAHOUT/poweredby.html

For more information on Apache Mahout, see
http://lucene.apache.org/mahout


Additional comment: I suppose I will copy this over to my personal
blog once the release is out. I would like to invite those interested
in or using Mahout to do so as well.




[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-11-12 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776945#action_12776945
 ] 

Jake Mannix commented on MAHOUT-165:


Well, I've always had good luck with Colt, but at least Ted seemed to feel that 
Colt was no longer state of the art, but maybe he can chime in and elaborate. 
 




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776939#action_12776939
 ] 

Ankur commented on MAHOUT-103:
--

> Re-post an updated patch

Sure, I'll have the updated code coming by early next week.

> If it's basically sound I'd like to mention it

+10. The more people know about it, the better the chances it has of being used :-)

> I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop
> the rating ...

Simply dropping the rating might introduce too much noise. I was thinking of
keeping only those that have ratings > 2.5 (or > 2 to be more liberal).
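
The two policies discussed here (drop every rating value, or keep only the items rated above a cut-off) can be sketched as follows. This assumes a strict greater-than comparison, and the class, method and item names are hypothetical; it does not use the Mahout DataModel.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout or of the patch under discussion.
final class RatingThresholdSketch {

  // Policy A: ignore rating values and keep every item the user rated.
  static Set<String> allRated(Map<String, Double> ratings) {
    return new TreeSet<>(ratings.keySet());
  }

  // Policy B: keep only items rated strictly above the threshold.
  static Set<String> ratedAbove(Map<String, Double> ratings, double threshold) {
    Set<String> kept = new TreeSet<>();
    for (Map.Entry<String, Double> e : ratings.entrySet()) {
      if (e.getValue() > threshold) {
        kept.add(e.getKey());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, Double> ratings = new LinkedHashMap<>();
    ratings.put("item1", 4.5);
    ratings.put("item2", 2.5);
    ratings.put("item3", 1.0);
    System.out.println(allRated(ratings));        // all three items
    System.out.println(ratedAbove(ratings, 2.5)); // only item1
    System.out.println(ratedAbove(ratings, 2.0)); // item1 and item2
  }
}
```

Either set can then be fed to the same co-occurrence counting, which makes it straightforward to compare the two policies on a real data set.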




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776951#action_12776951
 ] 

Sean Owen commented on MAHOUT-103:
--

That last point is interesting. Another school of thought is that rating 
something, even negatively, suggests you have a closer association to that 
thing than to the millions of other things you've never heard of.

Let's say you rate Bach a 5 and Brahms a 4 and Mendelssohn a 1.5. Would you 
rather recommend a Mendelssohn recording to this person, or death metal?

This is my understanding of the intuition I've gotten from Ted, and seems to 
bear out somewhat in practice, that ratings have a lot less info than one would 
think.

Well it's obviously something one can evaluate within the framework with the 
evaluator code to decide for sure.




Re: 0.2 status

2009-11-12 Thread Sean Owen
It all sounds fine to me.

On Thu, Nov 12, 2009 at 9:54 AM, Isabel Drost isa...@apache.org wrote:

 Adding and revising a little:



Re: Dependencies outside Maven central (Was: Oh joy)

2009-11-12 Thread Grant Ingersoll
That's weird, that is the default Maven repository, I wouldn't think  
you would need to add it.


On Nov 12, 2009, at 5:30 AM, Isabel Drost wrote:


On Wed, 11 Nov 2009 18:23:50 -0500
Grant Ingersoll gsing...@apache.org wrote:


https://issues.apache.org/jira/browse/MAHOUT-198 tracks this.  I am
committing now.  Please check out, delete your ~/.m2/repository/org
directory and try mvn clean install!


Trashed my local maven repo, checked out and built again - it did not
find the lucene-2.9.1 release. Adding the following repository to the
pom fixed that problem for me:

Index: maven/pom.xml
===
--- maven/pom.xml   (revision 835320)
+++ maven/pom.xml   (working copy)
@@ -76,6 +76,12 @@
       <layout>default</layout>
     </repository>
     <repository>
+      <id>maven2-repository.maven.org</id>
+      <name>Maven.org Repository for Maven</name>
+      <url>http://repo1.maven.org/maven2/</url>
+      <layout>default</layout>
+    </repository>
+    <repository>
       <id>Apache snapshots</id>
       <url>http://people.apache.org/maven-snapshot-repository/</url>
       <snapshots>

After that the build runs smoothly for me. Big thanks to you, Grant, for
resolving all those issues.

Isabel


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776962#action_12776962
 ] 

Sean Owen commented on MAHOUT-198:
--

The only glitch I see is that the Java Mail 1.4 pom is invalid. It's a doc like 
this:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://download.java.net/maven/1/javax.mail/poms/mail-1.4.pom">here</a>.</p>
<hr>
<address>Apache Server at maven-repository.dev.java.net Port 443</address>
</body></html>

Obviously that's not within our control. I tried manually copying the updated
pom file, though I then still have problems with other dependencies.

Anyone seeing this? I wonder if there is a way to see why Java Mail is included
as a dependency? We shouldn't have anything to do with it directly.

 Cleanup pom, remove lib dependencies, etc.
 --

 Key: MAHOUT-198
 URL: https://issues.apache.org/jira/browse/MAHOUT-198
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: mahout-198.core-lib.patch


 This patch cleans up the poms to not do install.  It removes the core/lib 
 directory.  I have published the necessary artifacts to our Mahout Maven repo 
 already, so they should be publicly available.
 See http://cwiki.apache.org/confluence/display/MAHOUT/ThirdPartyDependencies 
 and 
 http://www.lucidimagination.com/search/document/6f026182150f0f50/dependencies_outside_maven_central_was_oh_joy




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776966#action_12776966
 ] 

Ankur commented on MAHOUT-103:
--

In that case dropping ratings might not be such a good idea and may lead to bad 
results. Consider the following movies that a user might have seen with the 
scores

Matrix - 4.5
Matrix Reloaded - 2.5
Matrix Revolutions - 2

Assuming that a lot of people have watched these movies and didn't like the 
subsequent two versions, they still will get high similarity scores w.r.t 
Matrix going purely by co-occurrence. IMHO, that leaves us with the following 
2 alternatives :-

1. Add the ratings when counting co-occurrence and hope that better ones will 
stand out even if they co-occur less.
2. Apply a Re-scorer that re-ranks the similar items for a given item
based on their average scores.

Point 1 is something I am thinking of trying out.   
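
The two alternatives listed above can be sketched as follows. This is one possible reading, not the patch's implementation: for alternative 1 each co-occurrence adds the neighbour's rating instead of 1, and for alternative 2 an item's average rating is computed so it could be used to re-rank neighbours. The class and method names are hypothetical, and "Other Film" with its rating is made-up sample data added to the Matrix example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two alternatives; not taken from jira-103.patch.
final class WeightedCooccurrenceSketch {

  // Alternative 1 (one reading): add the neighbour's rating instead of a plain count.
  static Map<String, Double> weightedCounts(List<Map<String, Double>> users, String item) {
    Map<String, Double> scores = new HashMap<>();
    for (Map<String, Double> ratings : users) {
      if (!ratings.containsKey(item)) {
        continue; // this user never rated the query item
      }
      for (Map.Entry<String, Double> e : ratings.entrySet()) {
        if (!e.getKey().equals(item)) {
          scores.merge(e.getKey(), e.getValue(), Double::sum);
        }
      }
    }
    return scores;
  }

  // Alternative 2 building block: average rating of an item, usable for re-ranking.
  static double averageRating(List<Map<String, Double>> users, String item) {
    double sum = 0.0;
    int n = 0;
    for (Map<String, Double> ratings : users) {
      Double r = ratings.get(item);
      if (r != null) {
        sum += r;
        n++;
      }
    }
    return n == 0 ? 0.0 : sum / n;
  }

  // Sample data: the first user is the example from the comment, the second is invented.
  static List<Map<String, Double>> sampleUsers() {
    Map<String, Double> u1 = new HashMap<>();
    u1.put("Matrix", 4.5);
    u1.put("Matrix Reloaded", 2.5);
    u1.put("Matrix Revolutions", 2.0);
    Map<String, Double> u2 = new HashMap<>();
    u2.put("Matrix", 4.0);
    u2.put("Matrix Reloaded", 2.0);
    u2.put("Other Film", 5.0);
    List<Map<String, Double>> users = new ArrayList<>();
    users.add(u1);
    users.add(u2);
    return users;
  }

  public static void main(String[] args) {
    List<Map<String, Double>> users = sampleUsers();
    System.out.println(weightedCounts(users, "Matrix"));
    System.out.println(averageRating(users, "Matrix Reloaded"));
  }
}
```

In this sample, "Other Film" co-occurs with Matrix only once but accumulates a higher weighted score than "Matrix Reloaded", which co-occurs twice. That is the effect alternative 1 hopes for; whether it holds on real data would need to be evaluated.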
 




[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776986#action_12776986
 ] 

Sean Owen commented on MAHOUT-103:
--

What's the problem in this example? Two people that have both seen all three 
Matrix films are probably similar. All the more so if they've rated the first 
one highly and the other two poorly. You'd correctly identify them as similar 
with or without ratings here.

The issue, I suppose, comes up when you encounter someone who didn't like the 
first one and liked the other two (strange, I know). Without pref values, we'd 
draw the same conclusion -- they have some similarity. With pref values, most 
metrics would say they are very dissimilar.

I actually think that's the wrong conclusion! The fact that two people bothered 
to watch all three says much more about their similarities than the variance in 
ratings says about their differences. I'd still guess they're sorta-similar, 
and metrics without pref values would tend to draw the more correct conclusion.


Of course there's no one right answer, and we can easily construct situations 
where throwing out pref values indeed hurts the result. I'm only asserting that 
it's entirely possible, in real data sets, for ratings to *hurt* on the whole. 
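
A small worked example of the contrast described above, in standalone Java. It does not use Mahout's similarity classes, and the second user's ratings are invented for illustration. Pearson correlation stands in for "most metrics" with pref values, and the Tanimoto (Jaccard) coefficient stands in for a metric without them.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Standalone sketch; the ratings for the second user are made up.
final class PrefVersusNoPrefSketch {

  // Pearson correlation over items both users rated (arrays aligned by item).
  static double pearson(double[] a, double[] b) {
    int n = a.length;
    double meanA = 0.0;
    double meanB = 0.0;
    for (int i = 0; i < n; i++) {
      meanA += a[i];
      meanB += b[i];
    }
    meanA /= n;
    meanB /= n;
    double cov = 0.0;
    double varA = 0.0;
    double varB = 0.0;
    for (int i = 0; i < n; i++) {
      double da = a[i] - meanA;
      double db = b[i] - meanB;
      cov += da * db;
      varA += da * da;
      varB += db * db;
    }
    return cov / Math.sqrt(varA * varB);
  }

  // Tanimoto coefficient over the sets of items each user has seen.
  static double tanimoto(Set<String> x, Set<String> y) {
    Set<String> common = new HashSet<>(x);
    common.retainAll(y);
    int union = x.size() + y.size() - common.size();
    return union == 0 ? 0.0 : (double) common.size() / union;
  }

  public static void main(String[] args) {
    // Order: Matrix, Matrix Reloaded, Matrix Revolutions.
    double[] likedFirst = {4.5, 2.5, 2.0};
    double[] likedSequels = {2.0, 4.0, 4.5};
    Set<String> seen = new HashSet<>(
        Arrays.asList("Matrix", "Matrix Reloaded", "Matrix Revolutions"));
    System.out.println("Pearson:  " + pearson(likedFirst, likedSequels));
    System.out.println("Tanimoto: " + tanimoto(seen, seen));
  }
}
```

With these numbers the rating-based metric reports the two users as maximally dissimilar (correlation of about -1), while the set-based metric reports them as identical (1.0). Which answer is closer to the truth is the open question in this thread.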


Let's start by adding the basic approach and then keep going to look at 
variations. I at least have some global knowledge of how the framework is set 
up and could help design in these variations in a way that's consistent with 
the framework.






[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776993#action_12776993
 ] 

Sean Owen commented on MAHOUT-198:
--

OK, after another wipe of my .m2 directory, this went away. Then it complained 
about many missing artifacts, some quite basic-looking ones.

I also added that repository stanza -- think we should check that in, cool?

And ran back into the mail .pom issue.

After another wipe of .m2, I was back to just missing the Mahout 0.3 SNAPSHOT 
artifact. OK, that's normal right? So I have to mvn install rather than mvn 
compile still. That works.




[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777003#action_12777003
 ] 

Grant Ingersoll commented on MAHOUT-198:


Yep, still require mvn install

As for Java Mail, do we know which 3rd party lib has the dependency on Java 
Mail?  We may need to put in an exclusion.




[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777004#action_12777004
 ] 

Drew Farris commented on MAHOUT-198:


From the output of 'mvn dependency:tree' it appears that the culprit is log4j 
1.2.15




[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777005#action_12777005
 ] 

Grant Ingersoll commented on MAHOUT-198:


OK, I have a fix for the mail thing.  Checking in shortly.




[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777023#action_12777023
 ] 

Drew Farris commented on MAHOUT-198:


works fine for me.




[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-11-12 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777050#action_12777050
 ] 

Ted Dunning commented on MAHOUT-165:



My issues with Colt (which I used for quite some time) were probably either
remediable or irrelevant.

The remediable problem was that the API was opaque for new-comers and very 
difficult to extend with new matrix implementations.  If we take Colt as a 
starting point and fix some of the extension and opacity issues, then this 
problem goes away.

My second issue is that more modern libraries like MTJ can achieve about 4x the 
raw performance of Colt.  As Grant rightly points out, that probably doesn't 
matter to us right away since the goal here is scaling rather than raw hot-iron 
performance on a single box.  Moreover, as Grant also points out, we will have 
a pluggable interface which should allow us to switch if the commons math guys 
ever come around.






[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-11-12 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777082#action_12777082
 ] 

Jake Mannix commented on MAHOUT-165:


Ok then, let's try out Colt, unless we have a more permissive policy in here 
about MTJ than the c-math guys have: they didn't want MTJ because using it 
required either including a jar file of the output of f2j translations of some 
Fortran code... which is ok for us as long as it's apache-compatible, since we 
don't have the hard no external dependencies requirement that they have.  

What Shashi wrote before was this, when he attached the modified colt jar:

bq. Jar for Colt after removing the LGPL code of hep.aida and the dependent
classes. The classes in colt.matrix.* are removed as they require hep.aida.

I actually stripped the hep.aida.* dependencies out of even the colt.matrix.*
classes in Colt in my local git repo, which keeps pretty much all of the
functionality intact.  I can make an updated patch which has the full source
code for that, so that we can include it instead of just having a jar.

Do we want to try comparing both MTJ and Colt?

Also: do we think our linear API is complete enough to solidify on as a 
wrapper for whatever is plugged in underneath?  Some of the changes which have 
been discussed in other tickets and on the list are

* pulling Writable off of the interface, so that not every impl is hooked into 
such a coupling to Hadoop, then wrapping it with a Writable wrapper / subclass 
to add that functionality
* the double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map) and 
double aggregate(Vector other, BinaryDoubleFunction aggregator, 
BinaryDoubleFunction map) methods for abstracting away inner products and 
norms.  Not necessary, but very easily implemented in AbstractVector so that 
nobody needs to worry about these methods if they don't like programming that 
way.
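
The two aggregate methods in the list above could look roughly like this. It is a sketch under stated assumptions: `java.util.function.DoubleBinaryOperator` and `DoubleUnaryOperator` stand in for the `BinaryDoubleFunction` and `UnaryFunction` types named in the comment, `AggregateSketch` is a hypothetical dense vector rather than Mahout's AbstractVector, and returning 0.0 for an empty vector is an arbitrary choice.

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

// Hypothetical dense vector; the JDK functional interfaces are stand-ins for
// the BinaryDoubleFunction / UnaryFunction types mentioned in the comment.
final class AggregateSketch {
  private final double[] values;

  AggregateSketch(double... values) {
    this.values = values.clone();
  }

  // Map each element, then fold the mapped values with the aggregator.
  double aggregate(DoubleBinaryOperator aggregator, DoubleUnaryOperator map) {
    if (values.length == 0) {
      return 0.0; // arbitrary choice for the empty case
    }
    double result = map.applyAsDouble(values[0]);
    for (int i = 1; i < values.length; i++) {
      result = aggregator.applyAsDouble(result, map.applyAsDouble(values[i]));
    }
    return result;
  }

  // Combine matching elements of two vectors, then fold the combined values.
  double aggregate(AggregateSketch other, DoubleBinaryOperator aggregator,
                   DoubleBinaryOperator map) {
    if (other.values.length != values.length) {
      throw new IllegalArgumentException("vectors must have the same size");
    }
    if (values.length == 0) {
      return 0.0; // arbitrary choice for the empty case
    }
    double result = map.applyAsDouble(values[0], other.values[0]);
    for (int i = 1; i < values.length; i++) {
      result = aggregator.applyAsDouble(result, map.applyAsDouble(values[i], other.values[i]));
    }
    return result;
  }

  public static void main(String[] args) {
    AggregateSketch a = new AggregateSketch(1.0, 2.0, 3.0);
    AggregateSketch b = new AggregateSketch(4.0, 5.0, 6.0);
    double dot = a.aggregate(b, Double::sum, (x, y) -> x * y);        // inner product
    double l2 = Math.sqrt(a.aggregate(Double::sum, x -> x * x));      // Euclidean norm
    double maxNorm = a.aggregate(Math::max, Math::abs);               // max norm
    System.out.println(dot + " " + l2 + " " + maxNorm);
  }
}
```

Inner products and the usual norms then become one-liners on top of the two methods, which is the abstraction the bullet describes.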

 Using better primitives hash for sparse vector for performance gains
 

 Key: MAHOUT-165
 URL: https://issues.apache.org/jira/browse/MAHOUT-165
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
 Fix For: 0.3

 Attachments: colt.jar, mahout-165-trove.patch, 
 MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch


 In SparseVector, we need a primitives hash map for index and values. The 
 present implementation of this hash map is not as efficient as some of the 
 other implementations in non-Apache projects. 
 In an experiment, I found that, for get/set operations, the primitive hash of 
 Colt performs an order of magnitude better than OrderedIntDoubleMapping. 
 For iteration it is 2x slower, though. 
 Using Colt in SparseVector improved performance of canopy generation. For an 
 experimental dataset, the current implementation takes 50 minutes. Using 
 Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the 
 delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-11-12 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777089#action_12777089
 ] 

Ted Dunning commented on MAHOUT-165:



bq. pulling Writable off of the interface, so that not every impl is hooked 
into such a coupling to Hadoop, then wrapping it with a Writable wrapper / 
subclass to add that functionality

+1

Same thing should be done with row and column labels.
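
One way the Writable wrapper idea could be sketched in Java is shown below. To keep the example free of a Hadoop dependency, WritableSketch is a local stand-in that mirrors the write(DataOutput) and readFields(DataInput) methods of Hadoop's Writable; PlainVector, ArrayVector, VectorWritableSketch and the size-then-values wire format are all hypothetical names and choices, not the current Mahout classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in with the same two methods as Hadoop's Writable interface.
interface WritableSketch {
  void write(DataOutput out) throws IOException;

  void readFields(DataInput in) throws IOException;
}

// Hypothetical vector interface with no serialization concerns at all.
interface PlainVector {
  int size();

  double get(int index);
}

class ArrayVector implements PlainVector {
  private final double[] values;

  ArrayVector(double... values) {
    this.values = values.clone();
  }

  @Override
  public int size() {
    return values.length;
  }

  @Override
  public double get(int index) {
    return values[index];
  }
}

// The wrapper owns the serialization logic; the wrapped vector stays Hadoop-free.
class VectorWritableSketch implements WritableSketch {
  private PlainVector vector;

  VectorWritableSketch() {
    // empty constructor so an instance can be filled by readFields()
  }

  VectorWritableSketch(PlainVector vector) {
    this.vector = vector;
  }

  PlainVector get() {
    return vector;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    // assumed wire format: element count followed by each value
    out.writeInt(vector.size());
    for (int i = 0; i < vector.size(); i++) {
      out.writeDouble(vector.get(i));
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int size = in.readInt();
    double[] values = new double[size];
    for (int i = 0; i < size; i++) {
      values[i] = in.readDouble();
    }
    vector = new ArrayVector(values);
  }
}

class WritableDemo {
  // Serializes a vector to bytes and reads it back through the wrapper.
  static PlainVector roundTrip(PlainVector original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      new VectorWritableSketch(original).write(new DataOutputStream(bytes));
      VectorWritableSketch copy = new VectorWritableSketch();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy.get();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    PlainVector copy = roundTrip(new ArrayVector(1.5, -2.0, 0.0));
    System.out.println("size after round trip: " + copy.size());
  }
}
```

Row and column labels could presumably be handled the same way, with a second wrapper that carries the labels alongside the plain vector or matrix.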

Not sure how to handle matrices of indefinite dimension which are probably 
important for some of what we do.  Perhaps just declare them as very, very 
large in a wrapper.

bq. the double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map) 
and double aggregate(Vector other, BinaryDoubleFunction aggregator, 
BinaryDoubleFunction map) methods for abstracting away inner products and 
norms.  Not necessary, but very easily implemented in AbstractVector so that 
nobody needs to worry about these methods if they don't like programming that 
way.

These are very handy functions.  Row and/or column aggregator functions are also 
important.

Colt gets a big boost in speed by testing in the implementation for special 
combinations of these functional constructs.  That lets it implement dot and 
sum with bespoke code and avoid the function call overhead (with associated 
risk of the JVM not in-lining enough).
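
A rough Java sketch of that special-casing technique follows. It is not Colt's source: the Functions holder, the PLUS and MULT constants and the FastAggregate class are hypothetical, and the only idea taken from the description above is that well-known function objects are recognized by identity so that a hand-written loop can replace the generic one.

```java
interface BinaryDoubleFunction {
  double apply(double a, double b);
}

// Shared singleton function objects, so implementations can recognize them by identity.
final class Functions {
  static final BinaryDoubleFunction PLUS = (a, b) -> a + b;
  static final BinaryDoubleFunction MULT = (a, b) -> a * b;

  private Functions() {
  }
}

final class FastAggregate {
  private FastAggregate() {
  }

  static double aggregate(double[] x, double[] y,
                          BinaryDoubleFunction aggregator, BinaryDoubleFunction map) {
    if (x.length != y.length) {
      throw new IllegalArgumentException("array lengths differ");
    }
    if (x.length == 0) {
      return 0.0; // assumption: empty input aggregates to zero
    }
    // Special case: sum of products is a dot product, so use a plain loop
    // with no per-element function calls.
    if (aggregator == Functions.PLUS && map == Functions.MULT) {
      double sum = 0.0;
      for (int i = 0; i < x.length; i++) {
        sum += x[i] * y[i];
      }
      return sum;
    }
    // Generic path: two function calls per element.
    double result = map.apply(x[0], y[0]);
    for (int i = 1; i < x.length; i++) {
      result = aggregator.apply(result, map.apply(x[i], y[i]));
    }
    return result;
  }

  public static void main(String[] args) {
    double[] x = {1.0, 2.0, 3.0};
    double[] y = {4.0, 5.0, 6.0};
    System.out.println(aggregate(x, y, Functions.PLUS, Functions.MULT));
  }
}
```

The identity test only fires when callers pass the shared constants; an equivalent lambda written at the call site falls through to the generic path and still gives the same answer, just without the shortcut.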

Another big change is that Colt makes extensive use of view semantics.  I think 
that this is a really good idea, but it does differ a bit from what we have 
done so far.
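
To illustrate what view semantics means here, the Java sketch below has a sub-range of a vector share storage with its parent, so a write through the view is visible in the original. The VectorViewSketch class and its methods are hypothetical; the method name viewPart is borrowed from Colt's DoubleMatrix1D, but nothing else here reflects Colt's implementation.

```java
final class VectorViewSketch {
  private final double[] data; // shared backing storage, never copied
  private final int offset;
  private final int length;

  private VectorViewSketch(double[] data, int offset, int length) {
    this.data = data;
    this.offset = offset;
    this.length = length;
  }

  // Wraps an existing array without copying it.
  static VectorViewSketch wrap(double[] data) {
    return new VectorViewSketch(data, 0, data.length);
  }

  int size() {
    return length;
  }

  double get(int index) {
    checkIndex(index);
    return data[offset + index];
  }

  void set(int index, double value) {
    checkIndex(index);
    data[offset + index] = value;
  }

  // Returns a view of 'width' elements starting at 'index'; no data is copied.
  VectorViewSketch viewPart(int index, int width) {
    if (index < 0 || width < 0 || index + width > length) {
      throw new IndexOutOfBoundsException("view out of range");
    }
    return new VectorViewSketch(data, offset + index, width);
  }

  private void checkIndex(int index) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException("index " + index);
    }
  }

  public static void main(String[] args) {
    VectorViewSketch base = wrap(new double[] {1.0, 2.0, 3.0, 4.0, 5.0});
    VectorViewSketch part = base.viewPart(1, 3);
    part.set(0, 20.0); // writes through to base element 1
    System.out.println(base.get(1));
  }
}
```

The difference from what we have done so far would show up wherever callers expect an operation to hand back an independent copy.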







Re: 0.2 status

2009-11-12 Thread Grant Ingersoll
OK, I think the java mail thing is resolved.  Let me try building the artifacts 
again.


On Nov 12, 2009, at 6:22 AM, Sean Owen wrote:

 It all sounds fine to me.
 
 On Thu, Nov 12, 2009 at 9:54 AM, Isabel Drost isa...@apache.org wrote:
 
 Adding and revising a little:
 



[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777173#action_12777173
 ] 

Sean Owen commented on MAHOUT-198:
--

I ran into one more unit test failure. It's due to directly comparing two 
double values and for some reason it fails only on the command line on my 
computer. I added an 'epsilon' param to the unit test and it's fine now. We 
should prolly make that common practice wherever the test compares doubles, 
like I've done in my own unit tests, but it's pretty small.
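
For reference, the kind of epsilon comparison meant here can be sketched in Java as below. DoubleAssert, assertEqualsWithin and the EPSILON value are hypothetical and chosen for illustration; in an actual JUnit test the three-argument assertEquals(expected, actual, delta) overload serves the same purpose.

```java
final class DoubleAssert {
  // Assumed tolerance; the right value depends on the computation under test.
  static final double EPSILON = 1.0e-9;

  private DoubleAssert() {
  }

  // Passes when the two values differ by at most epsilon.
  static void assertEqualsWithin(double expected, double actual, double epsilon) {
    if (Double.compare(expected, actual) == 0) {
      return; // exact match, including matching infinities and NaN
    }
    if (!(Math.abs(expected - actual) <= epsilon)) {
      throw new AssertionError("expected " + expected + " but was " + actual
          + " (epsilon " + epsilon + ")");
    }
  }

  public static void main(String[] args) {
    double sum = 0.1 + 0.2;
    // Exact comparison is brittle: the sum is not exactly 0.3 in binary floating point.
    System.out.println("exactly equal: " + (sum == 0.3));
    // Tolerant comparison accepts the tiny rounding difference.
    assertEqualsWithin(0.3, sum, EPSILON);
    System.out.println("equal within epsilon");
  }
}
```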

Shall I commit that change to name the standard repo with a repository tag? I 
needed that too.

I'm still getting some weird build errors but guessing they are my own 
environment's fault.

 Cleanup pom, remove lib dependencies, etc.
 --

 Key: MAHOUT-198
 URL: https://issues.apache.org/jira/browse/MAHOUT-198
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 0.2

 Attachments: mahout-198.core-lib.patch


 This patch cleans up the poms to not do install.  It removes the core/lib 
 directory.  I have published the necessary artifacts to our Mahout Maven repo 
 already, so they should be publicly available.
 See http://cwiki.apache.org/confluence/display/MAHOUT/ThirdPartyDependencies 
 and 
 http://www.lucidimagination.com/search/document/6f026182150f0f50/dependencies_outside_maven_central_was_oh_joy




[VOTE] Release 0.2

2009-11-12 Thread Grant Ingersoll
Please vote on releasing the artifacts at:
https://repository.apache.org/content/repositories/orgapachemahout-002/org/apache/mahout/

KEYS file is in the Mahout root trunk.

Things to do before voting:

1. Download and verify signatures on all the artifacts.
2. Try out the tests, examples, etc.
3. Try it out in any apps that you have.
4. See the Apache pages on releases and see what else I'm missing.
5. Others?

Re: [VOTE] Release 0.2

2009-11-12 Thread Grant Ingersoll
Hmm, I'm on a Mac and running on the command line.  What version of OS X and 
what JVM?

On Nov 12, 2009, at 5:49 PM, Sean Owen wrote:

 I still see that test failure I mentioned, but only happens on the
 command line (and perhaps only on a Mac). It is to do with a double
 value being compared for exact equality. I fixed it but it's hardly
 a blocker. Otherwise +1
 




Re: [VOTE] Release 0.2

2009-11-12 Thread Ted Dunning
 And which mvn?

On Thu, Nov 12, 2009 at 5:07 PM, Grant Ingersoll gsing...@apache.org wrote:

 Hmm, I'm on a Mac and running on the command line.  What version of OS X
 and what JVM?

 On Nov 12, 2009, at 5:49 PM, Sean Owen wrote:

  I still see that test failure I mentioned, but only happens on the
  command line (and perhaps only on a Mac). It is to do with a double
  value being compared for exact equality. I fixed it but it's hardly
  a blocker. Otherwise +1
 





-- 
Ted Dunning, CTO
DeepDyve


[jira] Updated: (MAHOUT-199) Parent POM missing in public maven repository

2009-11-12 Thread Matthias Friedrich (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Friedrich updated MAHOUT-199:
--

Description: 
I wanted to play with Taste and thus created a Maven project that referenced 
the mahout-core artifact. But since mahout-parent isn't deployed, I couldn't 
build my project. I had to download the 0.1 release and install the parent POM 
in my local repository (cd mahout-0.1/maven && mvn install).

Steps to reproduce:

$ mvn archetype:create -DgroupId=de.mafr.demo -DartifactId=MahoutDemo
$ cd MahoutDemo
$ vi MahoutDemo
add dependencies listed below
$ mvn package

The dependency section I added:

<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.1</version>
</dependency>

Could you please deploy the parent POM? It would make it a lot easier to play 
with Mahout/Taste. Thanks in advance for your help!

  was:
I wanted to play with Taste and thus created a Maven project that referenced 
the mahout-core artifact. But since mahout-parent isn't deployed, I couldn't 
build my project. I had to download the 0.1 release and install the parent POM 
in my local repository (cd mahout-0.1/maven && mvn install).

Steps to reproduce:

$ mvn archetype:create -DgroupId=de.mafr.demo -DartifactId=MahoutDemo
$ cd MahoutDemo
$ vi MahoutDemo
# add dependencies listed below
$ mvn package

The dependency section I added:

<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.1</version>
</dependency>

Could you please deploy the parent POM? It would make it a lot easier to play 
with Mahout/Taste. Thanks in advance for your help!


 Parent POM missing in public maven repository
 -

 Key: MAHOUT-199
 URL: https://issues.apache.org/jira/browse/MAHOUT-199
 Project: Mahout
  Issue Type: Wish
Affects Versions: 0.1
 Environment: Maven 2.0.9
Reporter: Matthias Friedrich
Priority: Minor

