date:20091112

[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776921#action_12776921
 ] 

Sean Owen commented on MAHOUT-103:
--

Re-post an updated patch and I'm happy to give my comments on it. The more the 
merrier. If it's basically sound I'd like to mention it in the forthcoming book 
which I'm writing now.

I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop the 
rating. The framework can do this automatically too if you like in the 
DataModel.

 Co-occurence based nearest neighbourhood
 

 Key: MAHOUT-103
 URL: https://issues.apache.org/jira/browse/MAHOUT-103
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Ankur
Assignee: Ankur
 Attachments: jira-103.patch


 Nearest neighborhood type queries for users/items can be answered efficiently 
 and effectively by analyzing the co-occurrence model of a user/item w.r.t 
 another. This patch aims at providing an implementation for answering such 
 queries based upon simple co-occurrence counts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-11-12 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776924#action_12776924
]

Sean Owen commented on MAHOUT-165:
--

IntDoubleHash right? We could look at that, but I thought the status here was
that Colt worked just fine and fast. Perhaps I'm missing something, but I don't see a
remaining issue with using (part of) Colt.

I somehow strongly suspect we will benefit from not reinventing a wheel here,
and whatever we need can be done with Colt, plus perhaps some contributed
changes, plus a custom implementation here and there.

+1 for Whatever Is Needed To Use Colt?

Using better primitives hash for sparse vector for performance gains

Key: MAHOUT-165
URL: https://issues.apache.org/jira/browse/MAHOUT-165
Project: Mahout
Issue Type: Improvement
Components: Matrix
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
Fix For: 0.3

Attachments: colt.jar, mahout-165-trove.patch,
MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch

In SparseVector, we need a primitive hash map for indices and values. The
present implementation of this hash map is not as efficient as some of the
other implementations in non-Apache projects.
In an experiment, I found that, for get/set operations, the primitive hash of
Colt performs an order of magnitude better than OrderedIntDoubleMapping.
For iteration it is 2x slower, though.
Using Colt in SparseVector improved performance of canopy generation. For an
experimental dataset, the current implementation takes 50 minutes. Using
Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the
running time.
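
For readers unfamiliar with the idea, here is a minimal sketch of a primitive int-to-double hash map using open addressing with linear probing. It is only an illustration of why such a map avoids boxing on get/set; the class name IntDoubleOpenHashMap and all of its methods are hypothetical, and this is neither Colt's code nor Mahout's OrderedIntDoubleMapping.

```java
import java.util.Arrays;

/** Hypothetical open-addressing int -> double map; not Colt's or Mahout's code. */
final class IntDoubleOpenHashMap {
    private static final int FREE = Integer.MIN_VALUE; // sentinel marking an unused slot

    private int[] keys;
    private double[] values;
    private int size;

    IntDoubleOpenHashMap(int expectedSize) {
        int capacity = 16;
        while (capacity < expectedSize * 2) {
            capacity <<= 1; // keep capacity a power of two, load factor <= 0.5
        }
        keys = new int[capacity];
        values = new double[capacity];
        Arrays.fill(keys, FREE);
    }

    /** Returns the stored value, or 0.0 when the index is absent (the sparse default). */
    double get(int index) {
        int slot = find(index);
        return keys[slot] == index ? values[slot] : 0.0;
    }

    void set(int index, double value) {
        if (index == FREE) {
            throw new IllegalArgumentException("reserved key");
        }
        int slot = find(index);
        if (keys[slot] != index) {
            keys[slot] = index;
            size++;
        }
        values[slot] = value;
        if (size * 2 > keys.length) {
            grow();
        }
    }

    int size() {
        return size;
    }

    // Linear probing: returns the slot holding key, or the first free slot.
    private int find(int key) {
        int mask = keys.length - 1;
        int slot = mix(key) & mask;
        while (keys[slot] != FREE && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    // Cheap bit mixer so that regularly spaced indices spread over the table.
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return h;
    }

    // Doubles the table and re-inserts every live entry.
    private void grow() {
        int[] oldKeys = keys;
        double[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new double[keys.length];
        Arrays.fill(keys, FREE);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != FREE) {
                set(oldKeys[i], oldValues[i]);
            }
        }
    }
}
```

The two parallel primitive arrays are what make get/set cheap. Iterating in index order would need a separate sort, which is one plausible reason an ordered mapping can iterate faster.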


Re: 0.2 status

2009-11-12 Thread deneche abdelhakim

Please use "Decision Forests" instead of "Random Forests".



On Thu, Nov 12, 2009 at 9:01 AM, Robin Anil robin.a...@gmail.com wrote:
 Please edit/add stuff.

 Robin


 ==

 Apache Mahout 0.2 has been released and is now available for public
 download. Apache Mahout is a subproject of Apache Lucene with the goal
 of delivering scalable machine learning algorithm implementations
 under the Apache license.
 link
 Mahout is a machine learning library meant to scale to the size of
 data we manage today. Built on top of the powerful map/reduce paradigm
 of Apache Hadoop project, Mahout lets you run popular machine learning
 methods like clustering, collaborative filtering, classification over
 Terabytes of data over thousands of computers.

 The complete changelist can be found here:
 http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278

 New Mahout 0.2 features include

 - Major performance enhancements in Collaborative Filtering,
 Classification and Clustering
 - New: Latent Dirichlet Allocation(LDA) implementation for topic modelling
 - New: Frequent Itemset Mining for mining top-k patterns from a list
 of transactions
 - New: Random Forests implementation for Decision Tree classification
 (In Memory & Partial Data)
 - New: HBase storage support for Naive Bayes model building and classification
 - New: Generation of vectors from Text documents for use with Mahout 
 Algorithms
 - Performance improvements in various Vector implementations
 - Tons of bug fixes and code cleanup



 On Thu, Nov 12, 2009 at 9:06 AM, Grant Ingersoll gsing...@apache.org wrote:

 Anyone care to write up a release announcement?  Here's Solr's: 
 http://lucene.grantingersoll.com/2009/11/10/apache-solr-1-4-0-offically-released/

 I've cleaned up the build quite a bit and am now testing preparing the 
 artifacts w/ the much simpler build (no more installing third party libs, 
 they are all up under o.a.mahout in the Maven repo).  I'd like to have 
 everything ready to go once the artifacts are put up for a vote.

 Thanks,
 Grant

Re: 0.2 status

2009-11-12 Thread Isabel Drost


Adding and revising a little:

Apache Mahout 0.2 has been released and is now available for public
download at http://www.apache.org/dyn/closer.cgi/lucene/mahout

Up to date maven artifacts can be found in the Apache repository at
https://repository.apache.org/content/repositories/releases/org/apache/mahout/


Apache Mahout is a subproject of Apache Lucene with the goal
of delivering scalable machine learning algorithm implementations
under the Apache license. http://www.apache.org/licenses/LICENSE-2.0

 Mahout is a machine learning library meant to scale to the size of
 data we manage today. Built on top of the powerful map/reduce
 paradigm of Apache Hadoop project, Mahout lets you run popular
 machine learning methods like clustering, collaborative filtering,
 classification over Terabytes of data over thousands of computers.

 - We may want to emphasize that using Mahout makes sense also for
 those people that do not have clusters with thousands of nodes?

Mahout is a machine learning library meant to scale: Scale in terms of
community to support anyone interested in using machine learning. Scale
in terms of business by providing the library under a commercially
friendly, free software license. Scale in terms of computation to the
size of data we manage today.

Built on top of the powerful map/reduce paradigm of the Apache Hadoop
project, Mahout lets you solve popular machine learning problem
settings like clustering, collaborative filtering and classification
over Terabytes of data over thousands of computers.

Implemented with scalability in mind, the latest release brings many
performance optimizations so that even in a single node setup the
library performs well.

 - As mentioned earlier by Grant, we do need performance benchmarks at
 least for the next release to prove that.


The complete changelist can be found here:
http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278

New Mahout 0.2 features include
 
- Major performance enhancements in Collaborative Filtering,
Classification and Clustering
- New: Latent Dirichlet Allocation(LDA) implementation for topic
modelling
- New: Frequent Itemset Mining for mining top-k patterns from a list
of transactions
- New: Decision Forests implementation for Decision Tree classification
(In Memory & Partial Data)
- New: HBase storage support for Naive Bayes model building and
classification
- New: Generation of vectors from Text documents for use with Mahout
Algorithms
- Performance improvements in various Vector implementations
- Tons of bug fixes and code cleanup

Getting started: New to Mahout? 

1) Download Mahout at http://www.apache.org/dyn/closer.cgi/lucene/mahout
2) Check out the Quick start:
http://cwiki.apache.org/MAHOUT/quickstart.html 

3) Read the Mahout Wiki: http://cwiki.apache.org/MAHOUT
4) Join the community by subscribing to mahout-u...@lucene.apache.org
5) Give back: http://www.apache.org/foundation/getinvolved.html
6) Consider adding yourself to the powered by Wiki page:
http://cwiki.apache.org/MAHOUT/poweredby.html

For more information on Apache Mahout, see
http://lucene.apache.org/mahout


Additional comment: I suppose I will copy this over to my personal
blog once the release is out. I would like to invite those interested
in or using Mahout to do so as well.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-11-12 Thread Jake Mannix (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776945#action_12776945
]

Jake Mannix commented on MAHOUT-165:

Well, I've always had good luck with Colt, but at least Ted seemed to feel that
Colt was no longer state of the art, but maybe he can chime in and elaborate.


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Ankur (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776939#action_12776939
 ] 

Ankur commented on MAHOUT-103:
--

Re-post an updated patch 

Sure I'll have the updated code coming by early next week.

If it's basically sound I'd like to mention it 

+10, The more people know about it the better chances it has of being used :-)  

I use the GroupLens, Jester, Netflix data sets regularly. Indeed, just drop 
the rating ...

Simply dropping the rating might introduce too much noise. I was thinking of 
keeping only those that have ratings > 2.5 (or 2 to be more liberal). 
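
To make the thresholding idea concrete, here is a rough sketch of co-occurrence counting that keeps only ratings above a cut-off. This is not the code in jira-103.patch; the class ThresholdedCooccurrence, the Rating type and the count method are hypothetical names for illustration, and the sketch assumes each user rates an item at most once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of co-occurrence counting with a rating cut-off. */
final class ThresholdedCooccurrence {

    /** One (user, item, rating) observation; an assumed record type for this sketch. */
    static final class Rating {
        final long userId;
        final long itemId;
        final double value;

        Rating(long userId, long itemId, double value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    /**
     * For each ordered item pair (a, b), counts the users who rated both a and b
     * strictly above minRating. Passing Double.NEGATIVE_INFINITY drops the rating
     * values entirely and counts plain co-occurrence.
     */
    static Map<Long, Map<Long, Integer>> count(List<Rating> ratings, double minRating) {
        // Group the items that survive the cut-off by user.
        Map<Long, List<Long>> itemsByUser = new HashMap<>();
        for (Rating r : ratings) {
            if (r.value > minRating) {
                itemsByUser.computeIfAbsent(r.userId, k -> new ArrayList<>()).add(r.itemId);
            }
        }
        // Every pair of distinct items kept by the same user co-occurs once.
        Map<Long, Map<Long, Integer>> counts = new HashMap<>();
        for (List<Long> items : itemsByUser.values()) {
            for (Long a : items) {
                for (Long b : items) {
                    if (!a.equals(b)) {
                        counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }
}
```

Running the same data with a cut-off of 2.5 and with Double.NEGATIVE_INFINITY gives a quick way to compare the two positions in this thread on a real data set.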


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776951#action_12776951
]

Sean Owen commented on MAHOUT-103:
--

That last point is interesting. Another school of thought is that rating
something, even negatively, suggests you have a closer association to that
thing than to the millions of other things you've never heard of.

Let's say you rate Bach a 5 and Brahms a 4 and Mendelssohn a 1.5. Would you
rather recommend a Mendelssohn recording to this person, or death metal?

This is my understanding of the intuition I've gotten from Ted, and seems to
bear out somewhat in practice, that ratings have a lot less info than one would
think.

Well it's obviously something one can evaluate within the framework with the
evaluator code to decide for sure.


Re: 0.2 status

2009-11-12 Thread Sean Owen

It all sounds fine to me.

On Thu, Nov 12, 2009 at 9:54 AM, Isabel Drost isa...@apache.org wrote:

 Adding and revising a little:

Re: Dependencies outside Maven central (Was: Oh joy)

2009-11-12 Thread Grant Ingersoll

That's weird, that is the default Maven repository, I wouldn't think  
you would need to add it.


On Nov 12, 2009, at 5:30 AM, Isabel Drost wrote:


On Wed, 11 Nov 2009 18:23:50 -0500
Grant Ingersoll gsing...@apache.org wrote:


https://issues.apache.org/jira/browse/MAHOUT-198 tracks this.  I am
committing now.  Please check out, delete your ~/.m2/repository/org
directory and try mvn clean install!


Trashed my local maven repo, checked out and built again - it did not
find the lucene-2.9.1 release. Adding the following repository to the
pom fixed that problem for me:

Index: maven/pom.xml
===================================================================
--- maven/pom.xml   (revision 835320)
+++ maven/pom.xml   (working copy)
@@ -76,6 +76,12 @@
       <layout>default</layout>
     </repository>
     <repository>
+      <id>maven2-repository.maven.org</id>
+      <name>Maven.org Repository for Maven</name>
+      <url>http://repo1.maven.org/maven2</url>
+      <layout>default</layout>
+    </repository>
+    <repository>
       <id>Apache snapshots</id>
       <url>http://people.apache.org/maven-snapshot-repository</url>
       <snapshots>

After that the build runs smoothly for me. Big thanks to you, Grant, for
resolving all those issues.

Isabel


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search

[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776962#action_12776962
]

Sean Owen commented on MAHOUT-198:
--

The only glitch I see is that the Java Mail 1.4 pom is invalid. It's a doc like
this:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a
href="http://download.java.net/maven/1/javax.mail/poms/mail-1.4.pom">here</a>.</p>
<hr>
<address>Apache Server at maven-repository.dev.java.net Port 443</address>
</body></html>

Obviously that's not within our control. I tried manually copying the updated
pom file, though I then still have problems with other dependencies.

Anyone seeing this? I wonder if there is a way to see why Java Mail is included
as a dependency? We shouldn't have anything to do with it directly.

Cleanup pom, remove lib dependencies, etc.
--

Key: MAHOUT-198
URL: https://issues.apache.org/jira/browse/MAHOUT-198
Project: Mahout
Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Fix For: 0.2

Attachments: mahout-198.core-lib.patch

This patch cleans up the poms to not do install. It removes the core/lib
directory. I have published the necessary artifacts to our Mahout Maven repo
already, so they should be publicly available.
See http://cwiki.apache.org/confluence/display/MAHOUT/ThirdPartyDependencies
and
http://www.lucidimagination.com/search/document/6f026182150f0f50/dependencies_outside_maven_central_was_oh_joy


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776966#action_12776966
]

Ankur commented on MAHOUT-103:
--

In that case dropping ratings might not be such a good idea and may lead to bad
results. Consider the following movies that a user might have seen with the
scores

Matrix - 4.5
Matrix Reloaded - 2.5
Matrix Revolutions - 2

Assuming that a lot of people have watched these movies and didn't like the
subsequent two versions, they still will get high similarity scores w.r.t
Matrix going purely by co-occurrence. IMHO, that leaves us with the following
2 alternatives :-

1. Add the ratings when counting co-occurrence and hope that better ones will
stand out even if they co-occur less.
2. Apply a Re-scorer that re-ranks the similar items for a given item
based on their average scores.

Point 1 is something I am thinking of trying out.
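
One possible reading of alternative 1, sketched under assumptions: instead of adding 1 for each co-occurrence, add the user's rating of the co-occurring item, so that well-liked items accumulate a higher score than merely frequent ones. The class RatingWeightedCooccurrence and its weigh method are hypothetical names, and other weightings (for example the product of both ratings) would be equally valid interpretations.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: co-occurrence scores weighted by the rating of the co-occurring item. */
final class RatingWeightedCooccurrence {

    /**
     * @param userRatings userId -> (item -> rating)
     * @return item a -> (item b -> sum of the ratings given to b by users who also rated a)
     */
    static Map<String, Map<String, Double>> weigh(Map<String, Map<String, Double>> userRatings) {
        Map<String, Map<String, Double>> scores = new HashMap<>();
        for (Map<String, Double> prefs : userRatings.values()) {
            for (String a : prefs.keySet()) {
                for (Map.Entry<String, Double> b : prefs.entrySet()) {
                    if (!a.equals(b.getKey())) {
                        scores.computeIfAbsent(a, k -> new HashMap<>())
                              .merge(b.getKey(), b.getValue(), Double::sum);
                    }
                }
            }
        }
        return scores;
    }
}
```

With the three Matrix ratings above, each user contributes 4.5 to the score of Matrix as a neighbour of Matrix Reloaded, but only 2.5 to the score of Matrix Reloaded as a neighbour of Matrix, so the asymmetry in ratings shows up in the scores.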


[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

2009-11-12 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776986#action_12776986
]

Sean Owen commented on MAHOUT-103:
--

What's the problem in this example? Two people that have both seen all three
Matrix films are probably similar. All the more so if they've rated the first
one highly and the other two poorly. You'd correctly identify them as similar
with or without ratings here.

The issue, I suppose, comes up when you encounter someone who didn't like the
first one and liked the other two (strange, I know). Without pref values, we'd
draw the same conclusion -- they have some similarity. With pref values, most
metrics would say they are very dissimilar.

I actually think that's the wrong conclusion! The fact that two people bothered
to watch all three says much more about their similarities than the variance in
ratings says about their differences. I'd still guess they're sorta-similar,
and metrics without pref values would tend to draw the more correct conclusion.
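
The contrast can be illustrated with two textbook measures, as a sketch only: a Tanimoto (Jaccard) coefficient over the sets of rated items, which ignores pref values, and a Pearson correlation over the co-rated items. The class SimilarityComparison and its methods are hypothetical and are not the framework's own similarity classes.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical comparison of a pref-free and a pref-based user similarity. */
final class SimilarityComparison {

    /** Tanimoto (Jaccard) coefficient over the sets of rated items; rating values are ignored. */
    static double tanimoto(Map<String, Double> a, Map<String, Double> b) {
        Set<String> common = new HashSet<>(a.keySet());
        common.retainAll(b.keySet());
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    /** Pearson correlation over co-rated items; NaN when it is undefined. */
    static double pearson(Map<String, Double> a, Map<String, Double> b) {
        Set<String> common = new HashSet<>(a.keySet());
        common.retainAll(b.keySet());
        int n = common.size();
        if (n < 2) {
            return Double.NaN;
        }
        double meanA = 0.0;
        double meanB = 0.0;
        for (String item : common) {
            meanA += a.get(item);
            meanB += b.get(item);
        }
        meanA /= n;
        meanB /= n;
        double cov = 0.0;
        double varA = 0.0;
        double varB = 0.0;
        for (String item : common) {
            double da = a.get(item) - meanA;
            double db = b.get(item) - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0.0 ? Double.NaN : cov / denom;
    }
}
```

For one user who rated the three films 4.5, 2.5, 2 and another who rated them 2, 4.5, 4 (made-up numbers for the second user), the Tanimoto coefficient is 1 while the Pearson correlation is negative, which is exactly the disagreement described above.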

Of course there's no one right answer, and we can easily construct situations
where throwing out pref values indeed hurts the result. I'm only asserting that
it's entirely possible, in real data sets, for ratings to *hurt* on the whole.

Let's start by adding the basic approach and then keep going to look at
variations. I at least have some global knowledge of how the framework is set
up and could help design in these variations in a way that's consistent with
the framework.


[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12776993#action_12776993
 ] 

Sean Owen commented on MAHOUT-198:
--

OK, after another wipe of my .m2 directory, this went away. Then it complained 
about many missing artifacts, some quite basic-looking ones.

I also added that repository stanza -- think we should check that in, cool?

And ran back into the mail .pom issue.

After another wipe of .m2, I was back to just missing the Mahout 0.3 SNAPSHOT 
artifact. OK, that's normal right? So I have to mvn install rather than mvn 
compile still. That works.


[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Grant Ingersoll (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777003#action_12777003
 ] 

Grant Ingersoll commented on MAHOUT-198:


Yep, still require mvn install

As for Java Mail, do we know which 3rd party lib has the dependency on Java 
Mail?  We may need to put in an exclusion.


[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Drew Farris (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777004#action_12777004
 ] 

Drew Farris commented on MAHOUT-198:


From the output of 'mvn dependency:tree' it appears that the culprit is log4j 
1.2.15


[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Grant Ingersoll (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777005#action_12777005
 ] 

Grant Ingersoll commented on MAHOUT-198:


OK, I have a fix for the mail thing.  Checking in shortly.


[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Drew Farris (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777023#action_12777023
 ] 

Drew Farris commented on MAHOUT-198:


works fine for me.


[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-11-12 Thread Ted Dunning (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777050#action_12777050
]

Ted Dunning commented on MAHOUT-165:

My issues with Colt (which I used for quite some time) were probably either
remediable or irrelevant.

The remediable problem was that the API was opaque for newcomers and very
difficult to extend with new matrix implementations. If we take Colt as a
starting point and fix some of the extension and opacity issues, then this
problem goes away.

My second issue is that more modern libraries like MTJ can achieve about 4x the
raw performance of Colt. As Grant rightly points out, that probably doesn't
matter to us right away since the goal here is scaling rather than raw hot-iron
performance on a single box. Moreover, as Grant also points out, we will have
a pluggable interface which should allow us to switch if the commons math guys
ever come around.


[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-11-12 Thread Jake Mannix (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777082#action_12777082
]

Jake Mannix commented on MAHOUT-165:

Ok then, let's try out Colt, unless we have a more permissive policy in here
about MTJ than the c-math guys have: they didn't want MTJ because using it
required either including a jar file of the output of f2j translations of some
Fortran code... which is ok for us as long as it's apache-compatible, since we
don't have the hard no external dependencies requirement that they have.

What Shashi wrote before was this, when he attached the modified colt jar:

bq. Jar for Colt after removing the LGPL code of hep.aida and the dependent
classes. The classes in colt.matrix.* are removed as they require hep.aida.

I actually stripped the hep.aida.* dependencies out of even the colt.matrix.*
classes in Colt on my local gitrepo, which keeps pretty much all of the
functionality intact. I can make an updated patch which has the full source
code for that, so that we can include it instead of just having a jar.

Do we want to try comparing both MTJ and Colt?

Also: do we think our linear API is complete enough to solidify on as a
wrapper for whatever is plugged in underneath? Some of the changes which have
been discussed in other tickets and on the list are

* pulling Writable off of the interface, so that not every impl is hooked into
such a coupling to Hadoop, then wrapping it with a Writable wrapper / subclass
to add that functionality
* the double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map) and
double aggregate(Vector other, BinaryDoubleFunction aggregator,
BinaryDoubleFunction map) methods for abstracting away inner products and
norms. Not necessary, but very easily implemented in AbstractVector so that
nobody needs to worry about these methods if they don't like programming that
way.
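
A rough sketch of how the two aggregate methods could abstract inner products and norms. The method signatures follow the ones quoted in the bullet above, but the interfaces are re-declared here so the example is self-contained, and SketchVector is a hypothetical dense stand-in for AbstractVector, not the actual Mahout class.

```java
/** Assumed shape of the unary function mentioned above. */
interface UnaryFunction {
    double apply(double x);
}

/** Assumed shape of the binary function mentioned above. */
interface BinaryDoubleFunction {
    double apply(double a, double b);
}

/** Hypothetical dense vector showing the two aggregate methods. */
final class SketchVector {
    private final double[] values;

    SketchVector(double... values) {
        this.values = values.clone();
    }

    /** Folds map(x_i) over all elements using aggregator; 0.0 for an empty vector. */
    double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map) {
        if (values.length == 0) {
            return 0.0;
        }
        double result = map.apply(values[0]);
        for (int i = 1; i < values.length; i++) {
            result = aggregator.apply(result, map.apply(values[i]));
        }
        return result;
    }

    /** Folds map(x_i, y_i) over paired elements of this vector and other. */
    double aggregate(SketchVector other, BinaryDoubleFunction aggregator, BinaryDoubleFunction map) {
        if (other.values.length != values.length) {
            throw new IllegalArgumentException("size mismatch");
        }
        if (values.length == 0) {
            return 0.0;
        }
        double result = map.apply(values[0], other.values[0]);
        for (int i = 1; i < values.length; i++) {
            result = aggregator.apply(result, map.apply(values[i], other.values[i]));
        }
        return result;
    }

    /** Inner product: sum of element-wise products. */
    double dot(SketchVector other) {
        return aggregate(other, (a, b) -> a + b, (x, y) -> x * y);
    }

    /** Euclidean norm: square root of the sum of squares. */
    double norm2() {
        return Math.sqrt(aggregate((a, b) -> a + b, x -> x * x));
    }
}
```

Other reductions fall out the same way, for example a maximum absolute value via aggregate((a, b) -> Math.max(a, b), x -> Math.abs(x)).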


[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-11-12 Thread Ted Dunning (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777089#action_12777089
]

Ted Dunning commented on MAHOUT-165:

bq. pulling Writable off of the interface, so that not every impl is hooked
into such a coupling to Hadoop, then wrapping it with a Writable wrapper /
subclass to add that functionality

Same thing should be done with row and column labels.

Not sure how to handle matrices of indefinite dimension which are probably
important for some of what we do. Perhaps just declare them as very, very
large in a wrapper.

bq. the double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map)
and double aggregate(Vector other, BinaryDoubleFunction aggregator,
BinaryDoubleFunction map) methods for abstracting away inner products and
norms. Not necessary, but very easily implemented in AbstractVector so that
nobody needs to worry about these methods if they don't like programming that
way.

These are very handy functions. Row and/or column aggregator functions are also
important.

Colt gets a big boost in speed by testing in the implementation for special
combinations of these functional constructs. That lets it implement dot and
sum with bespoke code and avoid the function call overhead (with associated
risk of the JVM not in-lining enough).

Another big change is that Colt makes extensive use of view semantics. I think
that this is a really good idea, but it does differ a bit from what we have
done so far.


Re: 0.2 status

2009-11-12 Thread Grant Ingersoll

OK, I think the java mail thing is resolved.  Let me try building the artifacts 
again.


On Nov 12, 2009, at 6:22 AM, Sean Owen wrote:

 It all sounds fine to me.
 
 On Thu, Nov 12, 2009 at 9:54 AM, Isabel Drost isa...@apache.org wrote:
 
 Adding and revising a little:

[jira] Commented: (MAHOUT-198) Cleanup pom, remove lib dependencies, etc.

2009-11-12 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777173#action_12777173
]

Sean Owen commented on MAHOUT-198:
--

I ran into one more unit test failure. It's due to directly comparing two
double values and for some reason it fails only on the command line on my
computer. I added an 'epsilon' param to the unit test and it's fine now. We
should prolly make that common practice wherever the test compares doubles,
like I've done in my own unit tests, but it's pretty small.
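
For readers unfamiliar with the 'epsilon' approach, here is a minimal Java sketch of a tolerance-based comparison. The helper names are hypothetical and this is not the actual change made to the failing test. JUnit's assertEquals(double expected, double actual, double delta) offers the same kind of check; the plain-Java version below just avoids any dependency.

```java
/** Illustrative only: compare doubles within a tolerance instead of exactly. */
final class DoubleCompareSketch {

    /** True when the two values differ by no more than epsilon. */
    static boolean approxEquals(double expected, double actual, double epsilon) {
        // A NaN on either side makes the difference NaN, so this returns false.
        return Math.abs(expected - actual) <= epsilon;
    }

    /** Throws AssertionError when the values differ by more than epsilon. */
    static void assertApproxEquals(double expected, double actual, double epsilon) {
        if (!approxEquals(expected, actual, epsilon)) {
            throw new AssertionError(
                "expected " + expected + " but was " + actual + " (epsilon " + epsilon + ")");
        }
    }
}
```

For instance, 0.1 + 0.2 == 0.3 is false in Java because of binary rounding, while approxEquals(0.3, 0.1 + 0.2, 1e-9) is true. A suitable epsilon depends on the magnitude of the values under test; 1e-9 here is only an example.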

Shall I commit that change to name the standard repo with a repository tag? I
needed that too.

I'm still getting some weird build errors but guessing they are my own
environment's fault.

Cleanup pom, remove lib dependencies, etc.
--

Key: MAHOUT-198
URL: https://issues.apache.org/jira/browse/MAHOUT-198
Project: Mahout
Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Fix For: 0.2

Attachments: mahout-198.core-lib.patch


[VOTE] Release 0.2

2009-11-12 Thread Grant Ingersoll

Please vote on releasing the artifacts at:
https://repository.apache.org/content/repositories/orgapachemahout-002/org/apache/mahout/

KEYS file is in the Mahout root trunk.

Things to do before voting:

1. Download and verify signatures on all the artifacts.
2. Try out the tests, examples, etc.
3. Try it out in any apps that you have.
4. See the Apache pages on releases and see what else I'm missing.
5. Others?

Re: [VOTE] Release 0.2

2009-11-12 Thread Grant Ingersoll

Hmm, I'm on a Mac and running on the command line.  What version of OS X and 
what JVM?

On Nov 12, 2009, at 5:49 PM, Sean Owen wrote:

 I still see that test failure I mentioned, but only happens on the
 command line (and perhaps only on a Mac). It is to do with a double
 value being compared for exact equality. I fixed it but it's hardly
 a blocker. Otherwise +1
 
 On Thu, Nov 12, 2009 at 9:58 PM, Grant Ingersoll gsing...@apache.org wrote:
 Please vote on releasing the artifacts at:
 https://repository.apache.org/content/repositories/orgapachemahout-002/org/apache/mahout/
 
 KEYS file is in the Mahout root trunk.
 
 Things to do before voting:
 
 1. Download and verify signatures on all the artifacts.
 2. Try out the tests, examples, etc.
 3. Try it out in any apps that you have.
 4. See the Apache pages on releases and see what else I'm missing.
 5. Others?

Re: [VOTE] Release 0.2

2009-11-12 Thread Ted Dunning

 And which mvn?

On Thu, Nov 12, 2009 at 5:07 PM, Grant Ingersoll gsing...@apache.org wrote:

 Hmm, I'm on a Mac and running on the command line.  What version of OS X
 and what JVM?

 On Nov 12, 2009, at 5:49 PM, Sean Owen wrote:

  I still see that test failure I mentioned, but only happens on the
  command line (and perhaps only on a Mac). It is to do with a double
  value being compared for exact equality. I fixed it but it's hardly
  a blocker. Otherwise +1
 
  On Thu, Nov 12, 2009 at 9:58 PM, Grant Ingersoll gsing...@apache.org
 wrote:
  Please vote on releasing the artifacts at:
 
 https://repository.apache.org/content/repositories/orgapachemahout-002/org/apache/mahout/
 
  KEYS file is in the Mahout root trunk.
 
  Things to do before voting:
 
  1. Download and verify signatures on all the artifacts.
  2. Try out the tests, examples, etc.
  3. Try it out in any apps that you have.
  4. See the Apache pages on releases and see what else I'm missing.
  5. Others?





-- 
Ted Dunning, CTO
DeepDyve

[jira] Updated: (MAHOUT-199) Parent POM missing in public maven repository

2009-11-12 Thread Matthias Friedrich (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Matthias Friedrich updated MAHOUT-199:
--

Description:
I wanted to play with Taste and thus created a Maven project that referenced
the mahout-core artifact. But since mahout-parent isn't deployed, I couldn't
build my project. I had to download the 0.1 release and install the parent POM
in my local repository (cd mahout-0.1/maven mvn install).

Steps to reproduce:

$ mvn archetype:create -DgroupId=de.mafr.demo -DartifactId=MahoutDemo
$ cd MahoutDemo
$ vi MahoutDemo
add dependencies listed below
$ mvn package

The dependency section I added:

<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.1</version>
</dependency>

Could you please deploy the parent POM? It would make it a lot easier to play
with Mahout/Taste. Thanks in advance for your help!

was:
I wanted to play with Taste and thus created a Maven project that referenced
the mahout-core artifact. But since mahout-parent isn't deployed, I couldn't
build my project. I had to download the 0.1 release and install the parent POM
in my local repository (cd mahout-0.1/maven mvn install).

Steps to reproduce:

$ mvn archetype:create -DgroupId=de.mafr.demo -DartifactId=MahoutDemo
$ cd MahoutDemo
$ vi MahoutDemo
# add dependencies listed below
$ mvn package

The dependency section I added:

<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.1</version>
</dependency>

Could you please deploy the parent POM? It would make it a lot easier to play
with Mahout/Taste. Thanks in advance for your help!

Parent POM missing in public maven repository
-

Key: MAHOUT-199
URL: https://issues.apache.org/jira/browse/MAHOUT-199
Project: Mahout
Issue Type: Wish
Affects Versions: 0.1
Environment: Maven 2.0.9
Reporter: Matthias Friedrich
Priority: Minor

I wanted to play with Taste and thus created a Maven project that referenced
the mahout-core artifact. But since mahout-parent isn't deployed, I couldn't
build my project. I had to download the 0.1 release and install the parent
POM in my local repository (cd mahout-0.1/maven mvn install).
Steps to reproduce:
$ mvn archetype:create -DgroupId=de.mafr.demo -DartifactId=MahoutDemo
$ cd MahoutDemo
$ vi MahoutDemo
add dependencies listed below
$ mvn package
The dependency section I added:
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.1</version>
</dependency>
Could you please deploy the parent POM? It would make it a lot easier to play
with Mahout/Taste. Thanks in advance for your help!

