[jira] Commented: (MAHOUT-19) Hierarchial clusterer

2009-01-13 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663319#action_12663319
 ] 

Ankur commented on MAHOUT-19:
-

Hi Karl, welcome back :-)
Can you share a few things about this patch?

1. Assuming you are training the tree top-down, what division criterion are 
you using?
2. How well does it scale?
3. Was the data it was tried on sparse?
4. Which distance metric was used?

Basically, I have a use case with a set of 5 - 10 million URLs that have an 
inherent hierarchical relationship, plus a set of user clicks on them. I 
would like to cluster them in a tree and use the model to answer 
near-neighborhood queries, i.e. which URLs are related to which other URLs. I 
did implement a sequential bottom-up hierarchical clustering algorithm, but 
its complexity is too high for my data set. I then thought about implementing 
a top-down hierarchical clustering algorithm using the Jaccard coefficient as 
my distance measure, and came across this patch.
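
For illustration only, here is a minimal sketch of a Jaccard-based distance 
over click data. It assumes each URL is represented by the set of users who 
clicked it; the `clicks` data, `jaccard_distance` and `nearest_urls` are 
hypothetical names, and the brute-force lookup is not meant to scale to 
millions of URLs or to reflect what the patch does.

```python
def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical click data: URL -> ids of the users who clicked it.
clicks = {
    "/sports/football": {"u1", "u2", "u3"},
    "/sports/tennis": {"u2", "u3", "u4"},
    "/finance/stocks": {"u5"},
}


def nearest_urls(url, clicks, k=2):
    """Rank the other URLs by Jaccard distance to `url` (brute force)."""
    scored = [(jaccard_distance(clicks[url], users), other)
              for other, users in clicks.items() if other != url]
    return [other for _, other in sorted(scored)[:k]]
```

A tree-based model would replace the linear scan in `nearest_urls` with a 
walk from the leaf for `url` towards its closest siblings.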

Can you suggest whether this patch will help?

 Hierarchial clusterer
 -

 Key: MAHOUT-19
 URL: https://issues.apache.org/jira/browse/MAHOUT-19
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Reporter: Karl Wettin
Assignee: Karl Wettin
Priority: Minor
 Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, 
 MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png


 In a hierarchical clusterer the instances are the leaf nodes in a tree where 
 branch nodes contain the mean features of and the distance between their 
 children.
 For performance reasons I always trained trees from the top-down. I have 
 been told that it can cause various effects I never encountered. And I 
 believe Huffman solved his problem by training bottom-up? The thing is, I 
 don't think it is possible to train the tree top-down using map reduce. I do 
 however think it is possible to train it bottom-up. I would very much 
 appreciate any thoughts on this.
 Once this tree is trained one can extract clusters in various ways. The mean 
 distance between all instances is usually a good maximum distance to allow 
 between nodes when navigating the tree in search of a cluster. 
 Navigating the tree and gathering nodes that are not too far away from each 
 other is usually instant if the tree is available in memory or persisted in a 
 smart way. In my experience there is not much to gain from extracting all 
 clusters from the start. Also, it usually makes sense to let the user 
 modify the cluster boundary variables in real time using a slider, or perhaps 
 to present the named summary of neighbouring clusters, blacklist paths in the 
 tree, etc. It is also not too bad to use secondary classification on the 
 instances to create wormholes in the tree. I always thought it would be cool 
 to visualize it using Touchgraph.
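
 One possible reading of the tree and the cluster extraction described above, 
 as a minimal sketch. It is not the code from the patch: `Node`, `branch` and 
 `extract_clusters` are hypothetical names, the branch mean is an unweighted 
 average of the two child means, and the cut assumes that distances do not 
 grow on the way down the tree.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    mean: List[float]                # mean features of everything below
    distance: float = 0.0            # distance between the two children
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None      # set on leaf nodes (instances) only


def leaf(label, features):
    return Node(mean=list(features), label=label)


def branch(left, right, distance):
    # Simplification: unweighted average of the two child means.
    mean = [(x + y) / 2.0 for x, y in zip(left.mean, right.mean)]
    return Node(mean=mean, distance=distance, left=left, right=right)


def leaves(node):
    if node.label is not None:
        return [node.label]
    return leaves(node.left) + leaves(node.right)


def extract_clusters(node, max_distance):
    """Cut the tree at every branch whose children are farther apart
    than max_distance; each remaining subtree is one cluster."""
    if node.label is not None or node.distance <= max_distance:
        return [leaves(node)]
    return (extract_clusters(node.left, max_distance)
            + extract_clusters(node.right, max_distance))


# Toy tree: two tight pairs that are far from each other.
ab = branch(leaf("a", [0, 0]), leaf("b", [0, 1]), 1.0)
cd = branch(leaf("c", [5, 5]), leaf("d", [5, 6]), 1.0)
root = branch(ab, cd, 7.0)
```

 Because `max_distance` is only read at query time, changing it (for example 
 from a slider) does not require retraining the tree.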
 My focus is on clustering text documents for an instant "more like this" 
 feature in search engines, using Tanimoto similarity on the vector spaces to 
 calculate the distance.
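
 As a hedged illustration of the measure named above: for two weight vectors 
 a and b, Tanimoto similarity is commonly defined as 
 a.b / (|a|^2 + |b|^2 - a.b). The sketch below assumes documents are sparse 
 term -> weight dictionaries; the function names are hypothetical and the 
 patch may compute the distance differently.

```python
def tanimoto_similarity(a, b):
    """Tanimoto similarity of two sparse vectors given as term -> weight dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sum(weight * weight for weight in a.values())
    norm_b = sum(weight * weight for weight in b.values())
    denom = norm_a + norm_b - dot
    # Assumption: two all-zero vectors are treated as identical.
    return dot / denom if denom else 1.0


def tanimoto_distance(a, b):
    return 1.0 - tanimoto_similarity(a, b)
```

 With binary weights this reduces to the Jaccard coefficient on the sets of 
 terms.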
 See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a 
 hierarchical clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Inactivity

2009-01-13 Thread Grant Ingersoll

How about we move to Maven and then try producing an RC with it?

The job would be:
1. Get everything working w/ Maven, including building the necessary WAR,
   etc. Make sure all tests pass, etc. I believe Karl has done a lot of
   this already.
2. Remove the Ant files (or at least mark them as deprecated somehow by
   moving them to build-deprecated.xml so that we can still refer to them
   if we need to quickly remember something).
3. Create an RC.

I will help where I can.

-Grant


On Jan 12, 2009, at 1:37 PM, Isabel Drost wrote:


> On Friday 09 January 2009, Grant Ingersoll wrote:
>> I feel really comfortable at this point using Maven plus the ANT
>> plugin, when needed.  I think this resolves my concerns about the last
>> 10% of Maven that is needed for customization that Ant is really good
>> at, while allowing for the other 90% to just work, as in Maven.
>
> I am fine with Maven as well.
>
>> On Jan 9, 2009, at 1:59 PM, Sean Owen wrote:
>>> Nothing from my perspective. Would definitely like to get 0.1 out!
>
> +1 Karl, if you need any help, just tell me.
>
> Isabel

--
The secret of happiness is total disregard of everybody.
 |\  _,,,---,,_   Web:   http://www.isabel-drost.de
 /,`.-'`'-.  ;-;;,_
|,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  xmpp://main...@spaceboyz.net





Re: Inactivity

2009-01-13 Thread Sean Owen
I'm fine with it. I don't know enough about Maven to do it properly. I
would be happy to get used to whatever system is set up.

The only concern I have is that some bits in my taste-build file may
not translate into Maven, right? Don't delete it, at least. Is this
where you end up having an Ant build file and integrating it into Maven
somehow -- that is, is it possible we will retain Ant files anyway for
special tasks? That would be fine by me for sure. If not, as long as
we can leave it, that's fine too.

Sean

On Tue, Jan 13, 2009 at 2:00 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> How about we move to Maven and then try producing an RC with it?
>
> The job would be:
> 1. Get everything working w/ Maven, including building the necessary WAR,
>    etc. Make sure all tests pass, etc. I believe Karl has done a lot of
>    this already.
> 2. Remove the Ant files (or at least mark them as deprecated somehow by
>    moving them to build-deprecated.xml so that we can still refer to them
>    if we need to quickly remember something).
> 3. Create an RC.
>
> I will help where I can.
>
> -Grant


[jira] Updated: (MAHOUT-95) UserSimilarity-based NearestNNeighborhood

2009-01-13 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-95?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated MAHOUT-95:
---

Attachment: MAHOUT-95.patch

Cleaned up version:
* No TopItems modifications
* Addition of NearestNUserNeighborhood.MinSimilarityEstimator
* Additional NearestNUserNeighborhood ctor that takes minSimilarity and uses 
MinSimilarityEstimator if minSimilarity  0.0


 UserSimilarity-based NearestNNeighborhood
 -

 Key: MAHOUT-95
 URL: https://issues.apache.org/jira/browse/MAHOUT-95
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Reporter: Otis Gospodnetic
Priority: Minor
 Fix For: 0.1

 Attachments: MAHOUT-95-diff-against-nearestN.txt, MAHOUT-95.patch, 
 MAHOUT-95.patch, MAHOUT-95.patch


 A variation of NearestNUserNeighborhood.  This version adds the minSimilarity 
 parameter, which is the primary factor for including/excluding other users 
 from the target user's neighbourhood.  Additionally, the 'n' parameter was 
 renamed to maxHoodSize and is used to optionally limit the size of the 
 neighbourhood.
 The patch is for a brand new class, but we may really want just a single 
 class (either keep this one and axe NearestNUserNeighborhood or add this 
 functionality to NearestNUserNeighborhood), if this sounds good.
 I'll update the unit test and provide a patch for that if others think this 
 can go in.
 Thoughts?
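
 The selection rule described above can be sketched as follows. This is an 
 illustration in Python, not the Java code from the patch: `neighborhood` and 
 `toy_similarity` are hypothetical names, and treating the threshold as 
 inclusive is an assumption.

```python
def neighborhood(target, users, similarity, min_similarity, max_hood_size=None):
    """Users whose similarity to `target` reaches min_similarity,
    most similar first, optionally capped at max_hood_size."""
    scored = []
    for other in users:
        if other == target:
            continue
        score = similarity(target, other)
        if score >= min_similarity:      # assumption: threshold is inclusive
            scored.append((score, other))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if max_hood_size is not None:
        scored = scored[:max_hood_size]
    return [other for _, other in scored]


# Hypothetical pairwise similarities, for demonstration only.
sims = {("alice", "bob"): 0.9, ("alice", "carol"): 0.4, ("alice", "dave"): 0.7}


def toy_similarity(u, v):
    return sims.get((u, v), sims.get((v, u), 0.0))


users = ["alice", "bob", "carol", "dave"]
hood = neighborhood("alice", users, toy_similarity, min_similarity=0.5)
```

 Here the similarity threshold decides membership and `max_hood_size` only 
 trims the result, which matches the roles of minSimilarity and maxHoodSize 
 given in the description.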




[jira] Updated: (MAHOUT-95) UserSimilarity-based NearestNNeighborhood

2009-01-13 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-95?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated MAHOUT-95:
---

Fix Version/s: 0.1




[jira] Commented: (MAHOUT-95) UserSimilarity-based NearestNNeighborhood

2009-01-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-95?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663502#action_12663502
 ] 

Sean Owen commented on MAHOUT-95:
-

Looks OK. I think you don't need two Estimators -- just overload one to serve 
both purposes. With that, feel free to commit.




how can I contribute

2009-01-13 Thread Amit Kumar
Hi all,

I am a PhD student, working in AI Planning. I would like to contribute to
Mahout - say around 5hrs per week. Any suggestions?

Best Regards
-Amit


Re: how can I contribute

2009-01-13 Thread Ted Dunning


That sounds great!

One thing Mahout doesn't have is a good collection of tree-based
methods. My own favorite lately is random forests.

What are your interests?

On Jan 13, 2009, at 19:01, Amit Kumar <in4tu...@gmail.com> wrote:

> Hi all,
>
> I am a PhD student, working in AI Planning. I would like to contribute to
> Mahout - say around 5hrs per week. Any suggestions?
>
> Best Regards
> -Amit


Re: how can I contribute

2009-01-13 Thread Isabel Drost
On Wednesday 14 January 2009, Amit Kumar wrote:
> I am a PhD student, working in AI Planning. I would like to contribute to
> Mahout - say around 5hrs per week. Any suggestions?

Great. Welcome Amit.

I am always interested in what problems people are working on. Maybe you can 
tell us a little more on what you have been working on so far and exactly 
which problem settings you want to solve with Mahout?

Isabel

-- 
You are false data.
  |\  _,,,---,,_   Web:   http://www.isabel-drost.de
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  xmpp://main...@spaceboyz.net

