date:20090113

[jira] Commented: (MAHOUT-19) Hierarchial clusterer

2009-01-13 Thread Ankur (JIRA)

[
https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663319#action_12663319
]

Ankur commented on MAHOUT-19:
-

Hi Karl, Welcome back :-)
Can you share the following few things about this patch?

1. Assuming you are training the tree top-down, what division criterion are
you using?
2. How well does it scale?
3. Was the data on which this was tried sparse?
4. Which distance metric was used?

Basically, I have a use case in which I have a set of 5-10 million URLs that
have an inherent hierarchical relationship, plus a set of user clicks on them. I
would like to cluster them in a tree and use the model to answer
near-neighbor type queries, i.e. which URLs are related to which other URLs. I
implemented a sequential bottom-up hierarchical clustering algorithm, but its
complexity is too high for my data set. I then thought about implementing a
top-down hierarchical clustering algorithm using the Jaccard coefficient as my
distance measure, and came across this patch.

Can you tell me whether this patch would help?
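
For illustration, the Jaccard-based distance mentioned above could look roughly
like the following. This is only a sketch: the click data, the URL names and the
helper names (jaccard_distance, nearest) are hypothetical and are not taken from
the patch. Each URL is represented by the set of users who clicked on it.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a & b| / |a | b|. Two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical click data: URL -> set of ids of the users who clicked it.
clicks = {
    "example.com/sports/football": {"u1", "u2", "u3"},
    "example.com/sports/tennis": {"u2", "u3", "u4"},
    "example.com/finance/stocks": {"u5"},
}


def nearest(url: str, clicks: dict, k: int = 1) -> list:
    """Return the k URLs closest to `url` by Jaccard distance (brute force)."""
    scored = [
        (jaccard_distance(clicks[url], users), other)
        for other, users in clicks.items()
        if other != url
    ]
    return [other for _, other in sorted(scored)[:k]]
```

The brute-force scan is quadratic over all URL pairs, which is exactly the cost
a tree-shaped model is meant to avoid at 5-10 million URLs; the sketch only pins
down the distance itself.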

Hierarchial clusterer
-

Key: MAHOUT-19
URL: https://issues.apache.org/jira/browse/MAHOUT-19
Project: Mahout
Issue Type: New Feature
Components: Clustering
Reporter: Karl Wettin
Assignee: Karl Wettin
Priority: Minor
Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt,
MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png

In a hierarchical clusterer the instances are the leaf nodes in a tree where
branch nodes contain the mean features of, and the distance between, their
children.
For performance reasons I always trained trees from the top-down. I have
been told that it can cause various effects I never encountered. And I
believe Huffman solved his problem by training bottom-up? The thing is, I
don't think it is possible to train the tree top-down using map reduce. I do
however think it is possible to train it bottom-up. I would very much
appreciate any thoughts on this.
Once this tree is trained one can extract clusters in various ways. The mean
distance between all instances is usually a good maximum distance to allow
between nodes when navigating the tree in search of a cluster.
Navigating the tree and gathering nodes that are not too far away from each
other is usually instant if the tree is available in memory or persisted in a
smart way. In my experience there is not much to gain from extracting all
clusters up front. Also, it usually makes sense to allow the user to
modify the cluster boundary variables in real time using a slider, or perhaps
to present the named summary of neighbouring clusters, blacklist paths in the
tree, etc. It is also not too bad to use secondary classification on the
instances to create wormholes in the tree. I always thought it would be cool
to visualize it using Touchgraph.
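
One possible reading of the extraction step described above, as a minimal
sketch: start at a leaf and climb towards the root for as long as the distance
stored on the parent branch stays within a maximum distance, then collect every
leaf below the node reached. The Node class and the helpers branch, leaves and
cluster_around are assumptions made for this example, not the API of the
attached patch, and the mean feature vector kept on each branch is left out for
brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    name: str = ""                      # set on leaves only
    distance: float = 0.0               # distance between the children (0 for a leaf)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None


def branch(left: Node, right: Node, distance: float) -> Node:
    """Join two subtrees under a new branch node that records their distance."""
    node = Node(distance=distance, children=[left, right])
    left.parent = node
    right.parent = node
    return node


def leaves(node: Node) -> List[str]:
    """Return the names of all leaves below `node`, left to right."""
    if not node.children:
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(leaves(child))
    return names


def cluster_around(leaf: Node, max_distance: float) -> List[str]:
    """Climb while the parent's child distance is within max_distance."""
    top = leaf
    while top.parent is not None and top.parent.distance <= max_distance:
        top = top.parent
    return leaves(top)


# Hypothetical tree: ((a, b) at 0.2, c) at 0.5, then d joined at 0.9.
a, b, c, d = (Node(name=n) for n in "abcd")
ab = branch(a, b, 0.2)
abc = branch(ab, c, 0.5)
root = branch(abc, d, 0.9)
```

Because only parent pointers are followed, changing max_distance (for example
from a slider) just re-runs a walk whose length is bounded by the tree depth,
which fits the remark that nothing needs to be extracted up front.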
My focus is on clustering text documents for an instant "more like this" feature
in search engines, using Tanimoto similarity on the vector spaces to
calculate the distance.
See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a
hierarchical clusterer.
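
As a rough sketch of the distance mentioned above, the Tanimoto similarity of
two vectors a and b is a.b / (|a|^2 + |b|^2 - a.b), and the distance can be
taken as one minus that value. The sparse term-weight dictionaries and the
function names below are assumptions for this example; they are not taken from
the patch or from LUCENE-1025.

```python
def tanimoto_similarity(a: dict, b: dict) -> float:
    """Tanimoto similarity of two sparse vectors given as term -> weight maps."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = sum(weight * weight for weight in a.values())
    norm_b = sum(weight * weight for weight in b.values())
    denominator = norm_a + norm_b - dot
    # Assumption: two all-zero vectors are treated as identical.
    return dot / denominator if denominator else 1.0


def tanimoto_distance(a: dict, b: dict) -> float:
    """Distance derived from the similarity; 0 means identical vectors."""
    return 1.0 - tanimoto_similarity(a, b)


# Hypothetical documents with binary term weights.
doc_a = {"apache": 1, "mahout": 1, "cluster": 1}
doc_b = {"apache": 1, "mahout": 1, "lucene": 1}
```

With binary weights, as in the two sample documents, this reduces to the Jaccard
coefficient of the two term sets: 2 shared terms out of 4 distinct terms.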

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: Inactivity

2009-01-13 Thread Grant Ingersoll


How about we move to Maven and then try producing an RC with it?

The job would be:
1. Get everything working w/ Maven, including building the necessary WAR,
etc. Make sure all tests pass, etc. I believe Karl has done a lot
of this already.
2. Remove the Ant files (or at least mark them as deprecated somehow by
moving them to build-deprecated.xml so that we can still refer to them
if we need to quickly remember something).
3. Create an RC.

I will help where I can.

-Grant


On Jan 12, 2009, at 1:37 PM, Isabel Drost wrote:


On Friday 09 January 2009, Grant Ingersoll wrote:

I feel really comfortable at this point using Maven plus the ANT
plugin, when needed. I think this resolves my concerns about the last
10% of Maven that is needed for customization that Ant is really good
at, while allowing for the other 90% to just work, as in Maven.


I am fine with Maven as well.



On Jan 9, 2009, at 1:59 PM, Sean Owen wrote:

Nothing from my perspective. Would definitely like to get 0.1 out!


+1 Karl, if you need any help, just tell me.

Isabel

--
The secret of happiness is total disregard of everybody.
 |\  _,,,---,,_   Web:   http://www.isabel-drost.de
 /,`.-'`'-.  ;-;;,_
|,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  xmpp://main...@spaceboyz.net

Re: Inactivity

2009-01-13 Thread Sean Owen

I'm fine with it. I don't know enough about Maven to do it properly. I
would be happy to get used to whatever system is set up.

The only concern I have is that some bits in my taste-build file may
not translate into Maven, right? Don't delete it, at least. Is this
where you end up having an Ant build file and integrating into Maven
somehow -- that is, is it possible we will retain Ant files anyway for
special tasks? That would be fine by me for sure. If not, as long as
we can leave it, that's fine too.

Sean

On Tue, Jan 13, 2009 at 2:00 PM, Grant Ingersoll gsing...@apache.org wrote:
 How about we move to Maven and then try producing an RC with it?

 The job would be:
 Get everything working w/ Maven, including building the necessary WAR, etc.
  Make sure all tests pass, etc.   I believe Karl has done a lot of this
 already.
 Remove the Ant files (or at least mark them as deprecated somehow by moving
 them to build-deprecated.xml so that we can still refer to them if we need
 to quickly remember something)
 Create an RC

 I will help where I can.

 -Grant

[jira] Updated: (MAHOUT-95) UserSimilarity-based NearestNNeighborhood

2009-01-13 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-95?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated MAHOUT-95:
---

Attachment: MAHOUT-95.patch

Cleaned up version:
* No TopItems modifications
* Addition of NearestNUserNeighborhood.MinSimilarityEstimator
* Additional NearestNUserNeighborhood ctor that takes minSimilarity and uses 
MinSimilarityEstimator if minSimilarity  0.0


 UserSimilarity-based NearestNNeighborhood
 -

 Key: MAHOUT-95
 URL: https://issues.apache.org/jira/browse/MAHOUT-95
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Reporter: Otis Gospodnetic
Priority: Minor
 Fix For: 0.1

 Attachments: MAHOUT-95-diff-against-nearestN.txt, MAHOUT-95.patch, 
 MAHOUT-95.patch, MAHOUT-95.patch


 A variation of NearestNUserNeighborhood.  This version adds the minSimilarity 
 parameter, which is the primary factor for including/excluding other users 
 from the target user's neighbourhood.  Additionally, the 'n' parameter was 
 renamed to maxHoodSize and is used to optionally limit the size of the 
 neighbourhood.
 The patch is for a brand new class, but we may really want just a single 
 class (either keep this one and axe NearestNUserNeighborhood or add this 
 functionality to NearestNUserNeighborhood), if this sounds good.
 I'll update the unit test and provide a patch for that if others think this 
 can go in.
 Thoughts?
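
A minimal sketch of the selection rule described above, for illustration only:
keep every other user whose similarity to the target reaches minSimilarity, then
optionally cap the result at maxHoodSize, most similar first. The function
threshold_neighborhood, the lookup helper and the similarity table are
hypothetical; this is not the code in MAHOUT-95.patch, and whether the patch
compares with >= or > is an assumption here.

```python
def threshold_neighborhood(target, users, similarity, min_similarity,
                           max_hood_size=None):
    """Return users at least min_similarity similar to target, best first."""
    scored = []
    for other in users:
        if other == target:
            continue
        score = similarity(target, other)
        if score >= min_similarity:      # assumption: inclusive threshold
            scored.append((score, other))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if max_hood_size is not None:        # the size cap is optional
        scored = scored[:max_hood_size]
    return [other for _, other in scored]


# Hypothetical precomputed user-user similarities.
sims = {("alice", "bob"): 0.9, ("alice", "carol"): 0.4, ("alice", "dave"): 0.7}


def lookup(a, b):
    """Symmetric lookup in the sample table; unknown pairs score 0.0."""
    return sims.get((a, b), sims.get((b, a), 0.0))


users = ["alice", "bob", "carol", "dave"]
```

With this sample data, a threshold of 0.5 keeps bob and dave for alice, and
adding max_hood_size=1 keeps only bob.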


[jira] Updated: (MAHOUT-95) UserSimilarity-based NearestNNeighborhood

2009-01-13 Thread Otis Gospodnetic (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAHOUT-95?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated MAHOUT-95:
---

Fix Version/s: 0.1


[jira] Commented: (MAHOUT-95) UserSimilarity-based NearestNNeighborhood

2009-01-13 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/MAHOUT-95?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663502#action_12663502
 ] 

Sean Owen commented on MAHOUT-95:
-

Looks OK. I think you don't need two Estimators -- just overload one to serve 
both purposes. With that, feel free to commit.


how can I contribute

2009-01-13 Thread Amit Kumar

Hi all,

I am a PhD student, working in AI Planning. I would like to contribute to
Mahout - say around 5hrs per week. Any suggestions?

Best Regards
-Amit

Re: how can I contribute

2009-01-13 Thread Ted Dunning



That sounds great!

One thing Mahout doesn't have is a good collection of tree-based
methods. My own favorite lately is random forests.


What are your interests?

On Jan 13, 2009, at 19:01, Amit Kumar in4tu...@gmail.com wrote:


Hi all,

I am a PhD student, working in AI Planning. I would like to contribute to
Mahout - say around 5hrs per week. Any suggestions?

Best Regards
-Amit

Re: how can I contribute

2009-01-13 Thread Isabel Drost

On Wednesday 14 January 2009, Amit Kumar wrote:
 I am a PhD student, working in AI Planning. I would like to contribute to
 Mahout - say around 5hrs per week. Any suggestions?

Great. Welcome Amit.

I am always interested in what problems people are working on. Maybe you can 
tell us a little more on what you have been working on so far and exactly 
which problem settings you want to solve with Mahout?

Isabel

-- 
You are false data.
  |\  _,,,---,,_   Web:   http://www.isabel-drost.de
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  xmpp://main...@spaceboyz.net


