[jira] Commented: (MAHOUT-19) Hierarchial clusterer
[ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663319#action_12663319 ]

Ankur commented on MAHOUT-19:

Hi Karl,

Welcome back :-) Can you share the following few things about this patch?

1. Assuming you are training the tree top-down, what division criterion are you using?
2. How well does it scale?
3. Was the data on which this was tried sparse?
4. What distance metric has been used?

Basically, I have a use case in which I have a set of 5 - 10 million URLs that have an inherent hierarchical relationship, plus a set of user clicks on them. I would like to cluster them in a tree and use the model to answer near-neighborhood-type queries, i.e. which URLs are related to which other URLs. I did implement a sequential bottom-up hierarchical clustering algorithm, but its complexity is too high for my data set. I then thought about implementing a top-down hierarchical clustering algorithm using the Jaccard coefficient as my distance measure, and came across this patch. Can you suggest whether this patch will help?

Hierarchial clusterer

Key: MAHOUT-19
URL: https://issues.apache.org/jira/browse/MAHOUT-19
Project: Mahout
Issue Type: New Feature
Components: Clustering
Reporter: Karl Wettin
Assignee: Karl Wettin
Priority: Minor
Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png

In a hierarchical clusterer the instances are the leaf nodes of a tree, and each branch node contains the mean features of, and the distance between, its children.

For performance reasons I have always trained trees from the top down. I have been told that this can cause various effects, which I have never encountered. And I believe Huffman solved his problem by training bottom-up?

The thing is, I don't think it is possible to train the tree top-down using map reduce. I do however think it is possible to train it bottom-up.
I would very much appreciate any thoughts on this.

Once this tree is trained, one can extract clusters in various ways. The mean distance between all instances is usually a good maximum distance to allow between nodes when navigating the tree in search of a cluster.

Navigating the tree and gathering nodes that are not too far away from each other is usually instant if the tree is available in memory or persisted in a smart way. In my experience there is not much to gain from extracting all clusters up front. Also, it usually makes sense to let the user modify the cluster boundary variables in real time using a slider, or perhaps to present the named summary of neighbouring clusters, blacklist paths in the tree, etc.

It is also not too bad to use secondary classification on the instances to create wormholes in the tree.

I always thought it would be cool to visualize it using Touchgraph.

My focus is on clustering text documents for an instant "more like this" feature in search engines, using Tanimoto similarity on the vector spaces to calculate the distance.

See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a hierarchical clusterer.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
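The cluster extraction described in the issue can be sketched roughly as follows. This is only an illustration in Python, not code from the MAHOUT-19 patch or from LUCENE-1025: the `Node` class, `branch`, `leaves` and `extract_cluster` are hypothetical names, documents are assumed to be sparse term-weight dictionaries, and the distance is taken to be 1 minus the Tanimoto similarity (which, on binary vectors, is the same as the Jaccard coefficient mentioned in the comment).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def tanimoto_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Return 1 - Tanimoto similarity of two sparse vectors (term -> weight)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = sum(w * w for w in a.values())
    norm_b = sum(w * w for w in b.values())
    denom = norm_a + norm_b - dot
    # Two empty vectors are treated as identical (assumption).
    return 1.0 - dot / denom if denom else 0.0


@dataclass
class Node:
    """Hypothetical tree node: leaves carry a name, branches a child distance."""
    name: str = ""
    children: List["Node"] = field(default_factory=list)
    distance: float = 0.0  # distance between the children of a branch node
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)


def branch(left: Node, right: Node, distance: float) -> Node:
    """Join two subtrees under a new branch node."""
    node = Node(children=[left, right], distance=distance)
    left.parent = right.parent = node
    return node


def leaves(node: Node) -> List[str]:
    """Collect the names of all leaf instances below a node."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaves(child)]


def extract_cluster(leaf: Node, max_distance: float) -> List[str]:
    """Climb from a leaf while the branch distance stays within the limit,
    then return every instance below the highest branch reached."""
    top = leaf
    while top.parent is not None and top.parent.distance <= max_distance:
        top = top.parent
    return leaves(top)


# Made-up documents, for illustration only.
docs = {
    "a": {"mahout": 1.0, "cluster": 1.0},
    "b": {"mahout": 1.0, "cluster": 1.0, "tree": 1.0},
    "c": {"maven": 1.0, "build": 1.0},
}
a, b, c = (Node(name=n) for n in docs)
ab = branch(a, b, tanimoto_distance(docs["a"], docs["b"]))
# "c" shares no terms with "a" or "b", so its Tanimoto distance to them is 1.0.
root = branch(ab, c, 1.0)

print(extract_cluster(a, 0.5))
print(extract_cluster(a, 1.0))
```

Because the tree is only walked, never rebuilt, changing `max_distance` (for example from a slider) just means calling `extract_cluster` again with a different limit.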
Re: Inactivity
How about we move to Maven and then try producing an RC with it? The job would be:

1. Get everything working w/ Maven, including building the necessary WAR, etc.
2. Make sure all tests pass, etc. I believe Karl has done a lot of this already.
3. Remove the Ant files (or at least mark them as deprecated by moving them to build-deprecated.xml, so that we can still refer to them if we need to quickly remember something).
4. Create an RC.

I will help where I can.

-Grant

On Jan 12, 2009, at 1:37 PM, Isabel Drost wrote:

On Friday 09 January 2009, Grant Ingersoll wrote:

I feel really comfortable at this point using Maven plus the ANT plugin, when needed. I think this resolves my concerns about the last 10% of Maven that is needed for customization that Ant is really good at, while allowing for the other 90% to just work, as in Maven.

I am fine with Maven as well.

On Jan 9, 2009, at 1:59 PM, Sean Owen wrote:

Nothing from my perspective. Would definitely like to get 0.1 out!

+1

Karl, if you need any help, just tell me.

Isabel

--
The secret of happiness is total disregard of everybody.
Web: http://www.isabel-drost.de
IM: xmpp://main...@spaceboyz.net
Re: Inactivity
I'm fine with it. I don't know enough about Maven to do it properly. I would be happy to get used to whatever system is set up.

The only concern I have is that some bits in my taste-build file may not translate into Maven, right? Don't delete it, at least. Is this where you end up having an Ant build file and integrating it into Maven somehow -- that is, is it possible we will retain Ant files anyway for special tasks? That would be fine by me for sure. If not, as long as we can leave it, that's fine too.

Sean

On Tue, Jan 13, 2009 at 2:00 PM, Grant Ingersoll gsing...@apache.org wrote:

How about we move to Maven and then try producing an RC with it? The job would be:

1. Get everything working w/ Maven, including building the necessary WAR, etc.
2. Make sure all tests pass, etc. I believe Karl has done a lot of this already.
3. Remove the Ant files (or at least mark them as deprecated by moving them to build-deprecated.xml, so that we can still refer to them if we need to quickly remember something).
4. Create an RC.

I will help where I can.

-Grant
[jira] Updated: (MAHOUT-95) UserSimilarity-based NearestNNeighborhood
[ https://issues.apache.org/jira/browse/MAHOUT-95?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated MAHOUT-95:

Attachment: MAHOUT-95.patch

Cleaned up version:
* No TopItems modifications
* Addition of NearestNUserNeighborhood.MinSimilarityEstimator
* Additional NearestNUserNeighborhood ctor that takes minSimilarity and uses MinSimilarityEstimator if minSimilarity 0.0

UserSimilarity-based NearestNNeighborhood

Key: MAHOUT-95
URL: https://issues.apache.org/jira/browse/MAHOUT-95
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Reporter: Otis Gospodnetic
Priority: Minor
Fix For: 0.1
Attachments: MAHOUT-95-diff-against-nearestN.txt, MAHOUT-95.patch, MAHOUT-95.patch, MAHOUT-95.patch

A variation of NearestNUserNeighborhood. This version adds the minSimilarity parameter, which is the primary factor for including/excluding other users from the target user's neighbourhood. Additionally, the 'n' parameter was renamed to maxHoodSize and is used to optionally limit the size of the neighbourhood.

The patch is for a brand new class, but we may really want just a single class (either keep this one and axe NearestNUserNeighborhood, or add this functionality to NearestNUserNeighborhood), if this sounds good.

I'll update the unit test and provide a patch for that if others think this can go in. Thoughts?
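The selection rule described above (a similarity threshold first, an optional size cap second) might look roughly like the sketch below. It is a Python illustration only, not the Java code in MAHOUT-95.patch: `nearest_neighborhood`, `jaccard` and `click_similarity` are hypothetical names, the threshold is assumed to be inclusive, and Jaccard overlap of click sets merely stands in for whatever UserSimilarity is plugged in.

```python
from typing import Callable, Iterable, List, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient of two sets; 0.0 when both are empty (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighborhood(
    target: str,
    users: Iterable[str],
    similarity: Callable[[str, str], float],
    min_similarity: float,
    max_hood_size: Optional[int] = None,
) -> List[str]:
    """Return users whose similarity to target is at least min_similarity,
    most similar first, optionally truncated to max_hood_size entries."""
    scored = [(similarity(target, u), u) for u in users if u != target]
    # min_similarity is the primary filter (inclusive here, by assumption).
    kept = [(s, u) for s, u in scored if s >= min_similarity]
    # Stable sort: users with equal similarity keep their input order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    if max_hood_size is not None:
        kept = kept[:max_hood_size]
    return [u for _, u in kept]


# Made-up click data, for illustration only.
clicks = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"c"},
    "u4": {"x"},
}


def click_similarity(u: str, v: str) -> float:
    return jaccard(clicks[u], clicks[v])


print(nearest_neighborhood("u1", clicks, click_similarity, 0.3))
print(nearest_neighborhood("u1", clicks, click_similarity, 0.3, max_hood_size=1))
```

With the threshold as the primary criterion, the neighbourhood can be smaller than the cap or even empty, which is the main behavioural difference from a plain "top n" neighbourhood.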
[jira] Updated: (MAHOUT-95) UserSimilarity-based NearestNNeighborhood
[ https://issues.apache.org/jira/browse/MAHOUT-95?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated MAHOUT-95:

Fix Version/s: 0.1
[jira] Commented: (MAHOUT-95) UserSimilarity-based NearestNNeighborhood
[ https://issues.apache.org/jira/browse/MAHOUT-95?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663502#action_12663502 ]

Sean Owen commented on MAHOUT-95:

Looks OK. I think you don't need two Estimators -- just overload one to serve both purposes. With that, feel free to commit.
how can I contribute
Hi all, I am a PhD student, working in AI Planning. I would like to contribute to Mahout - say around 5hrs per week. Any suggestions? Best Regards -Amit
Re: how can I contribute
That sounds great!

One thing Mahout doesn't have is a good collection of tree-based methods. My own favorite lately is random forests.

What are your interests?

On Jan 13, 2009, at 19:01, Amit Kumar in4tu...@gmail.com wrote:

Hi all, I am a PhD student, working in AI Planning. I would like to contribute to Mahout - say around 5hrs per week. Any suggestions?

Best Regards
-Amit
Re: how can I contribute
On Wednesday 14 January 2009, Amit Kumar wrote:

I am a PhD student, working in AI Planning. I would like to contribute to Mahout - say around 5hrs per week. Any suggestions?

Great. Welcome Amit. I am always interested in what problems people are working on. Maybe you can tell us a little more about what you have been working on so far, and exactly which problem settings you want to solve with Mahout?

Isabel

--
You are false data.
Web: http://www.isabel-drost.de
IM: xmpp://main...@spaceboyz.net