[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-153:
---
Attachment: Mahout-153.patch

Removed the change that made the lengthSquared instance variable transient. Used AbstractVector.equivalent for comparing two cluster centroids. Kindly review.

Implement kmeans++ for initial cluster selection in kmeans
--
Key: MAHOUT-153
URL: https://issues.apache.org/jira/browse/MAHOUT-153
Project: Mahout
Issue Type: New Feature
Components: Clustering
Affects Versions: 0.2
Environment: OS Independent
Reporter: Panagiotis Papadimitriou
Assignee: Ted Dunning
Fix For: 0.4
Attachments: Mahout-153.patch, Mahout-153.patch, Mahout-153.patch, MAHOUT-153_RandomFarthest.patch
Original Estimate: 336h
Remaining Estimate: 336h

The current implementation of k-means includes the following algorithms for initial cluster selection (seed selection): 1) random selection of k points, 2) use of canopy clusters. I plan to implement k-means++. The details of the algorithm are available here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.

Design Outline: I will create an abstract class SeedGenerator and a subclass KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will become a subclass of SeedGenerator.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
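For readers unfamiliar with the k-means++ seeding that the issue description refers to, here is a minimal sketch. It is written in Python for brevity (Mahout itself is Java) and is not taken from any attached patch; the names `kmeans_pp_seeds` and `sq_dist` are hypothetical. The first seed is drawn uniformly, and each later seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k seeds from points using k-means++ style D^2 sampling (illustrative)."""
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    # First seed: uniform at random.
    seeds = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest seed so far.
    d2 = [sq_dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        if sum(d2) == 0.0:
            # Every point coincides with a seed; fall back to a uniform draw.
            nxt = rng.choice(points)
        else:
            # Draw the next seed with probability proportional to d2.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        seeds.append(nxt)
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return seeds
```

Passing a seeded `random.Random` makes the selection reproducible, which is convenient in unit tests.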
[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-153:
---
Attachment: Mahout-153.patch

Kindly find the updated patch, which includes test cases. Also, the input and output formats are modified to be compatible with the other clustering algorithms (kmeans, fuzzy kmeans). The distance measure is given as an input parameter, and the floating-point comparison suggested by Shashi is taken care of. Kindly review.

Implement kmeans++ for initial cluster selection in kmeans
--
Key: MAHOUT-153
URL: https://issues.apache.org/jira/browse/MAHOUT-153
Project: Mahout
Issue Type: New Feature
Components: Clustering
Affects Versions: 0.2
Environment: OS Independent
Reporter: Panagiotis Papadimitriou
Assignee: Ted Dunning
Fix For: 0.4
Attachments: Mahout-153.patch, Mahout-153.patch, MAHOUT-153_RandomFarthest.patch
Original Estimate: 336h
Remaining Estimate: 336h
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846785#action_12846785 ]

Pallavi Palleti commented on MAHOUT-153:

Forgot to mention: in this patch, I made the lengthSquared instance variable in AbstractVector transient.

Implement kmeans++ for initial cluster selection in kmeans
--
Key: MAHOUT-153
URL: https://issues.apache.org/jira/browse/MAHOUT-153
Project: Mahout
Issue Type: New Feature
Components: Clustering
Affects Versions: 0.2
Environment: OS Independent
Reporter: Panagiotis Papadimitriou
Assignee: Ted Dunning
Fix For: 0.4
Attachments: Mahout-153.patch, Mahout-153.patch, MAHOUT-153_RandomFarthest.patch
Original Estimate: 336h
Remaining Estimate: 336h
[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-153:
---
Attachment: Mahout-153.patch

Here is the patch for selecting initial clusters for a clustering algorithm. The idea is taken from the paper "Farthest Point Heuristic Based Initialization Methods for K-Modes Clustering" (http://arxiv.org/pdf/cs/0610043). The attached patch follows the steps below:
1. The farthest-point heuristic starts with an arbitrary point s1.
2. Pick a point s2 that is as far from s1 as possible.
3. Pick si to maximize the distance to the nearest of all centroids picked so far. That is, maximize min {dist(si, s1), dist(si, s2), ...}.
4. After all k representatives are chosen, we can define the partition of D: cluster Cj consists of all points closer to sj than to any other representative.

Implement kmeans++ for initial cluster selection in kmeans
--
Key: MAHOUT-153
URL: https://issues.apache.org/jira/browse/MAHOUT-153
Project: Mahout
Issue Type: New Feature
Components: Clustering
Affects Versions: 0.2
Environment: OS Independent
Reporter: Panagiotis Papadimitriou
Assignee: Ted Dunning
Fix For: 0.4
Attachments: Mahout-153.patch
Original Estimate: 336h
Remaining Estimate: 336h
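The farthest-point steps listed in the comment above can be sketched as follows. This is an illustration in Python, not the code from Mahout-153.patch; the function names, the `first` parameter (index of the arbitrary starting point) and the default Euclidean distance are assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_seeds(points, k, distance=euclidean, first=0):
    """Choose k representatives with the farthest-point heuristic (illustrative)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    # Step 1: start from an arbitrary point (here: points[first]).
    seeds = [points[first]]
    # nearest[i] is the distance from points[i] to its closest seed so far.
    nearest = [distance(p, seeds[0]) for p in points]
    while len(seeds) < k:
        # Steps 2 and 3: take the point whose nearest seed is farthest away.
        idx = max(range(len(points)), key=nearest.__getitem__)
        seeds.append(points[idx])
        nearest = [min(d, distance(p, points[idx])) for d, p in zip(nearest, points)]
    return seeds


def assign(points, seeds, distance=euclidean):
    """Step 4: map each point to the index of its closest representative."""
    return [min(range(len(seeds)), key=lambda j: distance(p, seeds[j])) for p in points]
```

Keeping the running `nearest` list means each new representative costs one pass over the data rather than a comparison against every earlier representative.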
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831391#action_12831391 ]

Pallavi Palleti commented on MAHOUT-153:

Forgot to mention. The above patch doesn't include test cases. Kindly review.

Implement kmeans++ for initial cluster selection in kmeans
--
Key: MAHOUT-153
URL: https://issues.apache.org/jira/browse/MAHOUT-153
Project: Mahout
Issue Type: New Feature
Components: Clustering
Affects Versions: 0.2
Environment: OS Independent
Reporter: Panagiotis Papadimitriou
Assignee: Ted Dunning
Fix For: 0.4
Attachments: Mahout-153.patch
Original Estimate: 336h
Remaining Estimate: 336h
[jira] Created: (MAHOUT-284) In Fuzzy Kmeans, when the distance between centroid and the given point is zero, then it should belong to that cluster with probability 1 and rest with probability zero
In Fuzzy Kmeans, when the distance between centroid and the given point is zero, then it should belong to that cluster with probability 1 and rest with probability zero
--
Key: MAHOUT-284
URL: https://issues.apache.org/jira/browse/MAHOUT-284
Project: Mahout
Issue Type: Bug
Components: Clustering
Reporter: Pallavi Palleti
Priority: Minor

In Fuzzy Kmeans, when the distance between a centroid and the given point is zero, the point should belong to that cluster with probability 1 and to the rest with probability zero. However, right now, we are not doing that.
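The behaviour requested in this issue can be illustrated with the standard fuzzy c-means membership formula, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), which divides by zero when a point sits exactly on a centroid. The sketch below is Python written for this note, not the Mahout code; the function name is hypothetical, and splitting the membership equally when several centroids coincide with the point is an assumption of the example.

```python
def memberships(distances, m=2.0):
    """Fuzzy membership of one point in each cluster, given its distances to the centroids."""
    if m <= 1.0:
        raise ValueError("fuzziness m must be greater than 1")
    zeros = [i for i, d in enumerate(distances) if d == 0.0]
    if zeros:
        # The point coincides with at least one centroid: give those clusters
        # all of the membership (split equally, an assumption) and the rest zero.
        share = 1.0 / len(zeros)
        return [share if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    # Note the power, not a multiplication, in the ratio term.
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances) for d_i in distances]
```

Handling the zero case before evaluating the formula avoids a ZeroDivisionError and gives the membership of 1 that the issue asks for.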
[jira] Updated: (MAHOUT-284) In Fuzzy Kmeans, when the distance between centroid and the given point is zero, then it should belong to that cluster with probability 1 and rest with probability zero
[ https://issues.apache.org/jira/browse/MAHOUT-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-284:
---
Attachment: Mahout-284.patch

This patch fixes the issue.

In Fuzzy Kmeans, when the distance between centroid and the given point is zero, then it should belong to that cluster with probability 1 and rest with probability zero
--
Key: MAHOUT-284
URL: https://issues.apache.org/jira/browse/MAHOUT-284
Project: Mahout
Issue Type: Bug
Components: Clustering
Reporter: Pallavi Palleti
Priority: Minor
Attachments: Mahout-284.patch
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801716#action_12801716 ]

Pallavi Palleti commented on MAHOUT-153:

Hi all, I am ready with my patch. However, I was trying to see if there are any possible optimizations that can be made. I will share the patch and seek further optimization suggestions from the group. Should I open another JIRA issue, as David might be working on this one, or submit a patch to this JIRA issue? Kindly suggest.

Implement kmeans++ for initial cluster selection in kmeans
--
Key: MAHOUT-153
URL: https://issues.apache.org/jira/browse/MAHOUT-153
Project: Mahout
Issue Type: New Feature
Components: Clustering
Affects Versions: 0.2
Environment: OS Independent
Reporter: Panagiotis Papadimitriou
Fix For: 0.3
Original Estimate: 336h
Remaining Estimate: 336h
[jira] Commented: (MAHOUT-66) EuclideanDistanceMeasure and ManhattanDistanceMeasure classes are not optimized for Sparse Vectors
[ https://issues.apache.org/jira/browse/MAHOUT-66?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709260#action_12709260 ]

Pallavi Palleti commented on MAHOUT-66:
---
Can you please elaborate on this a little, as I couldn't follow it? Essentially, I modified these distance measure classes to use vector operations, thereby reusing code. Depending on the vector type in use, the respective class methods get called, so the optimizations are taken care of at the vector class level.

EuclideanDistanceMeasure and ManhattanDistanceMeasure classes are not optimized for Sparse Vectors
--
Key: MAHOUT-66
URL: https://issues.apache.org/jira/browse/MAHOUT-66
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Priority: Minor
Attachments: MAHOUT-66.patch, MAHOUT-66.patch, MAHOUT-66.patch, MAHOUT-66.patch
[jira] Commented: (MAHOUT-99) Improving speed of KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683335#action_12683335 ]

Pallavi Palleti commented on MAHOUT-99:
---
If we need to modify Canopy, we also need to modify all dependent classes wherever Canopy is used.

Improving speed of KMeans
-
Key: MAHOUT-99
URL: https://issues.apache.org/jira/browse/MAHOUT-99
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Fix For: 0.1
Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch

Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string. Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that it won't create a bug when the combiner runs zero times or more than once.
[jira] Updated: (MAHOUT-99) Improving speed of KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-99:
--
Attachment: MAHOUT-99.patch

I have fixed the SequenceFile issue and modified the code to use SequenceFile wherever possible. Also, with the new KMeansClusterMapper, we don't need the outputMapper code in Job.java in SyntheticControl, so I commented that out.

Thanks
Pallavi

Improving speed of KMeans
-
Key: MAHOUT-99
URL: https://issues.apache.org/jira/browse/MAHOUT-99
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Fix For: 0.1
Attachments: MAHOUT-99-1.patch, MAHOUT-99.patch, Mahout-99.patch, MAHOUT-99.patch
[jira] Commented: (MAHOUT-99) Improving speed of KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683297#action_12683297 ]

Pallavi Palleti commented on MAHOUT-99:
---
Yup. That must be the issue. But I am wondering how the test case succeeded?

Improving speed of KMeans
-
Key: MAHOUT-99
URL: https://issues.apache.org/jira/browse/MAHOUT-99
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Fix For: 0.1
Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch
[jira] Commented: (MAHOUT-99) Improving speed of KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 ]

Pallavi Palleti commented on MAHOUT-99:
---
I used KeyValueLineRecordReader internally for my code and forgot to revert to SequenceFileReader. Will it be sufficient to add another patch on the latest code that modifies only KMeansDriver to use SequenceFileReader? Kindly let me know.

Thanks
Pallavi

Improving speed of KMeans
-
Key: MAHOUT-99
URL: https://issues.apache.org/jira/browse/MAHOUT-99
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Fix For: 0.1
Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch
[jira] Updated: (MAHOUT-99) Improving speed of KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-99:
--
Attachment: MAHOUT-99.patch

This patch takes care of the speed issues. Also, the issue with the combiner running zero times or more than once has been taken care of.

Improving speed of KMeans
-
Key: MAHOUT-99
URL: https://issues.apache.org/jira/browse/MAHOUT-99
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Attachments: MAHOUT-99.patch
[jira] Updated: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
[ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-79:
--
Attachment: FUZZY-79.patch

I have made the code compatible with recent updates. Please review.

Thanks
Pallavi

Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
-
Key: MAHOUT-79
URL: https://issues.apache.org/jira/browse/MAHOUT-79
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Fix For: 0.1
Attachments: FUZZY-79.patch, FUZZY-79.patch, FUZZY-79.patch, FUZZY-79.patch, FUZZY-79.patch, FUZZY.patch

Improve the speed of fuzzy k-Means by passing only the cluster-id info as the key output of the mapper task and reading the cluster information in the reducer task, where this info is needed.
[jira] Updated: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
[ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-79:
--
Attachment: FUZZY-79.patch

Addressed srowen's concern about not using try{} catch{} for control flow in FuzzyKMeansReducer. Made the parameter order of addPoint() and addPoints() the same in SoftCluster.

Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
-
Key: MAHOUT-79
URL: https://issues.apache.org/jira/browse/MAHOUT-79
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Fix For: 0.1
Attachments: FUZZY-79.patch, FUZZY-79.patch, FUZZY.patch
[jira] Updated: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
[ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-79:
--
Attachment: FUZZY-79.patch

I have added a combiner to Fuzzy K-Means. This time, I am making sure the code accounts for the fact that a combiner can run zero or more times, so the respective conditions are added in both the combiner and the reducer.

Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
-
Key: MAHOUT-79
URL: https://issues.apache.org/jira/browse/MAHOUT-79
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Fix For: 0.1
Attachments: FUZZY-79.patch, FUZZY.patch
[jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
[ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640436#action_12640436 ]

Pallavi Palleti commented on MAHOUT-79:
---
Hi Grant, the latest patch takes care of Ted's concerns.

Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
-
Key: MAHOUT-79
URL: https://issues.apache.org/jira/browse/MAHOUT-79
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Fix For: 0.1
Attachments: FUZZY-79.patch, FUZZY.patch
[jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
[ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636615#action_12636615 ]

Pallavi Palleti commented on MAHOUT-79:
---
Please review the code and let me know if any changes need to be made.

Thanks
Pallavi

Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
-
Key: MAHOUT-79
URL: https://issues.apache.org/jira/browse/MAHOUT-79
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Attachments: FUZZY.patch
[jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
[ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636634#action_12636634 ]

Pallavi Palleti commented on MAHOUT-79:
---
If possible, it would be good to consider this for 0.1.

Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
-
Key: MAHOUT-79
URL: https://issues.apache.org/jira/browse/MAHOUT-79
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Attachments: FUZZY.patch
[jira] Updated: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
[ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-79:
--
Attachment: FUZZY.patch

There are three major changes in this implementation.

One is related to improving speed:
1. The existing implementation was passing the centroid information as a key to the next tasks (combiner and reducer). When the dimensionality is huge, passing this information as a key throws an out-of-memory error, as it is difficult to hold the whole data in memory. So the approach I have taken in this implementation is to send only the cluster-id as the key in mapper tasks. In the reducer phase, we read the cluster information in the configure method and access it through a map of id to SoftCluster object. As the cluster values do not change until a single iteration ends, we can optimize the code in this way and thereby improve speed. I have personally seen the run time drop from hours to minutes.

Two are related to bugs:
1. The combiner is removed, as there is no guarantee about how many times a combiner runs on a dataset. It may run zero to many times. If it runs more than once, it causes a big logical bug. So the combiner is removed in the new implementation.
2. There was a logical bug where I used multiplication in place of power in the previous implementation. I fixed it in this implementation.

NOTE: The above (combiner, improving speed) can be applicable to K-Means too, because:
1. K-Means does modify the data points in the combiner and, as per the Hadoop specifications, there is no guarantee that the combiner runs only once over a data point. So it may create a bug.
2. By passing only the cluster-id, we can improve the speed, as it reduces the amount of data transferred between map and reduce tasks. We can apply this idea of passing the cluster-id rather than the whole cluster wherever it is applicable in other Mahout implementations.

Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
-
Key: MAHOUT-79
URL: https://issues.apache.org/jira/browse/MAHOUT-79
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Pallavi Palleti
Attachments: FUZZY.patch
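The two points in the NOTE above (key by cluster-id only, and do not assume the combiner runs exactly once) can be illustrated with a small simulation. This is a Python sketch of one common pattern, not the code in FUZZY.patch: every value is a partial aggregate `(count, vector_sum)`, so merging is associative and the combiner may be applied zero, one or several times without changing the reducer's result. All function names are hypothetical, and plain k-means (hard assignment) is used to keep the example short.

```python
from collections import defaultdict


def map_point(point, centroids):
    """Emit (cluster_id, (count, vector_sum)); only the id travels as the key."""
    cid = min(centroids, key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centroids[c])))
    return cid, (1, tuple(point))


def merge(partials):
    """Merge partial aggregates; the output has the same shape as the input."""
    count, total = 0, None
    for n, s in partials:
        count += n
        total = tuple(s) if total is None else tuple(a + b for a, b in zip(total, s))
    return count, total


def combine(pairs):
    """Combiner: group by cluster id and merge. Safe to run any number of times."""
    grouped = defaultdict(list)
    for cid, partial in pairs:
        grouped[cid].append(partial)
    return [(cid, merge(ps)) for cid, ps in grouped.items()]


def reduce_centroids(pairs):
    """Reducer: merge whatever arrives, then divide the sum by the count."""
    return {cid: tuple(s / n for s in total) for cid, (n, total) in combine(pairs)}
```

Because the combiner's output type equals its input type, `reduce_centroids(mapped)`, `reduce_centroids(combine(mapped))` and `reduce_centroids(combine(combine(mapped)))` all produce the same centroids.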
[jira] Commented: (MAHOUT-77) DistanceMeasure calculation slow for SparseVector
[ https://issues.apache.org/jira/browse/MAHOUT-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632914#action_12632914 ]

Pallavi Palleti commented on MAHOUT-77:
---
Hi Allen,
It was suggested to use vector operations in addPoint and computeCentroid so that the code is simpler to understand. In the distance measure classes too, we can replace the code with Vector operations such as the plus, minus and dot methods. A detailed discussion is present in https://issues.apache.org/jira/browse/MAHOUT-66
Also, I have added plus and divide methods specific to sparse vectors. The patch that contains this is: https://issues.apache.org/jira/browse/MAHOUT-67

Thanks
Pallavi

DistanceMeasure calculation slow for SparseVector
-
Key: MAHOUT-77
URL: https://issues.apache.org/jira/browse/MAHOUT-77
Project: Mahout
Issue Type: Improvement
Components: Matrix
Reporter: Allen Day
Priority: Minor
Fix For: 0.2
Attachments: sparse.patch, sparse.patch

ManhattanDistanceMeasure and TanimotoDistanceMeasure assume all vector indices up to cardinality() must be compared. We can speed this up for SparseVectors (and others) because Vector implements Iterable, so we can consider only non-zero indices.
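The speed-up described in this issue, visiting only non-zero indices instead of every index up to the cardinality, can be sketched as below. This is an illustration in Python, not the code from sparse.patch; sparse vectors are modelled as plain dicts from index to non-zero value, which is an assumption of the example.

```python
def manhattan_sparse(a, b):
    """Manhattan distance between two sparse vectors stored as {index: value} dicts."""
    total = 0.0
    # Indices present in a (b contributes 0.0 where it has no entry).
    for i, x in a.items():
        total += abs(x - b.get(i, 0.0))
    # Indices present only in b.
    for i, y in b.items():
        if i not in a:
            total += abs(y)
    return total


def manhattan_dense(a, b):
    """Reference version that compares every index of two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

The sparse version does work proportional to the number of stored entries, so its cost is independent of the declared cardinality.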
[jira] Commented: (MAHOUT-74) Fuzzy K-Means clustering
[ https://issues.apache.org/jira/browse/MAHOUT-74?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623252#action_12623252 ]

Pallavi Palleti commented on MAHOUT-74:
---
Hi Grant, urlCount is an unnecessary variable. It got added by mistake. SoftCluster.m should be configurable. I am sorry, I forgot to modify it.

Fuzzy K-Means clustering
-
Key: MAHOUT-74
URL: https://issues.apache.org/jira/browse/MAHOUT-74
Project: Mahout
Issue Type: New Feature
Components: Clustering
Reporter: Pallavi Palleti
Assignee: Grant Ingersoll
Attachments: mahout-74.patch, mahout-74.patch

The Fuzzy K-Means clustering algorithm is an extension of the traditional K-Means clustering algorithm and performs soft clustering. More details about fuzzy k-means can be found here: http://en.wikipedia.org/wiki/Data_clustering#Fuzzy_c-means_clustering
I have implemented a fuzzy K-Means prototype and tests in org.apache.mahout.clustering.fuzzykmeans
[jira] Updated: (MAHOUT-74) Fuzzy K-Means clustering
[ https://issues.apache.org/jira/browse/MAHOUT-74?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-74:
--
Attachment: mahout-74.patch

I have implemented a Fuzzy K-Means prototype and tests. Please review the code.

Fuzzy K-Means clustering
-
Key: MAHOUT-74
URL: https://issues.apache.org/jira/browse/MAHOUT-74
Project: Mahout
Issue Type: New Feature
Reporter: Pallavi Palleti
Attachments: mahout-74.patch
[jira] Created: (MAHOUT-74) Fuzzy K-Means clustering
Fuzzy K-Means clustering
-
Key: MAHOUT-74
URL: https://issues.apache.org/jira/browse/MAHOUT-74
Project: Mahout
Issue Type: New Feature
Reporter: Pallavi Palleti
Attachments: mahout-74.patch

The Fuzzy K-Means clustering algorithm is an extension of the traditional K-Means clustering algorithm and performs soft clustering. More details about fuzzy k-means can be found here: http://en.wikipedia.org/wiki/Data_clustering#Fuzzy_c-means_clustering
I have implemented a fuzzy K-Means prototype and tests in org.apache.mahout.clustering.fuzzykmeans
[jira] Updated: (MAHOUT-67) plus method and divide method in AbstractVector doesn't work for SparseVectors
[ https://issues.apache.org/jira/browse/MAHOUT-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-67:
--
Attachment: MAHOUT-67.patch

I have added unit tests to show that the current implementation breaks for the sparse vector plus, dot and divide operations, and that my version works.

plus method and divide method in AbstractVector doesn't work for SparseVectors
--
Key: MAHOUT-67
URL: https://issues.apache.org/jira/browse/MAHOUT-67
Project: Mahout
Issue Type: Bug
Components: Matrix
Reporter: Pallavi Palleti
Priority: Minor
Attachments: MAHOUT-67.patch, MAHOUT-67.patch, MAHOUT-67.patch
[jira] Updated: (MAHOUT-67) plus method and divide method in AbstractVector can be optimized for SparseVectors
[ https://issues.apache.org/jira/browse/MAHOUT-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-67: -- Summary: plus method and divide method in AbstractVector can be optimized for SparseVectors (was: plus method and divide method in AbstractVector doesn't work for SparseVectors) I misunderstood the sparse vector representation. I agree that the cardinality exception should be thrown if the cardinalities of two sparse vectors are not the same. However, my implementation still holds and optimizes the divide and plus operations over sparse vectors. plus method and divide method in AbstractVector can be optimized for SparseVectors -- Key: MAHOUT-67 URL: https://issues.apache.org/jira/browse/MAHOUT-67 Project: Mahout Issue Type: Bug Components: Matrix Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-67.patch, MAHOUT-67.patch, MAHOUT-67.patch
[jira] Updated: (MAHOUT-67) plus method and divide method in AbstractVector can be optimized for SparseVectors
[ https://issues.apache.org/jira/browse/MAHOUT-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-67: -- Attachment: MAHOUT-67.patch I have restored the cardinality check in the dot method, added a cardinality check to the plus method, and updated the unit tests accordingly. As this code is an optimization of the existing code, the previous unit tests still hold. plus method and divide method in AbstractVector can be optimized for SparseVectors -- Key: MAHOUT-67 URL: https://issues.apache.org/jira/browse/MAHOUT-67 Project: Mahout Issue Type: Bug Components: Matrix Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-67.patch, MAHOUT-67.patch, MAHOUT-67.patch, MAHOUT-67.patch
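The idea discussed in this thread, a plus and divide that check cardinality but touch only the non-zero entries, can be sketched roughly as follows. This is not the code in MAHOUT-67.patch: the class SparseVectorSketch, its map-based storage, and the use of IllegalArgumentException in place of Mahout's cardinality exception are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only; hypothetical class, not Mahout's SparseVector.
 * Stores non-zero entries in a map keyed by index.
 */
public final class SparseVectorSketch {

    private final int cardinality;
    private final Map<Integer, Double> values = new HashMap<>();

    public SparseVectorSketch(int cardinality) {
        this.cardinality = cardinality;
    }

    public int cardinality() {
        return cardinality;
    }

    public void set(int index, double value) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (value == 0.0) {
            values.remove(index); // keep only non-zero entries
        } else {
            values.put(index, value);
        }
    }

    public double get(int index) {
        Double v = values.get(index);
        return v == null ? 0.0 : v;
    }

    /** Adds two vectors, visiting only the stored entries of each. */
    public SparseVectorSketch plus(SparseVectorSketch other) {
        // Stand-in for the cardinality check restored in the patch.
        if (cardinality != other.cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        SparseVectorSketch result = new SparseVectorSketch(cardinality);
        result.values.putAll(values);
        for (Map.Entry<Integer, Double> e : other.values.entrySet()) {
            result.set(e.getKey(), result.get(e.getKey()) + e.getValue());
        }
        return result;
    }

    /** Divides every stored entry by x; zero entries need no work. */
    public SparseVectorSketch divide(double x) {
        SparseVectorSketch result = new SparseVectorSketch(cardinality);
        for (Map.Entry<Integer, Double> e : values.entrySet()) {
            result.set(e.getKey(), e.getValue() / x);
        }
        return result;
    }
}
```

The cost of both operations depends on the number of stored entries rather than on the cardinality, which is the kind of saving the optimization is after.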
[jira] Updated: (MAHOUT-66) EuclideanDistanceMeasure and ManhattanDistanceMeasure classes are not optimized for Sparse Vectors
[ https://issues.apache.org/jira/browse/MAHOUT-66?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-66: -- Attachment: MAHOUT-66.patch As this is not a bug but an improvement, the existing unit tests hold here. I have added unit tests for the Manhattan and Euclidean distance measures. Please review the code. Thanks Pallavi EuclideanDistanceMeasure and ManhattanDistanceMeasure classes are not optimized for Sparse Vectors -- Key: MAHOUT-66 URL: https://issues.apache.org/jira/browse/MAHOUT-66 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-66.patch, MAHOUT-66.patch, MAHOUT-66.patch, MAHOUT-66.patch
[jira] Updated: (MAHOUT-68) addPoint, computeCentroid can be represented with Vector operations
[ https://issues.apache.org/jira/browse/MAHOUT-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-68: -- Issue Type: Improvement (was: Bug) Summary: addPoint, computeCentroid can be represented with Vector operations (was: addPoint, computeCentroid does not work for SparseVectors) This way, we can hide the implementation and optimize the code at the vector level. addPoint, computeCentroid can be represented with Vector operations --- Key: MAHOUT-68 URL: https://issues.apache.org/jira/browse/MAHOUT-68 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-68.patch
[jira] Commented: (MAHOUT-68) addPoint, computeCentroid can use vector operators to do the respective task
[ https://issues.apache.org/jira/browse/MAHOUT-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620211#action_12620211 ] Pallavi Palleti commented on MAHOUT-68: --- All of this arose from my confusion about cardinality in SparseVector. So this is not a bug, and the existing unit tests hold even after the changes. addPoint, computeCentroid can use vector operators to do the respective task Key: MAHOUT-68 URL: https://issues.apache.org/jira/browse/MAHOUT-68 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-68.patch
[jira] Created: (MAHOUT-73) This is still an improvement over existing implementations.
This is still an improvement over existing implementations. --- Key: MAHOUT-73 URL: https://issues.apache.org/jira/browse/MAHOUT-73 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Priority: Minor
[jira] Updated: (MAHOUT-73) addPoint, computeCentroid can be optimized by using vector operators
[ https://issues.apache.org/jira/browse/MAHOUT-73?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-73: -- Summary: addPoint, computeCentroid can be optimized by using vector operators (was: This is still an improvement over existing implementations.) addPoint, computeCentroid can be optimized by using vector operators Key: MAHOUT-73 URL: https://issues.apache.org/jira/browse/MAHOUT-73 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Priority: Minor
[jira] Updated: (MAHOUT-73) addPoint, computeCentroid can be optimized by using vector operators
[ https://issues.apache.org/jira/browse/MAHOUT-73?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-73: -- Attachment: MAHOUT-73.patch We can use Vector operators in the addPoint and computeCentroid methods of the Canopy class. Thereby, these methods can be optimized for sparse vectors. Also, by using vector operators, we hide the implementation details and reuse the existing code in the SparseVector and DenseVector classes. addPoint, computeCentroid can be optimized by using vector operators Key: MAHOUT-73 URL: https://issues.apache.org/jira/browse/MAHOUT-73 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-73.patch
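As a rough illustration of the proposal, addPoint and computeCentroid can be written purely in terms of plus and divide, so that the clustering code never loops over vector elements itself. The sketch below is not the code in MAHOUT-73.patch: CanopySketch, the Vec interface and DenseVec are hypothetical stand-ins, not Mahout's Canopy, Vector or DenseVector.

```java
/**
 * Illustrative sketch only; hypothetical names, not Mahout's Canopy class.
 * The canopy keeps a running total of its points and relies on the
 * vector's own plus and divide operations.
 */
public final class CanopySketch {

    /** Hypothetical minimal vector abstraction, assumed for this sketch. */
    public interface Vec {
        Vec plus(Vec other);

        Vec divide(double x);

        double get(int index);
    }

    /** Simple dense implementation; a sparse one could be swapped in. */
    public static final class DenseVec implements Vec {
        private final double[] values;

        public DenseVec(double... values) {
            this.values = values.clone();
        }

        @Override
        public Vec plus(Vec other) {
            // Assumes both vectors have the same cardinality.
            double[] result = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                result[i] = values[i] + other.get(i);
            }
            return new DenseVec(result);
        }

        @Override
        public Vec divide(double x) {
            double[] result = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                result[i] = values[i] / x;
            }
            return new DenseVec(result);
        }

        @Override
        public double get(int index) {
            return values[index];
        }
    }

    private Vec pointTotal;
    private int numPoints;

    public CanopySketch(Vec firstPoint) {
        this.pointTotal = firstPoint;
        this.numPoints = 1;
    }

    /** Adds a point using only the vector's plus operation. */
    public void addPoint(Vec point) {
        pointTotal = pointTotal.plus(point);
        numPoints++;
    }

    /** The centroid is the running total divided by the point count. */
    public Vec computeCentroid() {
        return pointTotal.divide(numPoints);
    }
}
```

Because the canopy code depends only on the vector operations, any speed-up made inside a sparse implementation of plus and divide is picked up without changing addPoint or computeCentroid.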
[jira] Commented: (MAHOUT-67) plus method in AbstractVector doesn't work for SparseVectors
[ https://issues.apache.org/jira/browse/MAHOUT-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611906#action_12611906 ] Pallavi Palleti commented on MAHOUT-67: --- Isabel: Sure, I will make those changes. Karl: I couldn't follow your point. Can you please elaborate? plus method in AbstractVector doesn't work for SparseVectors Key: MAHOUT-67 URL: https://issues.apache.org/jira/browse/MAHOUT-67 Project: Mahout Issue Type: Bug Components: Matrix Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-67.patch
[jira] Updated: (MAHOUT-67) plus method and divide method in AbstractVector doesn't work for SparseVectors
[ https://issues.apache.org/jira/browse/MAHOUT-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-67: -- Attachment: MAHOUT-67.patch Changed to Java 5 notation style. Also added a divide method for sparse vectors. plus method and divide method in AbstractVector doesn't work for SparseVectors -- Key: MAHOUT-67 URL: https://issues.apache.org/jira/browse/MAHOUT-67 Project: Mahout Issue Type: Bug Components: Matrix Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-67.patch, MAHOUT-67.patch
[jira] Updated: (MAHOUT-66) EuclideanDistanceMeasure and ManhattanDistanceMeasure classes does not compute distance for Sparse Vectors
[ https://issues.apache.org/jira/browse/MAHOUT-66?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-66: -- Attachment: MAHOUT-66.patch I have refactored the code as per Isabel's instructions. I have overridden the minus method in SparseVector, and there is now only one distance(Vector v1, Vector v2) method in both EuclideanDistanceMeasure and ManhattanDistanceMeasure. Please review the code. Thanks Pallavi EuclideanDistanceMeasure and ManhattanDistanceMeasure classes does not compute distance for Sparse Vectors -- Key: MAHOUT-66 URL: https://issues.apache.org/jira/browse/MAHOUT-66 Project: Mahout Issue Type: Bug Components: Clustering Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-66.patch, MAHOUT-66.patch
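To make the sparse-vector case concrete, here is a minimal sketch of Manhattan and Euclidean distances that visit only the indices at which at least one vector is non-zero. It is not the code in MAHOUT-66.patch, which works through an overridden minus method in SparseVector: the class SparseDistanceSketch and the map-based vector representation are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative sketch only; hypothetical class, not Mahout's distance measures.
 * Sparse vectors are represented as maps from index to non-zero value.
 */
public final class SparseDistanceSketch {

    private SparseDistanceSketch() {
    }

    /** Indices where at least one of the two vectors has a stored value. */
    private static Set<Integer> unionOfIndices(Map<Integer, Double> a,
                                               Map<Integer, Double> b) {
        Set<Integer> indices = new HashSet<>(a.keySet());
        indices.addAll(b.keySet());
        return indices;
    }

    /** Sum of absolute coordinate differences. */
    public static double manhattan(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (int i : unionOfIndices(a, b)) {
            sum += Math.abs(a.getOrDefault(i, 0.0) - b.getOrDefault(i, 0.0));
        }
        return sum;
    }

    /** Square root of the sum of squared coordinate differences. */
    public static double euclidean(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (int i : unionOfIndices(a, b)) {
            double diff = a.getOrDefault(i, 0.0) - b.getOrDefault(i, 0.0);
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

Indices where both vectors are zero contribute nothing to either distance, so skipping them leaves the result unchanged while avoiding a pass over the full cardinality.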
[jira] Created: (MAHOUT-67) plus method in AbstractVector doesn't work for SparseVectors
plus method in AbstractVector doesn't work for SparseVectors Key: MAHOUT-67 URL: https://issues.apache.org/jira/browse/MAHOUT-67 Project: Mahout Issue Type: Bug Components: Matrix Reporter: Pallavi Palleti Priority: Minor
[jira] Updated: (MAHOUT-67) plus method in AbstractVector doesn't work for SparseVectors
[ https://issues.apache.org/jira/browse/MAHOUT-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-67: -- Attachment: MAHOUT-67.patch I have overridden the plus() method in SparseVector. In order to do this, I needed to add a method called addCardinality(). Also, I found that the dot() method shouldn't throw CardinalityException, so I removed that condition. Please review the code. Thanks Pallavi plus method in AbstractVector doesn't work for SparseVectors Key: MAHOUT-67 URL: https://issues.apache.org/jira/browse/MAHOUT-67 Project: Mahout Issue Type: Bug Components: Matrix Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-67.patch
[jira] Created: (MAHOUT-66) EuclideanDistanceMeasure and ManhattanDistanceMeasure classes does not compute distance for Sparse Vectors
EuclideanDistanceMeasure and ManhattanDistanceMeasure classes does not compute distance for Sparse Vectors -- Key: MAHOUT-66 URL: https://issues.apache.org/jira/browse/MAHOUT-66 Project: Mahout Issue Type: Bug Components: Clustering Reporter: Pallavi Palleti Priority: Minor
[jira] Updated: (MAHOUT-66) EuclideanDistanceMeasure and ManhattanDistanceMeasure classes does not compute distance for Sparse Vectors
[ https://issues.apache.org/jira/browse/MAHOUT-66?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pallavi Palleti updated MAHOUT-66: -- Attachment: MAHOUT-66.patch I added a condition to the actual distance method so that, if the vectors are sparse, a different distance method specific to SparseVector is called. EuclideanDistanceMeasure and ManhattanDistanceMeasure classes does not compute distance for Sparse Vectors -- Key: MAHOUT-66 URL: https://issues.apache.org/jira/browse/MAHOUT-66 Project: Mahout Issue Type: Bug Components: Clustering Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-66.patch