[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840574#action_12840574 ] Rohini Uppuluri commented on MAHOUT-153: Hi, Please find a brief description on input and output below: Hope this helps: -- Input Format: documentId\tdocument vector Example line: 338 [s1682, 275:5.0, 478:3.0, 479:5.0, 1:3.0, 474:4.0, 143:2.0, 197:5.0, 196:2.0, 286:4.0, 135:5.0, 86:4.0, 216:4.0, 83:2.0, 213:5.0, 215:3.0, 208:3.0, 269:4.0, 517:5.0, 169:5.0, 654:5.0, 443:5.0, 990:4.0, 175:4.0, 513:5.0, 514:5.0, 650:5.0, 525:4.0, 1124:4.0, 382:5.0, 708:5.0, 497:3.0, 498:4.0, 523:3.0, 427:4.0, 488:5.0, 490:5.0, 189:4.0, 52:5.0, 301:4.0, 607:4.0, 180:4.0, ] Output Format: ClusterIdentifier\tClusterIdentifier: clusterCenterVector Example line: C0 C0: [s1682, 275:3.0, 1:4.0, 273:5.0, 272:2.0, 3:1.0, 546:4.0, 277:3.0, 276:3.0, 7:5.0, 283:3.0, 282:4.0, 9:1.0, 281:4.0, 12:5.0, 1089:2.0, 13:1.0, 286:1.0, 14:1.0, 15:2.0, 284:3.0, 258:4.0, 17:3.0, 257:5.0, 23:5.0, 25:2.0, 264:4.0, 270:5.0, 271:3.0, 31:5.0, 305:1.0, 1405:3.0, 307:4.0, 39:5.0, 311:3.0, 310:3.0, 515:5.0, 313:5.0, 525:5.0, 315:3.0, 316:4.0, 288:3.0, 50:5.0, 532:4.0, 291:5.0, 292:4.0, 55:5.0, 293:4.0, 294:3.0, 295:5.0, 298:4.0, 56:5.0, 300:5.0, 539:2.0, 302:4.0, 343:4.0, 882:4.0, 340:1.0, 887:4.0, 1025:4.0, 619:3.0, 79:5.0, 347:2.0, 346:4.0, 345:1.0, 344:1.0, 326:4.0, 327:3.0, 1051:4.0, 322:4.0, 323:4.0, 628:2.0, 333:4.0, 331:4.0, 1047:4.0, 328:4.0, 636:4.0, 100:5.0, 98:5.0, 581:4.0, 370:3.0, 591:3.0, 118:5.0, 595:4.0, 117:4.0, 358:2.0, 597:4.0, 127:5.0, 1073:4.0, 603:5.0, 121:5.0, 683:4.0, 413:3.0, 678:4.0, 950:4.0, 405:4.0, 156:5.0, 696:4.0, 1244:4.0, 147:5.0, 690:3.0, 928:3.0, 151:1.0, 924:3.0, 443:4.0, 654:5.0, 925:2.0, 649:4.0, 164:5.0, 642:4.0, 185:5.0, 431:5.0, 905:4.0, 1278:4.0, 176:4.0, 183:5.0, 657:5.0, 898:1.0, 181:4.0, 659:4.0, 1016:4.0, 477:1.0, 751:4.0, 475:4.0, 750:4.0, 203:5.0, 472:2.0, 748:3.0, 471:5.0, 1011:2.0, 466:5.0, 742:5.0, 
1013:3.0, 1014:4.0, 762:4.0, 222:5.0, 760:3.0, 460:4.0, 458:3.0, 218:4.0, 237:4.0, 235:3.0, 504:5.0, 717:4.0, 234:4.0, 991:1.0, 233:5.0, 978:2.0, 229:5.0, 226:5.0, 254:1.0, 255:4.0, 252:3.0, 250:5.0, 248:4.0, 245:4.0, ] > Implement kmeans++ for initial cluster selection in kmeans > -- > > Key: MAHOUT-153 > URL: https://issues.apache.org/jira/browse/MAHOUT-153 > Project: Mahout > Issue Type: New Feature > Components: Clustering >Affects Versions: 0.2 > Environment: OS Independent >Reporter: Panagiotis Papadimitriou >Assignee: Ted Dunning > Fix For: 0.4 > > Attachments: Mahout-153.patch, MAHOUT-153_RandomFarthest.patch > > Original Estimate: 336h > Remaining Estimate: 336h > > The current implementation of k-means includes the following algorithms for > initial cluster selection (seed selection): 1) random selection of k points, > 2) use of canopy clusters. > I plan to implement k-means++. The details of the algorithm are available > here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf. > Design Outline: I will create an abstract class SeedGenerator and a subclass > KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will > become a subclass of SeedGenerator. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
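The input and output lines described in the comment above could be read with something like the sketch below. It is not taken from either attached patch. It assumes that the leading `s1682` field marks a sparse vector of cardinality 1682, that the remaining fields are `index:value` pairs, and that the trailing comma before `]` is always present or harmless. The function names are hypothetical.

```python
def parse_vector(text):
    """Parse '[s1682, 275:5.0, 478:3.0, ]' into (cardinality, {index: value}).

    Assumption: 's<N>' is a sparse-vector marker carrying the cardinality N.
    """
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("vector must be enclosed in square brackets")
    fields = [f.strip() for f in body[1:-1].split(",")]
    fields = [f for f in fields if f]  # drop the empty field left by the trailing comma
    if not fields or not fields[0].startswith("s"):
        raise ValueError("expected a leading 's<cardinality>' field")
    cardinality = int(fields[0][1:])
    entries = {}
    for field in fields[1:]:
        index, value = field.split(":")
        entries[int(index)] = float(value)
    return cardinality, entries


def parse_input_line(line):
    """Input format: documentId <TAB> document vector."""
    doc_id, vector_text = line.rstrip("\n").split("\t", 1)
    cardinality, entries = parse_vector(vector_text)
    return doc_id, cardinality, entries


def parse_output_line(line):
    """Output format: ClusterIdentifier <TAB> ClusterIdentifier: clusterCenterVector."""
    cluster_id, rest = line.rstrip("\n").split("\t", 1)
    _label, vector_text = rest.split(":", 1)  # the label repeats the cluster identifier
    cardinality, entries = parse_vector(vector_text)
    return cluster_id, cardinality, entries
```

For example, `parse_input_line("338\t[s1682, 275:5.0, 478:3.0, ]")` yields the document id `"338"`, the cardinality `1682`, and the two index/value pairs.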
[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohini Uppuluri updated MAHOUT-153: --- Comment: was deleted (was: Hi, This is the patch for creating random farthest cluster initialization. This does not have junit test cases yet. Thanks, -Rohini)
[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohini Uppuluri updated MAHOUT-153: --- Attachment: MAHOUT-153_RandomFarthest.patch Hi, This is the patch for creating random farthest cluster initialization. This does not have junit test cases yet. Thanks, -Rohini
[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohini Uppuluri updated MAHOUT-153: --- Attachment: MAHOUT-153_RandomFarthest.patch Hi, This is the patch for creating random farthest cluster initialization as I have discussed before. Kindly note that this does not have junit test cases yet. Thanks, -Rohini
[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohini Uppuluri updated MAHOUT-153: --- Attachment: (was: MAHOUT-153_RandomFarthest.patch)
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832380#action_12832380 ] Rohini Uppuluri commented on MAHOUT-153:

Hi all,

I have implemented an extension to the algorithm Pallavi mentioned. The extension adds some randomness to picking the farthest point. This gives a chance of overcoming the problem of picking, as centroids, noise points that lie very far away.

Summary:
1. Pick the first centroid randomly.
2. For each of the remaining centroids:
   -> Compute a few candidate centroids that are far off. Candidate centroid computation: divide the data into a few parts; for each part, compute the point that is farthest from the current list of centroids.
   -> Select one of the candidate centroids randomly.

I will soon submit a patch for this. Please let me know your feedback.
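The summary in the comment above can be sketched as follows. This is a minimal single-machine illustration, not the code in MAHOUT-153_RandomFarthest.patch. It assumes that "farthest from the current list of centroids" means the point whose squared Euclidean distance to its nearest centroid is largest, and that the data is divided into parts by simple striding. The names `random_farthest_seeds` and `num_parts` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two dense points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distance_to_nearest(point, centroids):
    """Distance from a point to the closest centroid chosen so far."""
    return min(squared_distance(point, c) for c in centroids)


def random_farthest_seeds(points, k, num_parts=4, rng=None):
    """Pick k seeds: the first at random, each later one at random among
    the per-part farthest points (assumed interpretation of the comment)."""
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: the data is split into parts by striding over the list.
    parts = [points[i::num_parts] for i in range(num_parts)]
    parts = [part for part in parts if part]
    centroids = [rng.choice(points)]  # step 1: first centroid at random
    while len(centroids) < k:         # step 2: remaining centroids
        candidates = [
            max(part, key=lambda p: distance_to_nearest(p, centroids))
            for part in parts
        ]
        centroids.append(rng.choice(candidates))
    return centroids
```

Because each new seed is drawn at random from several far-off candidates instead of being the single farthest point, one extreme outlier is selected only some of the time. In degenerate inputs, such as many identical points, this sketch can return duplicate seeds.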
[jira] Updated: (MAHOUT-99) Improving speed of KMeans
[ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohini Uppuluri updated MAHOUT-99: -- Attachment: MAHOUT-99-1.patch Hi Grant, I have set them as optional arguments, with reasonable defaults in case they are not given as input. I will be uploading the updated patch reflecting the change. It is a config setting already set up in Hadoop, but it gives us the flexibility to change it in case we want to increase the number of map tasks. Thanks, -Rohini > Improving speed of KMeans > - > > Key: MAHOUT-99 > URL: https://issues.apache.org/jira/browse/MAHOUT-99 > Project: Mahout > Issue Type: Improvement > Components: Clustering >Reporter: Pallavi Palleti >Assignee: Grant Ingersoll > Attachments: MAHOUT-99-1.patch, MAHOUT-99.patch > > > Improved the speed of KMeans by passing only the cluster ID from mapper to > reducer. Previously, the whole cluster info was being sent as a formatted string. > Also removed the implicit assumption that the combiner runs exactly once; > the code is modified accordingly so that it won't create a bug when the combiner > runs zero times or more than once.
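The issue description above names two ideas: emit only the cluster ID as the map output key, and make the job correct however many times the combiner runs. A small sketch of one way to get the second property is shown below. It is not Mahout's code and does not use the Hadoop API; it assumes the map output value is a partial `(count, vector sum)` pair, so the combiner's input and output have the same shape. All function names are hypothetical.

```python
from collections import defaultdict


def nearest_cluster_id(point, centers):
    """Return the ID of the center closest to the point (squared Euclidean)."""
    return min(
        centers,
        key=lambda cid: sum((x - y) ** 2 for x, y in zip(point, centers[cid])),
    )


def map_point(point, centers):
    """Emit only the cluster ID as key; the value is a partial (count, sum)."""
    return nearest_cluster_id(point, centers), (1, tuple(point))


def merge(partials):
    """Add up partial (count, sum) pairs; the result is again a partial."""
    total_count, total_sum = 0, None
    for count, vec in partials:
        total_count += count
        total_sum = vec if total_sum is None else tuple(
            a + b for a, b in zip(total_sum, vec)
        )
    return total_count, total_sum


def combine(pairs):
    """Combiner: same input and output shape, so it may run any number of times."""
    grouped = defaultdict(list)
    for cid, partial in pairs:
        grouped[cid].append(partial)
    return [(cid, merge(parts)) for cid, parts in grouped.items()]


def reduce_centroids(pairs):
    """Reducer: merge whatever partials arrive, then divide sum by count."""
    return {
        cid: tuple(s / count for s in total)
        for cid, (count, total) in combine(pairs)
    }
```

Feeding `reduce_centroids` the raw map output, the output of one `combine` pass, or the output of several nested `combine` passes gives the same centroids, because merging partial sums is associative and the division happens only in the reducer.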
[jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
[ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654375#action_12654375 ] Rohini Uppuluri commented on MAHOUT-79: --- Hi Grant, Regarding FuzzyKmeansUtil, there doesn't seem to be any change. > Improving the speed of Fuzzy K-Means by optimizing data transfer between map > and reduce tasks > - > > Key: MAHOUT-79 > URL: https://issues.apache.org/jira/browse/MAHOUT-79 > Project: Mahout > Issue Type: Improvement > Components: Clustering >Reporter: Pallavi Palleti >Assignee: Grant Ingersoll > Fix For: 0.1 > > Attachments: FUZZY-79.patch, FUZZY-79.patch, FUZZY-79.patch, > FUZZY-79.patch, FUZZY-79.patch, FUZZY.patch > > > Improve the speed of fuzzy k-Means by passing only the cluster-id info as key > output of mapper task and reading the cluster information in reducer task > where this info is needed.
[jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks
[ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654374#action_12654374 ] Rohini Uppuluri commented on MAHOUT-79: --- Hi Grant, I am Rohini and I work in the same team as Pallavi. Pallavi is out of the office until the end of this month, so I will be taking care of this issue now. I am not sure why the TestKMeansClusterer tests are all commented out. However, I tried uncommenting them and running the unit tests, and they seem to run fine. I suppose that solves the issue? Please correct me if I am wrong.