[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-03-03 Thread Rohini Uppuluri (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840574#action_12840574
 ] 

Rohini Uppuluri commented on MAHOUT-153:



Hi,

Please find a brief description of the input and output formats below.
I hope this helps.

--

Input Format:
documentId\tdocument vector


Example line (a single line, wrapped here):

338  [s1682, 275:5.0, 478:3.0, 479:5.0, 1:3.0, 474:4.0, 143:2.0,
197:5.0, 196:2.0, 286:4.0, 135:5.0,
86:4.0, 216:4.0, 83:2.0, 213:5.0, 215:3.0, 208:3.0, 269:4.0, 517:5.0, 169:5.0,
654:5.0, 443:5.0,
990:4.0, 175:4.0, 513:5.0, 514:5.0, 650:5.0, 525:4.0, 1124:4.0, 382:5.0,
708:5.0, 497:3.0, 498:4.0,
523:3.0, 427:4.0, 488:5.0, 490:5.0, 189:4.0, 52:5.0, 301:4.0, 607:4.0, 180:4.0,
]


Output Format:
ClusterIdentifier\tClusterIdentifier: clusterCenterVector

Example line (a single line, wrapped here):

C0  C0: [s1682, 275:3.0, 1:4.0, 273:5.0, 272:2.0, 3:1.0, 546:4.0, 277:3.0,
276:3.0, 7:5.0, 283:3.0,
282:4.0, 9:1.0, 281:4.0, 12:5.0, 1089:2.0, 13:1.0, 286:1.0, 14:1.0, 15:2.0,
284:3.0, 258:4.0, 17:3.0,
257:5.0, 23:5.0, 25:2.0, 264:4.0, 270:5.0, 271:3.0, 31:5.0, 305:1.0, 1405:3.0,
307:4.0, 39:5.0,
311:3.0, 310:3.0, 515:5.0, 313:5.0, 525:5.0, 315:3.0, 316:4.0, 288:3.0, 50:5.0,
532:4.0, 291:5.0,
292:4.0, 55:5.0, 293:4.0, 294:3.0, 295:5.0, 298:4.0, 56:5.0, 300:5.0, 539:2.0,
302:4.0, 343:4.0,
882:4.0, 340:1.0, 887:4.0, 1025:4.0, 619:3.0, 79:5.0, 347:2.0, 346:4.0,
345:1.0, 344:1.0, 326:4.0,
327:3.0, 1051:4.0, 322:4.0, 323:4.0, 628:2.0, 333:4.0, 331:4.0, 1047:4.0,
328:4.0, 636:4.0, 100:5.0,
98:5.0, 581:4.0, 370:3.0, 591:3.0, 118:5.0, 595:4.0, 117:4.0, 358:2.0, 597:4.0,
127:5.0, 1073:4.0,
603:5.0, 121:5.0, 683:4.0, 413:3.0, 678:4.0, 950:4.0, 405:4.0, 156:5.0,
696:4.0, 1244:4.0, 147:5.0,
690:3.0, 928:3.0, 151:1.0, 924:3.0, 443:4.0, 654:5.0, 925:2.0, 649:4.0,
164:5.0, 642:4.0, 185:5.0,
431:5.0, 905:4.0, 1278:4.0, 176:4.0, 183:5.0, 657:5.0, 898:1.0, 181:4.0,
659:4.0, 1016:4.0, 477:1.0,
751:4.0, 475:4.0, 750:4.0, 203:5.0, 472:2.0, 748:3.0, 471:5.0, 1011:2.0,
466:5.0, 742:5.0, 1013:3.0,
1014:4.0, 762:4.0, 222:5.0, 760:3.0, 460:4.0, 458:3.0, 218:4.0, 237:4.0,
235:3.0, 504:5.0, 717:4.0,
234:4.0, 991:1.0, 233:5.0, 978:2.0, 229:5.0, 226:5.0, 254:1.0, 255:4.0,
252:3.0, 250:5.0, 248:4.0,
245:4.0, ]
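
For readers who want to consume these files, the two line formats above could be parsed roughly as follows. This is only a sketch in Python, not code from the patch. It assumes the fields are separated by a real tab character (the archive shows spaces), and that the leading `s1682` token marks a sparse vector with 1682 dimensions; the function names are made up for illustration.

```python
from typing import Dict, Tuple


def parse_vector(text: str) -> Tuple[int, Dict[int, float]]:
    """Parse '[s1682, 275:5.0, 1:3.0, ]' into (cardinality, {index: value})."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("vector must be enclosed in square brackets")
    # Drop empty tokens so the trailing ', ]' seen in the examples is accepted.
    tokens = [t.strip() for t in body[1:-1].split(",") if t.strip()]
    if not tokens or not tokens[0].startswith("s"):
        raise ValueError("expected a leading 's<cardinality>' token")
    cardinality = int(tokens[0][1:])  # assumption: 's1682' means 1682 dimensions
    entries: Dict[int, float] = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        entries[int(index)] = float(value)
    return cardinality, entries


def parse_input_line(line: str) -> Tuple[str, int, Dict[int, float]]:
    """Parse 'documentId<TAB>document vector'."""
    doc_id, vector_text = line.rstrip("\n").split("\t", 1)
    return (doc_id, *parse_vector(vector_text))


def parse_output_line(line: str) -> Tuple[str, int, Dict[int, float]]:
    """Parse 'ClusterIdentifier<TAB>ClusterIdentifier: clusterCenterVector'."""
    cluster_id, rest = line.rstrip("\n").split("\t", 1)
    # Split on the first colon only; later colons belong to index:value pairs.
    label, vector_text = rest.split(":", 1)
    if label.strip() != cluster_id:
        raise ValueError("cluster identifier mismatch")
    return (cluster_id, *parse_vector(vector_text))
```

Each parser returns the identifier, the assumed cardinality, and a dictionary from feature index to value.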





> Implement kmeans++ for initial cluster selection in kmeans
> --
>
> Key: MAHOUT-153
> URL: https://issues.apache.org/jira/browse/MAHOUT-153
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.2
> Environment: OS Independent
> Reporter: Panagiotis Papadimitriou
> Assignee: Ted Dunning
> Fix For: 0.4
>
> Attachments: Mahout-153.patch, MAHOUT-153_RandomFarthest.patch
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> The current implementation of k-means includes the following algorithms for 
> initial cluster selection (seed selection): 1) random selection of k points, 
> 2) use of canopy clusters.
> I plan to implement k-means++. The details of the algorithm are available 
> here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
> Design Outline: I will create an abstract class SeedGenerator and a subclass 
> KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will 
> become a subclass of SeedGenerator.
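
The design outline above could be sketched roughly as follows. This is an illustrative Python sketch, not the code in Mahout-153.patch: the class names come from the outline, while the `generate` method, its parameters, and the `squared_distance` helper are assumptions. The k-means++ subclass uses the usual D^2 weighting, where each new seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far.

```python
import random
from abc import ABC, abstractmethod
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


class SeedGenerator(ABC):
    """Hypothetical base class: picks k initial cluster centers."""

    @abstractmethod
    def generate(self, points: List[Point], k: int, rng: random.Random) -> List[Point]:
        ...


class RandomSeedGenerator(SeedGenerator):
    """Uniform random selection of k points."""

    def generate(self, points, k, rng):
        return rng.sample(points, k)


class KMeansPlusPlusSeedGenerator(SeedGenerator):
    """k-means++ seeding with D^2 weighting."""

    def generate(self, points, k, rng):
        seeds = [rng.choice(points)]
        # d2[i] is the squared distance from points[i] to its nearest seed.
        d2 = [squared_distance(p, seeds[0]) for p in points]
        while len(seeds) < k:
            if sum(d2) == 0:
                # Every point coincides with a seed; fall back to a uniform pick.
                candidate = rng.choice(points)
            else:
                candidate = rng.choices(points, weights=d2, k=1)[0]
            seeds.append(candidate)
            d2 = [min(old, squared_distance(p, candidate))
                  for old, p in zip(d2, points)]
        return seeds
```

Because already chosen points have weight zero, the k-means++ variant does not pick the same location twice as long as at least k distinct locations exist.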

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-03-03 Thread Rohini Uppuluri (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Uppuluri updated MAHOUT-153:
---

Comment: was deleted

(was: Hi, 

This is the patch for random-farthest cluster initialization. It does not
include JUnit test cases yet.

Thanks,
-Rohini)




[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-03-03 Thread Rohini Uppuluri (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Uppuluri updated MAHOUT-153:
---

Attachment: MAHOUT-153_RandomFarthest.patch

Hi, 

This is the patch for random-farthest cluster initialization. It does not
include JUnit test cases yet.

Thanks,
-Rohini




[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-03-03 Thread Rohini Uppuluri (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Uppuluri updated MAHOUT-153:
---

Attachment: MAHOUT-153_RandomFarthest.patch

Hi, 

This is the patch for the random-farthest cluster initialization that I
described earlier. Please note that it does not include JUnit test cases yet.


Thanks,
-Rohini




[jira] Updated: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-03-03 Thread Rohini Uppuluri (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Uppuluri updated MAHOUT-153:
---

Attachment: (was: MAHOUT-153_RandomFarthest.patch)




[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-10 Thread Rohini Uppuluri (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832380#action_12832380
 ] 

Rohini Uppuluri commented on MAHOUT-153:


Hi all,


I have implemented an extension to the algorithm Pallavi mentioned.
The extension adds some randomness to picking the farthest point. This
reduces the chance of picking noise points that lie very far away as
centroids.

Summary:

1. Pick the first centroid randomly.
2. For each of the remaining centroids:
   a. Compute a few candidate centroids that are far from the centroids
      chosen so far. To compute the candidates, divide the data into a
      few parts and, for each part, find the point that is farthest from
      the current list of centroids.
   b. Select one of the candidate centroids randomly.
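
The steps above can be sketched as follows. This is a rough Python illustration, not the code from the patch. It assumes that "farthest from the current list of centroids" means the largest distance to the nearest chosen centroid, and that the data is divided into contiguous slices; `num_parts` and the function names are hypothetical. Skipping candidates at distance zero is an extra safeguard added here to avoid duplicate centroids.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distance_to_centroids(point: Point, centroids: List[Point]) -> float:
    # Assumption: distance to a list of centroids is the distance to the
    # nearest one of them.
    return min(squared_distance(point, c) for c in centroids)


def random_farthest_seeds(points: List[Point], k: int, num_parts: int,
                          rng: random.Random) -> List[Point]:
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    # Step 1: pick the first centroid randomly.
    centroids = [rng.choice(points)]
    # Divide the data into contiguous parts (hypothetical partitioning).
    size = -(-len(points) // num_parts)  # ceiling division
    parts = [points[i:i + size] for i in range(0, len(points), size)]
    # Step 2: pick the remaining centroids.
    while len(centroids) < k:
        candidates = []
        for part in parts:
            farthest = max(part, key=lambda p: distance_to_centroids(p, centroids))
            # Safeguard: skip parts whose points all coincide with a centroid.
            if distance_to_centroids(farthest, centroids) > 0:
                candidates.append(farthest)
        if not candidates:
            break  # fewer than k distinct points are available
        # Select one of the candidate centroids randomly.
        centroids.append(rng.choice(candidates))
    return centroids
```

With `num_parts` set to 1 this reduces to plain farthest-point selection. Larger values give more candidates per step, so a single distant noise point is less likely to be chosen.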

I will submit a patch for this soon. Please let me know your feedback.






[jira] Updated: (MAHOUT-99) Improving speed of KMeans

2008-12-10 Thread Rohini Uppuluri (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Uppuluri updated MAHOUT-99:
--

Attachment: MAHOUT-99-1.patch

Hi Grant,

I have made them optional arguments. They are set to reasonable defaults
in case they are not given as input. I will be uploading the updated patch
reflecting the change.

This is a configuration setting that is already set up in Hadoop, but exposing
it gives us the flexibility to change it in case we want to increase the number
of map tasks.




Thanks,
-Rohini


> Improving speed of KMeans
> -
>
> Key: MAHOUT-99
> URL: https://issues.apache.org/jira/browse/MAHOUT-99
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Reporter: Pallavi Palleti
>Assignee: Grant Ingersoll
> Attachments: MAHOUT-99-1.patch, MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to 
> reducer. Previously, the whole cluster info was sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; 
> the code is modified so that it does not introduce a bug when the combiner 
> runs zero times or more than once.
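
One common way to make a job safe when the combiner runs zero or more times is to use the same associative merge in both the combiner and the reducer. The Python sketch below illustrates that idea under stated assumptions; it is not the code from MAHOUT-99.patch. All names are hypothetical, and the partial value is assumed to be a (point count, component-wise sum) pair keyed by cluster ID.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Partial = Tuple[int, List[float]]  # (point count, component-wise sum)


def nearest_cluster(point: Sequence[float], centers: Dict[str, List[float]]) -> str:
    return min(centers, key=lambda cid: sum((p - c) ** 2
                                            for p, c in zip(point, centers[cid])))


def map_point(point: Sequence[float],
              centers: Dict[str, List[float]]) -> Tuple[str, Partial]:
    # Only the cluster ID is emitted as the key, not the full cluster info.
    return nearest_cluster(point, centers), (1, list(point))


def merge(partials: List[Partial]) -> Partial:
    # Associative merge shared by the combiner and the reducer.
    count, total = 0, None
    for n, vec in partials:
        count += n
        total = list(vec) if total is None else [a + b for a, b in zip(total, vec)]
    return count, total


def combine(pairs: List[Tuple[str, Partial]]) -> List[Tuple[str, Partial]]:
    grouped = defaultdict(list)
    for cid, partial in pairs:
        grouped[cid].append(partial)
    return [(cid, merge(ps)) for cid, ps in grouped.items()]


def reduce_centers(pairs: List[Tuple[str, Partial]]) -> Dict[str, List[float]]:
    # The reducer merges whatever arrives, combined or not, then averages.
    return {cid: [s / n for s in total] for cid, (n, total) in combine(pairs)}
```

Because the combiner's output has the same shape as its input, the reducer computes the same centers whether the combiner ran zero times, once, or several times.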




[jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks

2008-12-08 Thread Rohini Uppuluri (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654375#action_12654375
 ] 

Rohini Uppuluri commented on MAHOUT-79:
---

Hi Grant, 

Regarding FuzzyKmeansUtil, there doesn't seem to be any change.

> Improving the speed of Fuzzy K-Means by optimizing data transfer between map 
> and reduce tasks
> -
>
> Key: MAHOUT-79
> URL: https://issues.apache.org/jira/browse/MAHOUT-79
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Reporter: Pallavi Palleti
>Assignee: Grant Ingersoll
> Fix For: 0.1
>
> Attachments: FUZZY-79.patch, FUZZY-79.patch, FUZZY-79.patch, 
> FUZZY-79.patch, FUZZY-79.patch, FUZZY.patch
>
>
> Improve the speed of fuzzy k-Means by passing only the cluster ID as the key 
> output of the mapper task and reading the cluster information in the reducer 
> task, where this information is needed.




[jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks

2008-12-08 Thread Rohini Uppuluri (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654374#action_12654374
 ] 

Rohini Uppuluri commented on MAHOUT-79:
---

Hi Grant, 

I am Rohini, and I work on the same team as Pallavi. Pallavi is out of the
office until the end of this month, so I will be taking care of this issue now.

I am not sure why the TestKMeansClusterer tests are all commented out.
However, I uncommented them and ran the unit tests, and they seem to run
fine. I suppose that solves the issue? Please correct me if I am wrong.




