[jira] Commented: (MAHOUT-236) Cluster Evaluation Tools

Jeff Eastman (JIRA) Tue, 20 Apr 2010 12:11:16 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859027#action_12859027
 ]


Jeff Eastman commented on MAHOUT-236:
-------------------------------------

I'm running into a challenge integrating Fuzzy KMeans (and Dirichlet) into this 
evaluator. Currently the clustering step of fuzzyK emits the vector as the key 
and a FuzzyKMeansOutput writable as the value of the sequence file. This is 
backwards from the [clusterId :: VectorWritable] encoding that the patch uses 
for Canopy and KMeans. Also, the Fuzzy...Output bean contains all of the 
clusters and the probability that the vector is a member of each, which is 
rather large to serve as a key. 

For CDbw to find the reference points, it really needs to iterate over 
[clusterId :: VectorWritable] pairs, and this raises the question of what to do 
with fuzzy membership. I don't know if CDbw can be adjusted to handle fuzziness 
in general, but it will probably work with some points assigned to more 
than one cluster. Does it make sense to apply a settable threshold to the 
clustering step so that every point whose membership probability for a cluster 
exceeds the threshold would be assigned to that cluster?
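
A minimal sketch of the thresholding idea, in plain Java with no Mahout types, 
might look like the following. The names ThresholdAssignment, Assignment and 
assign are hypothetical and are not part of the patch; a double[] stands in for 
VectorWritable, and a Map of clusterId to probability stands in for the 
memberships carried by the fuzzy output bean.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: not Mahout code, and the names here are assumptions. */
public class ThresholdAssignment {

  /** Stand-in for one emitted [clusterId :: VectorWritable] pair. */
  public static final class Assignment {
    public final int clusterId;
    public final double[] point;

    Assignment(int clusterId, double[] point) {
      this.clusterId = clusterId;
      this.point = point;
    }
  }

  /**
   * Emits one pair per cluster whose membership probability is strictly
   * greater than the threshold, so a point may appear under several clusters
   * or under none at all.
   */
  public static List<Assignment> assign(double[] point,
                                        Map<Integer, Double> memberships,
                                        double threshold) {
    List<Assignment> pairs = new ArrayList<>();
    for (Map.Entry<Integer, Double> entry : memberships.entrySet()) {
      if (entry.getValue() > threshold) {
        pairs.add(new Assignment(entry.getKey(), point));
      }
    }
    return pairs;
  }
}
```

One consequence visible in the sketch: a point whose memberships all fall at or 
below the threshold is emitted for no cluster, so the threshold also controls 
how many points drop out of the evaluation entirely.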

This would also work for Dirichlet. To implement it in fuzzyK, I would need to 
change the FuzzyKMeansClusterer and FuzzyKMeansClusterMapper to match the other 
clustering jobs.

Does this make sense?

> Cluster Evaluation Tools
> ------------------------
>
>                 Key: MAHOUT-236
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-236
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Grant Ingersoll
>         Attachments: MAHOUT-236.patch
>
>
> Per 
> http://www.lucidimagination.com/search/document/10b562f10288993c/validating_clustering_output#9d3f6a55f4a91cb6,
>  it would be great to have some utilities to help evaluate the effectiveness 
> of clustering.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
