[jira] Updated: (MAHOUT-160) ClusterDumper utility to output all the clusters in all sequence files and points

Shashikant Kore (JIRA) Thu, 06 Aug 2009 06:32:42 -0700

     [ 
https://issues.apache.org/jira/browse/MAHOUT-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shashikant Kore updated MAHOUT-160:
-----------------------------------

    Attachment: mahout-160-dict.patch

This patch accepts the term dictionary (created while creating document 
vectors). Along with the centroid vector and the points of the cluster, it 
also prints the top 10 features of that vector. 

A couple of improvements can be made to this patch: (a) The delimiter is 
hard-coded to tab; it could be taken as user input. (b) Currently the top 10 
terms of the vector are printed; this could be a configurable number. 
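
As a rough illustration of the behaviour described above, here is a minimal 
sketch in plain Java with the delimiter and the number of terms passed in as 
parameters. It is not taken from mahout-160-dict.patch: the class and method 
names are hypothetical, the dictionary is assumed to hold one 
"term<delimiter>index" entry per line, and the centroid is modelled as a plain 
double[] rather than a Mahout vector. 

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the code in the patch.
final class TopTermsSketch {

    // Parses dictionary lines of the ASSUMED form "term<delimiter>index"
    // into an index -> term map. Lines without two fields are skipped.
    static Map<Integer, String> parseDictionary(List<String> lines, String delimiter) {
        Map<Integer, String> dictionary = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(Pattern.quote(delimiter));
            if (fields.length < 2) {
                continue;
            }
            dictionary.put(Integer.parseInt(fields[1].trim()), fields[0]);
        }
        return dictionary;
    }

    // Returns up to numTerms terms with the largest non-zero weights in the
    // centroid, heaviest first. Indices missing from the dictionary are
    // reported by their number.
    static List<String> topTerms(double[] centroid, Map<Integer, String> dictionary, int numTerms) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < centroid.length; i++) {
            if (centroid[i] != 0.0) {
                indices.add(i);
            }
        }
        indices.sort(Comparator.comparingDouble((Integer i) -> centroid[i]).reversed());

        List<String> terms = new ArrayList<>();
        for (int index : indices.subList(0, Math.min(numTerms, indices.size()))) {
            terms.add(dictionary.getOrDefault(index, String.valueOf(index)));
        }
        return terms;
    }
}
```

Calling topTerms(centroid, dictionary, 10) with a tab delimiter would match the 
current hard-coded behaviour; the two extra parameters correspond to 
improvements (a) and (b). 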

I will update the patch again if this feature is deemed useful. 


> ClusterDumper utility to output all the clusters in all sequence files and 
> points
> ---------------------------------------------------------------------------------
>
>                 Key: MAHOUT-160
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-160
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Shashikant Kore
>         Attachments: mahout-160-dict.patch, mahout-160.patch
>
>
> The current ClusterDumper utility takes a sequence file and a points file as 
> input and prints the cluster vector along with the points that belong to the 
> clusters in the sequence file. This utility doesn't produce correct results 
> when there are multiple sequence files and points. 
> To avoid this problem, all the point-to-cluster mappings need to be read 
> first, and only then should the sequence files be iterated over.  
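
The two-pass approach described in the issue can be sketched roughly as 
follows. This is an illustration under stated assumptions, not the code in 
mahout-160.patch: the class and method names are hypothetical, each points 
file is modelled as an in-memory map from point id to cluster id, and each 
sequence file as a list of cluster ids, instead of Hadoop sequence files. 

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two-pass idea; file I/O is deliberately left out.
final class TwoPassDumpSketch {

    // Pass 1: merge the point -> cluster mappings from ALL points files
    // into a single cluster -> points index.
    static Map<String, List<String>> groupPointsByCluster(List<Map<String, String>> pointsFiles) {
        Map<String, List<String>> pointsByCluster = new HashMap<>();
        for (Map<String, String> pointsFile : pointsFiles) {
            for (Map.Entry<String, String> entry : pointsFile.entrySet()) {
                pointsByCluster
                    .computeIfAbsent(entry.getValue(), k -> new ArrayList<>())
                    .add(entry.getKey());
            }
        }
        return pointsByCluster;
    }

    // Pass 2: iterate over every sequence file and emit one tab-delimited
    // line per cluster, using the complete index built in pass 1.
    static List<String> dump(List<List<String>> clusterFiles, Map<String, List<String>> pointsByCluster) {
        List<String> lines = new ArrayList<>();
        for (List<String> clusterFile : clusterFiles) {
            for (String clusterId : clusterFile) {
                List<String> points = pointsByCluster.getOrDefault(clusterId, List.of());
                lines.add(clusterId + "\t" + points);
            }
        }
        return lines;
    }
}
```

Because the index is complete before any cluster is printed, a cluster whose 
points are spread across several points files is still reported with all of 
them. 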

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
