[ 
https://issues.apache.org/jira/browse/MAHOUT-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050302#comment-13050302
 ] 

Vasil Vasilev commented on MAHOUT-479:
--------------------------------------

Hi Jeff,

I am running Dirichlet clustering over the Reuters data set. While 
investigating problems with the resulting clusters, I found an issue with the 
function calculating the pdf in 
org.apache.mahout.clustering.dirichlet.models.GaussianCluster and with the 
change you made recently:
1. Some of the clusters have a very small radius (close to 0). This leads to 
UncommonDistributions.dNorm returning 0.0 when the point is at a larger 
distance from the mean.
2. dNorm returns a probability density, not a probability, which means that 
when the radius is close to 0 and the number of dimensions of the feature 
vectors is very large (~50000), the pdf quickly goes to infinity.
3. When both 1 and 2 happen, the result for the pdf is NaN.
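
To make the three cases concrete, here is a minimal sketch in plain Java. It 
does not use Mahout code: the dNorm below is a local reimplementation of a 
univariate normal density, and the class name, the single shared radius `sd`, 
and the assumption that the cluster pdf is a product of per-dimension 
densities are all illustrative guesses, not the actual GaussianCluster 
implementation. The logPdf method shows one common workaround (summing log 
densities), offered only as a possible direction.

```java
final class GaussianPdfSketch {

    // Univariate normal density at x (local stand-in, not Mahout's dNorm).
    static double dNorm(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    // Log of the same density, computed without exponentiating.
    static double logDNorm(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return -0.5 * z * z - Math.log(sd) - 0.5 * Math.log(2.0 * Math.PI);
    }

    // Assumed pdf: product of per-dimension densities with a shared radius.
    static double productPdf(double[] x, double[] mean, double sd) {
        double pdf = 1.0;
        for (int i = 0; i < x.length; i++) {
            pdf *= dNorm(x[i], mean[i], sd);
        }
        return pdf;
    }

    // Same quantity in log space: a sum instead of a product.
    static double logPdf(double[] x, double[] mean, double sd) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += logDNorm(x[i], mean[i], sd);
        }
        return sum;
    }

    public static void main(String[] args) {
        int dims = 50000;
        double sd = 1e-3;                 // very small radius
        double[] mean = new double[dims]; // all zeros
        double[] atMean = new double[dims];
        double[] mixed = new double[dims];
        mixed[dims - 1] = 1.0;            // last dimension is far from the mean

        // Case 1: exp(-500000) underflows, so the density is 0.0.
        System.out.println(dNorm(1.0, 0.0, sd));
        // Case 2: each factor is about 399, so the product overflows.
        System.out.println(productPdf(atMean, mean, sd));
        // Case 3: the running product is Infinity, then multiplied by 0.0.
        System.out.println(productPdf(mixed, mean, sd));
        // Log space: large in magnitude but finite.
        System.out.println(logPdf(mixed, mean, sd));
    }
}
```

With these inputs the first three lines printed are 0.0, Infinity and NaN, 
and the last is a finite negative number. Note that the NaN in the product 
depends on ordering: if the far dimension came first, the running product 
would be 0.0 and stay 0.0.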

> Streamline classification/ clustering data structures
> -----------------------------------------------------
>
>                 Key: MAHOUT-479
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-479
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.1, 0.2, 0.3, 0.4
>            Reporter: Isabel Drost
>            Assignee: Isabel Drost
>
> Opening this JIRA issue to collect ideas on how to streamline our 
> classification and clustering algorithms to make integration for users easier 
> as per mailing list thread http://markmail.org/message/pnzvrqpv5226twfs
> {quote}
> Jake and Robin and I were talking the other evening and a common lament was 
> that our classification (and clustering) stuff was all over the map in terms 
> of data structures.  Driving that to rest and getting those comments even 
> vaguely as plug and play as our much more advanced recommendation components 
> would be very, very helpful.
> {quote}
> This issue probably also relates to MAHOUT-287 (intention there is to make 
> naive bayes run on vectors as input).
> Ted, Jake, Robin: Would be great if one of you could add a comment on 
> some of the issues you discussed "the other evening" and (if applicable) any 
> minor or major changes you think could help solve this issue.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
