[ 
https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176227#comment-13176227
 ] 

Paritosh Ranjan commented on MAHOUT-931:
----------------------------------------

I am a bit confused.

Are we planning to get rid of the way clustering is currently done, which 
is algorithm-specific (i.e. the code in CanopyClusterer)?
Will the new clustering strategy be "only" what is implemented in 
ClusterClassifier, i.e. calculating the probabilities of a vector belonging 
to the different models (clusters) and choosing the model with the highest 
probability?

If yes, then implementing a clustering policy for each clustering algorithm 
is all that is needed. And for outlier removal, just a threshold probability 
will be needed: vectors whose highest probability falls below that threshold 
won't be clustered. Am I correct?
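
To make sure we are talking about the same thing, here is a minimal sketch 
of what I understand by "pick the most probable model, or drop the vector 
if it is below the threshold". It is plain Java and does not use any Mahout 
classes. All names in it (ThresholdAssignmentSketch, toProbabilities, 
assign, OUTLIER) are hypothetical, and the inverse-distance weighting is 
only an assumed stand-in for whatever a real clustering policy would compute.

```java
import java.util.Arrays;

/** Hypothetical sketch; none of these names come from Mahout. */
final class ThresholdAssignmentSketch {

  /** Marker returned when a vector is treated as an outlier. */
  static final int OUTLIER = -1;

  /**
   * Assumed stand-in for a clustering policy: turns the distances from one
   * vector to each cluster center into weights that sum to 1, so that
   * closer clusters get a higher weight.
   */
  static double[] toProbabilities(double[] distances) {
    double[] p = new double[distances.length];
    double sum = 0.0;
    for (int i = 0; i < distances.length; i++) {
      p[i] = 1.0 / (1.0 + distances[i]);
      sum += p[i];
    }
    for (int i = 0; i < p.length; i++) {
      p[i] /= sum;
    }
    return p;
  }

  /**
   * Returns the index of the most probable cluster, or OUTLIER if even the
   * best probability is below the threshold. Ties go to the lowest index.
   */
  static int assign(double[] probabilities, double threshold) {
    int best = OUTLIER;
    double bestP = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < probabilities.length; i++) {
      if (probabilities[i] > bestP) {
        bestP = probabilities[i];
        best = i;
      }
    }
    return bestP >= threshold ? best : OUTLIER;
  }

  public static void main(String[] args) {
    double[] near = toProbabilities(new double[] {1.0, 4.0, 9.0});
    double[] far = toProbabilities(new double[] {5.0, 5.0, 5.0});
    System.out.println(Arrays.toString(near) + " -> " + assign(near, 0.5));
    System.out.println(Arrays.toString(far) + " -> " + assign(far, 0.5));
  }
}
```

In this sketch the only algorithm-specific part is toProbabilities; assign 
and the threshold check stay the same whichever algorithm produced the 
probabilities.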

Until now, I have been assuming that the clustering code just needs to be 
refactored out (without changing the implementation). If that is the case, 
then I think I have been proceeding in the right direction (in terms of 
design).

However, I suspect that we are not in sync about how this should be 
implemented. I think you want to change the clustering implementation to a 
cluster classification implementation with outlier removal (and completely 
get rid of the algorithm-specific implementation, which makes sense).

So, it would be really helpful if you could clarify my doubts.



                
> Implement a pluggable outlier removal capability for cluster classifiers
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-931
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-931
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: MAHOUT-931
>
>
> A pluggable outlier removal capability while classifying the clusters is 
> needed. The classification and outlier removal implementations should be 
> completely separate entities for better abstraction. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
