[jira] [Updated] (MAHOUT-930) Refactor Vector Classifaction out of Clustering - Make Classification abstract

Paritosh Ranjan (Updated) (JIRA) Mon, 26 Dec 2011 09:56:53 -0800

     [ https://issues.apache.org/jira/browse/MAHOUT-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Paritosh Ranjan updated MAHOUT-930:
-----------------------------------

    Description: 
Right now, each clustering algorithm has its own runClustering (-cp) 
implementation, which produces clusteredPoints. The current design lacks:
1) Extensibility - there is no place to plug in new features, such as outlier 
removal, during classification.
2) Uniformity in design - new algorithms have no pattern to follow.
3) Abstraction - clusterData should only be concerned with classifying vectors, 
i.e. assigning vectors to clusters. It should not care about how the 
classification is done. That should be the work of a separate entity, which can 
offer features like outlier removal.

The new implementation will factor out an independent entity that performs 
the classification step independently of the various clustering 
implementations. The new design would start with ClusterClassifier, 
ClusteringPolicy and ClusterIterator, experimental versions of which are 
already committed. The currently committed version seems to work for all the 
iterative clustering algorithms.

The ClusterClassifier provides the probability of a vector belonging to each 
of the available clusters. These probabilities are converted into weights by 
ClusteringPolicy implementations, one for each clustering algorithm. This is 
the place where an outlier removal implementation can be plugged in. In the 
future, different ClusteringPolicy implementations can be provided 
(configured) for different types of classification.
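
As an illustration of the probability-to-weight step, here is a minimal 
sketch in plain Java. It is not the committed Mahout code: the names Policy, 
MaxProbabilityPolicy and OutlierRemovingPolicy, the select method and the 0.5 
threshold are all assumptions made up for this example. It only shows how an 
outlier filter could wrap another policy.

```java
import java.util.Arrays;

public class PolicySketch {

    /** Hypothetical stand-in for a clustering policy: turns probabilities into weights. */
    interface Policy {
        double[] select(double[] probabilities);
    }

    /** Hard assignment: weight 1.0 for the most likely cluster, 0.0 for the rest. */
    static final class MaxProbabilityPolicy implements Policy {
        @Override
        public double[] select(double[] probabilities) {
            double[] weights = new double[probabilities.length];
            if (probabilities.length == 0) {
                return weights;
            }
            int best = 0;
            for (int i = 1; i < probabilities.length; i++) {
                if (probabilities[i] > probabilities[best]) {
                    best = i;
                }
            }
            weights[best] = 1.0;
            return weights;
        }
    }

    /** Wraps another policy and returns all-zero weights when no cluster is likely enough. */
    static final class OutlierRemovingPolicy implements Policy {
        private final Policy delegate;
        private final double threshold; // assumed cut-off, not a Mahout parameter

        OutlierRemovingPolicy(Policy delegate, double threshold) {
            this.delegate = delegate;
            this.threshold = threshold;
        }

        @Override
        public double[] select(double[] probabilities) {
            double max = 0.0;
            for (double p : probabilities) {
                max = Math.max(max, p);
            }
            if (max < threshold) {
                // Treat the vector as an outlier: it gets no weight in any cluster.
                return new double[probabilities.length];
            }
            return delegate.select(probabilities);
        }
    }

    public static void main(String[] args) {
        Policy policy = new OutlierRemovingPolicy(new MaxProbabilityPolicy(), 0.5);
        // Clear winner: assigned to the second cluster.
        System.out.println(Arrays.toString(policy.select(new double[] {0.1, 0.7, 0.2})));
        // No cluster reaches the threshold: treated as an outlier.
        System.out.println(Arrays.toString(policy.select(new double[] {0.4, 0.3, 0.3})));
    }
}
```

Wrapping one policy in another keeps the outlier logic out of both the 
classifier and the individual clustering algorithms, which is the kind of 
extensibility point described above.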

The ClusteringPolicy can be initialized with ClusterConfig objects. These 
ClusterConfig objects would hold the clustering algorithm parameters that 
help in classification.
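
A rough sketch of what initializing a policy from a configuration object 
might look like, again in plain Java. Everything here is hypothetical: the 
Config class, the "threshold" key and ThresholdPolicy are invented for 
illustration and say nothing about what the real ClusterConfig would contain.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ConfigSketch {

    /** Hypothetical parameter holder; the key names are made up for this example. */
    static final class Config {
        private final Map<String, Double> params = new HashMap<>();

        Config set(String key, double value) {
            params.put(key, value);
            return this;
        }

        double get(String key, double defaultValue) {
            return params.getOrDefault(key, defaultValue);
        }
    }

    /** Hypothetical policy that reads its single parameter from a Config. */
    static final class ThresholdPolicy {
        private final double threshold;

        ThresholdPolicy(Config config) {
            // Falls back to 0.0 (keep everything) when the parameter is absent.
            this.threshold = config.get("threshold", 0.0);
        }

        /** Keeps a probability as the weight if it reaches the threshold, else 0.0. */
        double[] select(double[] probabilities) {
            double[] weights = new double[probabilities.length];
            for (int i = 0; i < probabilities.length; i++) {
                if (probabilities[i] >= threshold) {
                    weights[i] = probabilities[i];
                }
            }
            return weights;
        }
    }

    public static void main(String[] args) {
        Config config = new Config().set("threshold", 0.3);
        ThresholdPolicy policy = new ThresholdPolicy(config);
        System.out.println(Arrays.toString(policy.select(new double[] {0.6, 0.3, 0.1})));
    }
}
```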

The ClusterClassifier also makes it possible to train the existing 
classifiers (clusters) on the input. This is the place where clustering and 
classification will converge.

Execution is handled by a ClusterIterator for now, which runs a clustering 
policy on the input and tries to classify the vectors into different 
clusters. It can train the classifiers at the same time, since it can run for 
a given number of iterations, and each iteration would improve the quality of 
the classifiers.
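
The classify, weight, train loop described above can be sketched as follows. 
This is a toy, single-machine illustration on one-dimensional points, not the 
committed ClusterIterator: the Classifier class, the inverse-distance 
probabilities, the hardAssign helper and the close step are all assumptions 
chosen to keep the example short.

```java
import java.util.Arrays;

public class IteratorSketch {

    /** Hypothetical classifier over 1-D cluster centers (stand-in for real cluster models). */
    static final class Classifier {
        final double[] centers;
        private final double[] sums;
        private final double[] totals;

        Classifier(double[] initialCenters) {
            centers = initialCenters.clone();
            sums = new double[centers.length];
            totals = new double[centers.length];
        }

        /** Probability per cluster: inverse distance to each center, normalized to sum to 1. */
        double[] classify(double point) {
            double[] p = new double[centers.length];
            double sum = 0.0;
            for (int i = 0; i < centers.length; i++) {
                // Small constant avoids division by zero when a point sits on a center.
                p[i] = 1.0 / (1e-9 + Math.abs(point - centers[i]));
                sum += p[i];
            }
            for (int i = 0; i < p.length; i++) {
                p[i] /= sum;
            }
            return p;
        }

        /** Accumulates a weighted observation for one cluster. */
        void train(int cluster, double point, double weight) {
            sums[cluster] += weight * point;
            totals[cluster] += weight;
        }

        /** Ends an iteration: moves each center to the weighted mean of what it observed. */
        void close() {
            for (int i = 0; i < centers.length; i++) {
                if (totals[i] > 0.0) {
                    centers[i] = sums[i] / totals[i];
                }
                sums[i] = 0.0;
                totals[i] = 0.0;
            }
        }
    }

    /** Policy stand-in: weight 1.0 for the most probable cluster, 0.0 elsewhere. */
    static double[] hardAssign(double[] probabilities) {
        double[] weights = new double[probabilities.length];
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        if (weights.length > 0) {
            weights[best] = 1.0;
        }
        return weights;
    }

    /** Runs a fixed number of iterations; each pass classifies, weights and trains. */
    static Classifier iterate(double[] points, Classifier classifier, int numIterations) {
        for (int iteration = 0; iteration < numIterations; iteration++) {
            for (double point : points) {
                double[] weights = hardAssign(classifier.classify(point));
                for (int c = 0; c < weights.length; c++) {
                    if (weights[c] > 0.0) {
                        classifier.train(c, point, weights[c]);
                    }
                }
            }
            classifier.close();
        }
        return classifier;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.4, 8.6};
        Classifier trained = iterate(points, new Classifier(new double[] {0.0, 5.0}), 5);
        // The two centers should end up near the means of the two groups of points.
        System.out.println(Arrays.toString(trained.centers));
    }
}
```

With hard assignment this loop behaves like a simple k-means; swapping 
hardAssign for a policy that returns fractional weights would give a softer, 
fuzzy-style update without touching the loop itself.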



  was:
Right now, each clustering algorithm has its own runClustering (-cp) 
implementation, which produces clusteredPoints. The current design lacks:
1) Extensibility - there is no place to plug in new features, such as outlier 
removal, during classification.
2) Uniformity in design - new algorithms have no pattern to follow.
3) Abstraction - clusterData should only be concerned with classifying vectors, 
i.e. assigning vectors to clusters. It should not care about how the 
classification is done. That should be the work of a separate entity, which can 
offer features like outlier removal.

The new implementation will factor out an independent entity that performs 
the classification step independently of the various clustering 
implementations. The new design would start with ClusterClassifier, 
ClusteringPolicy and ClusterIterator, experimental versions of which are 
already committed. The currently committed version seems to work for all the 
iterative clustering algorithms.

The ClusterClassifier provides the probability of a vector belonging to each 
of the available clusters. These probabilities are converted into weights by 
ClusteringPolicy implementations, one for each clustering algorithm. This is 
the place where an outlier removal implementation can be plugged in. In the 
future, different ClusteringPolicy implementations can be provided 
(configured) for different types of classification.

The ClusterClassifier also makes it possible to train the existing 
classifiers (clusters) on the input. This is the place where clustering and 
classification will converge.

Execution is handled by a ClusterIterator for now, which runs a clustering 
policy on the input and tries to classify the vectors into different 
clusters. It can train the classifiers at the same time, since it can run for 
a given number of iterations, and each iteration would improve the quality of 
the classifiers.



    
> Refactor Vector Classifaction out of Clustering - Make Classification abstract
> ------------------------------------------------------------------------------
>
>                 Key: MAHOUT-930
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-930
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>             Fix For: 0.7

