[ https://issues.apache.org/jira/browse/MAHOUT-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paritosh Ranjan updated MAHOUT-930:
-----------------------------------
Description:
Right now, each clustering algorithm has its own runClustering (-cp)
implementation, which produces clusteredPoints. The current design lacks:
1) Extensibility - there is no place to plug in new features, such as outlier
removal, during classification.
2) Uniformity in design - new algorithms don't have a pattern to follow.
3) Abstraction - clusterData should only be concerned with classifying vectors,
i.e. assigning vectors to clusters. It should not care about how to classify;
that should be the work of a separate entity, which can offer features like
outlier removal.
The new implementation will factor out and implement an independent entity that
performs the classification step independently of the various clustering
implementations. The new design starts with ClusterClassifier,
ClusteringPolicy and ClusterIterator, experimental versions of which are
already committed. The currently committed version seems to work for all the
iterative clustering algorithms.
The ClusterClassifier provides the probability of any vector belonging to each
of the available clusters. These probabilities are converted into weights by
ClusteringPolicy implementations, each specific to its respective clustering
algorithm. This is the place where the outlier removal implementation can be
plugged in. In the future, different implementations of ClusteringPolicy can be
provided (configured) for different types of classification.
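The conversion from probabilities to weights, with outlier removal plugged in
at the policy level, could look roughly like the sketch below. Every name in it
(PolicySketch, WeightPolicy, HardAssignmentPolicy, OutlierRemovingPolicy, the
threshold parameter) is hypothetical and chosen for illustration only; it is
not the committed ClusteringPolicy code, and plain double[] arrays stand in for
Mahout vectors.

```java
import java.util.Arrays;

/** Illustrative sketch only; names and signatures are assumptions, not Mahout's API. */
final class PolicySketch {

  /** Converts per-cluster probabilities into per-cluster weights. */
  interface WeightPolicy {
    double[] select(double[] probabilities);
  }

  /** Hard assignment: weight 1.0 for the most probable cluster, 0.0 elsewhere. */
  static final class HardAssignmentPolicy implements WeightPolicy {
    @Override
    public double[] select(double[] probabilities) {
      double[] weights = new double[probabilities.length];
      if (probabilities.length == 0) {
        return weights;
      }
      int best = 0;
      for (int i = 1; i < probabilities.length; i++) {
        if (probabilities[i] > probabilities[best]) {
          best = i;
        }
      }
      weights[best] = 1.0;
      return weights;
    }
  }

  /**
   * Wraps another policy and gives a vector zero weight in every cluster when
   * its best probability is below the (hypothetical) threshold, so the vector
   * is treated as an outlier.
   */
  static final class OutlierRemovingPolicy implements WeightPolicy {
    private final WeightPolicy delegate;
    private final double threshold;

    OutlierRemovingPolicy(WeightPolicy delegate, double threshold) {
      this.delegate = delegate;
      this.threshold = threshold;
    }

    @Override
    public double[] select(double[] probabilities) {
      double max = 0.0;
      for (double p : probabilities) {
        max = Math.max(max, p);
      }
      if (max < threshold) {
        return new double[probabilities.length]; // all zeros: outlier
      }
      return delegate.select(probabilities);
    }
  }

  public static void main(String[] args) {
    WeightPolicy policy = new OutlierRemovingPolicy(new HardAssignmentPolicy(), 0.5);
    // Confident vector: assigned to the second cluster.
    System.out.println(Arrays.toString(policy.select(new double[] {0.1, 0.7, 0.2})));
    // Ambiguous vector: best probability is below 0.5, so it gets no weight.
    System.out.println(Arrays.toString(policy.select(new double[] {0.4, 0.3, 0.3})));
  }
}
```

Wrapping one policy in another keeps the outlier check independent of how the
underlying algorithm turns probabilities into weights, which is the kind of
extensibility point described above.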
The ClusteringPolicy can be initialized with ClusterConfig objects. These
ClusterConfig objects would hold the clustering algorithm parameters that are
needed for classification.
The ClusterClassifier also provides the capability to train the existing
classifiers (clusters) on the input. This is the place where clustering and
classification will converge.
For now, execution is done by a ClusterIterator, which runs a clustering
policy on the input and tries to classify the vectors into different clusters.
It can simultaneously train the classifiers, since it can run for a given
number of iterations, and each iteration would improve the quality of the
classifiers.
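The classify-then-train loop can be sketched as follows. This is a minimal
one-dimensional illustration under stated assumptions: the class and method
names (IteratorSketch, classify, select, iterate) are hypothetical, the
probability is a simple normalized inverse distance, and the policy is a hard
assignment. None of this is taken from the committed ClusterIterator.

```java
import java.util.Arrays;

/** Illustrative sketch only; names and formulas are assumptions, not Mahout's API. */
final class IteratorSketch {

  /** Classifier step: probability-like scores that sum to 1; closer centers score higher. */
  static double[] classify(double[] centers, double point) {
    double[] p = new double[centers.length];
    double total = 0.0;
    for (int i = 0; i < centers.length; i++) {
      p[i] = 1.0 / (1.0 + Math.abs(point - centers[i]));
      total += p[i];
    }
    for (int i = 0; i < p.length; i++) {
      p[i] /= total;
    }
    return p;
  }

  /** Policy step (hard assignment): index of the most probable cluster. */
  static int select(double[] probabilities) {
    int best = 0;
    for (int i = 1; i < probabilities.length; i++) {
      if (probabilities[i] > probabilities[best]) {
        best = i;
      }
    }
    return best;
  }

  /** Iterator step: classify every point, then retrain the centers, for a fixed number of passes. */
  static double[] iterate(double[] initialCenters, double[] points, int iterations) {
    double[] centers = initialCenters.clone();
    for (int it = 0; it < iterations; it++) {
      double[] sum = new double[centers.length];
      int[] count = new int[centers.length];
      for (double x : points) {
        int c = select(classify(centers, x));
        sum[c] += x;
        count[c]++;
      }
      // Training: move each center to the mean of the points assigned to it.
      for (int i = 0; i < centers.length; i++) {
        if (count[i] > 0) {
          centers[i] = sum[i] / count[i];
        }
      }
    }
    return centers;
  }

  public static void main(String[] args) {
    double[] points = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
    double[] trained = iterate(new double[] {0.0, 5.0}, points, 5);
    System.out.println(Arrays.toString(trained)); // prints [2.0, 11.0]
  }
}
```

With the starting centers 0.0 and 5.0, the first pass still assigns the point
3.0 to the second cluster; the second pass corrects this, which shows how
additional iterations improve the classifiers.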
was:
Right now, each clustering algorithm has its own runClustering (-cp)
implementation, which produces clusteredPoints. The current design lacks:
1) Extensibility - there is no place to plug in new features, such as outlier
removal, during classification.
2) Uniformity in design - new algorithms don't have a pattern to follow.
3) Abstraction - clusterData should only be concerned with classifying vectors,
i.e. assigning vectors to clusters. It should not care about how to classify;
that should be the work of a separate entity, which can offer features like
outlier removal.
The new implementation will factor out and implement an independent entity that
performs the classification step independently of the various clustering
implementations. The new design starts with ClusterClassifier,
ClusteringPolicy and ClusterIterator, experimental versions of which are
already committed. The currently committed version seems to work for all the
iterative clustering algorithms.
The ClusterClassifier provides the probability of any vector belonging to each
of the available clusters. These probabilities are converted into weights by
ClusteringPolicy implementations, each specific to its respective clustering
algorithm. This is the place where the outlier removal implementation can be
plugged in. In the future, different implementations of ClusteringPolicy can be
provided (configured) for different types of classification.
The ClusterClassifier also provides the capability to train the existing
classifiers (clusters) on the input. This is the place where clustering and
classification will converge.
For now, execution is done by a ClusterIterator, which runs a clustering
policy on the input and tries to classify the vectors into different clusters.
It can simultaneously train the classifiers, since it can run for a given
number of iterations, and each iteration would improve the quality of the
classifiers.
> Refactor Vector Classification out of Clustering - Make Classification abstract
> -------------------------------------------------------------------------------
>
> Key: MAHOUT-930
> URL: https://issues.apache.org/jira/browse/MAHOUT-930
> Project: Mahout
> Issue Type: Improvement
> Components: Classification, Clustering
> Affects Versions: 0.6
> Reporter: Paritosh Ranjan
> Fix For: 0.7