Re: [jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

Shannon Quinn Thu, 02 May 2013 14:51:15 -0700

This sounds excellent. I'd be happy to assist in unifying the interfaces of the spectral methods in particular.


On 5/2/13 3:54 PM, Yu Lee (JIRA) wrote:

[ https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647841#comment-13647841 ]


Yu Lee commented on MAHOUT-1177:
--------------------------------

Hello Robin Anil, Jeff Eastman, Dan Filimon, and Ted Dunning,

Yexi and I (Yu Lee) are new to the Mahout community. We would like to contribute 
to Mahout by reforming and simplifying the clustering APIs, as described at 
the following link:
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644120#comment-13644120

We have gone through the code of Mahout clustering. Now we have some ideas 
about improving it:

=========================================================================================
Addressing the problems in the current interface:

Test cases are missing. For example, in spectral k-means clustering, the run 
methods of SpectralKmeansDriver and EigencutsDriver are not tested.

Documentation is missing for some methods. For example: in the run method of 
DirichletDriver, the description of the parameter 'numModels' is missing; in the 
run method of SpectralKmeansDriver, the descriptions of some arguments are 
missing.

Some methods do not describe their arguments specifically enough. For example: in the 
run method of FuzzyKmeansDriver, the description of the argument "m" (fuzzification 
factor) is missing. Although a wiki link regarding "Clustering Analysis" is given, it is 
not clear enough.

-----------------------------------------------------------------------------------------

Implementing some new clustering algorithms

Agglomerative hierarchical clustering, which clusters the data points into 
a dendrogram, so that users can choose whatever number of clusters they 
want. (http://en.wikipedia.org/wiki/Hierarchical_clustering)

DBSCAN, a density-based clustering method that can identify clusters of 
arbitrary shape and is useful in spatial clustering. 
(http://en.wikipedia.org/wiki/DBSCAN)
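
For readers unfamiliar with DBSCAN, the following is a minimal single-machine sketch in Java. It is not Mahout code and not a MapReduce design: the class name DbscanSketch, the brute-force neighbor search, and the Euclidean distance are assumptions chosen for brevity. A point with at least minPts points within distance eps (itself included) starts or extends a cluster; points reachable from no such point are labeled as noise.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch, not part of Mahout: in-memory DBSCAN with a
// brute-force O(n^2) neighbor search.
final class DbscanSketch {
    static final int NOISE = -1;
    private static final int UNVISITED = -2;

    private DbscanSketch() {}

    // Returns one label per point: a cluster id starting at 0, or NOISE.
    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNVISITED);
        int clusterId = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) {
                continue;
            }
            List<Integer> neighbors = regionQuery(points, i, eps);
            if (neighbors.size() < minPts) {
                labels[i] = NOISE; // may later become a border point
                continue;
            }
            // Point i is a core point: grow a new cluster from it.
            labels[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(neighbors);
            while (!queue.isEmpty()) {
                int j = queue.poll();
                if (labels[j] == NOISE) {
                    labels[j] = clusterId; // border point, not expanded
                }
                if (labels[j] != UNVISITED) {
                    continue;
                }
                labels[j] = clusterId;
                List<Integer> jNeighbors = regionQuery(points, j, eps);
                if (jNeighbors.size() >= minPts) {
                    queue.addAll(jNeighbors); // j is also a core point
                }
            }
            clusterId++;
        }
        return labels;
    }

    // All points within eps of points[index], including the point itself.
    static List<Integer> regionQuery(double[][] points, int index, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            if (distance(points[index], points[j]) <= eps) {
                result.add(j);
            }
        }
        return result;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

Unlike k-means, the number of clusters is not an input here; it follows from eps and minPts, which is why the method can recover clusters of arbitrary shape.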

-----------------------------------------------------------------------------------------

Providing a new unified interface

Currently, each clustering algorithm has its own implementation class with 
a different interface (i.e., the run methods in different Drivers have different 
argument lists). It would be better to have a unified interface for executing 
all available clustering methods. An example interface is as follows:

Clustering-run(input, output, methodClass, clusteringConfig)

Here, the "methodClass" indicates a specific clustering method, while 
"clusteringConfig" indicates the configuration for this specific clustering method.
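
One way such an entry point could look in Java is sketched below. Every name here (ClusteringMethod, KMeansMethod, Clustering, the "k" config key) is hypothetical and does not exist in Mahout, and the method body only builds a summary string instead of launching any job. The sketch assumes the configuration is a plain string map and that each method class has a no-argument constructor.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical common contract that every clustering method would implement.
interface ClusteringMethod {
    // Runs the method on 'input', writes to 'output', returns a short summary.
    String run(String input, String output, Map<String, String> clusteringConfig);
}

// Hypothetical stand-in for a k-means driver; it only reports what it would do.
final class KMeansMethod implements ClusteringMethod {
    @Override
    public String run(String input, String output, Map<String, String> config) {
        // "k" is an assumed config key with an assumed default of 10.
        int k = Integer.parseInt(config.getOrDefault("k", "10"));
        return "kmeans k=" + k + " " + input + " -> " + output;
    }
}

// Hypothetical unified entry point mirroring the proposed signature.
final class Clustering {
    private Clustering() {}

    static String run(String input, String output,
                      Class<? extends ClusteringMethod> methodClass,
                      Map<String, String> clusteringConfig) {
        try {
            // Instantiate the requested method via its no-argument constructor.
            ClusteringMethod method = methodClass.getDeclaredConstructor().newInstance();
            return method.run(input, output, clusteringConfig);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot instantiate " + methodClass.getName(), e);
        }
    }
}

final class UnifiedClusteringDemo {
    private UnifiedClusteringDemo() {}

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("k", "3");
        System.out.println(Clustering.run("in", "out", KMeansMethod.class, config));
    }
}
```

With this shape, supporting another algorithm means adding one more ClusteringMethod implementation; callers keep using the same four arguments. A typed configuration object per method would catch misspelled keys earlier than a string map, at the cost of a less uniform signature.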

=========================================================================================

Could you please let us know what you think about our ideas?

GSOC 2013: Reform and simplify the clustering APIs
--------------------------------------------------

                 Key: MAHOUT-1177
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1177
             Project: Mahout
          Issue Type: Improvement
            Reporter: Dan Filimon
              Labels: gsoc2013, mentor

Clustering is one of the most used features in Mahout and has many applications 
[http://en.wikipedia.org/wiki/Cluster_analysis#Applications].
We have lots of clustering algorithms. There's:
- basic k-means
- canopy clustering
- Dirichlet clustering
- Fuzzy k-means
- Spectral k-means
- Streaming k-means [coming soon]
We want to make them easier to use by updating the APIs and making sure they all 
work in the same way, with consistent inputs, outputs, diagnostics and 
documentation.
This is a great way to gain an in-depth understanding of clustering algorithms, 
familiarize yourself with Hadoop, Mahout clustering and good software 
engineering principles.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
