Re: [jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

姜页希 Fri, 03 May 2013 05:11:26 -0700

Are there any other comments on this issue?



2013/5/2 Shannon Quinn <[email protected]>

> This sounds excellent. I'd be happy to assist in unifying the interfaces
> of the spectral methods in particular.
>
>
> On 5/2/13 3:54 PM, Yu Lee (JIRA) wrote:
>
>>      [ https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647841#comment-13647841 ]
>>
>> Yu Lee commented on MAHOUT-1177:
>> --------------------------------
>>
>> Hello Robin Anil, Jeff Eastman, Dan Filimon, and Ted Dunning,
>>
>> Yexi and I (Yu Lee) are new to this Mahout community. We want to
>> contribute to the improvement of Mahout by reforming and simplifying the
>> clustering APIs per the following link:
>> https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644120#comment-13644120
>>
>> We have gone through the code of Mahout clustering. Now we have some
>> ideas about improving it:
>>
>> ========================================================================
>> Addressing the problems in the current interface:
>>
>> Test cases are missing. For example, in spectral k-means clustering, the
>> run methods of SpectralKmeansDriver and EigencutsDriver are not tested.
>>
>> Documentation is missing for some methods. For example, in the run
>> method of DirichletDriver, the description of the parameter 'numModels' is
>> missing; in the run method of SpectralKmeansDriver, the descriptions of
>> some arguments are missing.
>>
>> Some methods lack specific descriptions of some arguments. For example,
>> in the run method of FuzzyKmeansDriver, the description of the argument
>> "m" (the fuzzification factor) is missing. Although a wiki link about
>> "Clustering Analysis" is given, it is not clear enough.
>>
>> ------------------------------------------------------------------------
>>
>> Implementing some new clustering algorithms
>>
>> Agglomerative hierarchical clustering, which clusters the data points
>> into a dendrogram, so that users can choose whatever number of clusters
>> they want. (http://en.wikipedia.org/wiki/Hierarchical_clustering)
>>
>> DBSCAN, a density-based clustering method that can identify clusters of
>> arbitrary shape and is useful in spatial clustering.
>> (http://en.wikipedia.org/wiki/DBSCAN)
>>
>> ------------------------------------------------------------------------
>>
>> Providing a new unified interface
>>
>> Currently, each clustering algorithm has its own implementation class with
>> a different interface (i.e., the run methods in different Drivers have
>> different argument lists). It would be better to have a unified interface
>> for executing all available clustering methods. An example interface is as
>> follows:
>>
>> Clustering.run(input, output, methodClass, clusteringConfig)
>>
>> Here, the "methodClass" indicates a specific clustering method, while
>> "clusteringConfig" indicates the configuration for this specific clustering
>> method.
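
A minimal in-memory sketch of how the run(input, output, methodClass, clusteringConfig) call proposed above might be shaped in Java. Everything in it is an assumption made for illustration: ClusteringMethod, ClusteringConfig, and NearestSeedMethod are hypothetical names, not existing Mahout classes, and a real version would presumably take Hadoop paths for input and output rather than in-memory lists. The toy method only assigns each point to the nearest of the first k points; it is there to exercise the interface, not to stand in for k-means.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-method configuration: a bag of named numeric parameters. */
final class ClusteringConfig {
    private final Map<String, Double> params = new HashMap<>();

    ClusteringConfig set(String name, double value) {
        params.put(name, value);
        return this;
    }

    double get(String name, double defaultValue) {
        return params.getOrDefault(name, defaultValue);
    }
}

/** Hypothetical common contract that every clustering method would implement. */
interface ClusteringMethod {
    /** Returns one cluster index per input point, in input order. */
    int[] cluster(List<double[]> input, ClusteringConfig config);
}

/** Toy method: uses the first k points as seeds and assigns each point to the nearest seed. */
final class NearestSeedMethod implements ClusteringMethod {
    @Override
    public int[] cluster(List<double[]> input, ClusteringConfig config) {
        int k = Math.min((int) config.get("k", 2), input.size());
        int[] labels = new int[input.size()];
        for (int i = 0; i < input.size(); i++) {
            double best = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double d = squaredDistance(input.get(i), input.get(c));
                if (d < best) {
                    best = d;
                    labels[i] = c;
                }
            }
        }
        return labels;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return sum;
    }
}

/** Single entry point: the caller picks the method by class and passes its configuration. */
final class Clustering {
    private Clustering() {}

    static void run(List<double[]> input, List<Integer> output,
                    Class<? extends ClusteringMethod> methodClass,
                    ClusteringConfig clusteringConfig) {
        ClusteringMethod method;
        try {
            // Instantiate the requested method through its no-argument constructor.
            method = methodClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + methodClass, e);
        }
        for (int label : method.cluster(input, clusteringConfig)) {
            output.add(label);
        }
    }
}

final class UnifiedClusteringSketch {
    public static void main(String[] args) {
        List<double[]> input = new ArrayList<>();
        input.add(new double[] {0.0, 0.0});
        input.add(new double[] {10.0, 10.0});
        input.add(new double[] {0.5, 0.2});
        input.add(new double[] {9.0, 11.0});

        List<Integer> output = new ArrayList<>();
        Clustering.run(input, output, NearestSeedMethod.class,
                new ClusteringConfig().set("k", 2));
        System.out.println(output);
    }
}
```

With this shape, switching algorithms would only change the methodClass and clusteringConfig arguments, while the call site stays the same.
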
>>
>> ========================================================================
>>
>> Could you please let us know what you think about our ideas?
>>
>>
>>
>>
>>> GSOC 2013: Reform and simplify the clustering APIs
>>> --------------------------------------------------
>>>
>>>                  Key: MAHOUT-1177
>>>                  URL: https://issues.apache.org/jira/browse/MAHOUT-1177
>>>              Project: Mahout
>>>           Issue Type: Improvement
>>>             Reporter: Dan Filimon
>>>               Labels: gsoc2013, mentor
>>>
>>> Clustering is one of the most used features in Mahout and has many
>>> applications
>>> [http://en.wikipedia.org/wiki/Cluster_analysis#Applications].
>>> We have lots of clustering algorithms. There's:
>>> - basic k-means
>>> - canopy clustering
>>> - Dirichlet clustering
>>> - Fuzzy k-means
>>> - Spectral k-means
>>> - Streaming k-means [coming soon]
>>> We want to make them easier to use by updating the APIs and making sure
>>> they all work in the same way, with consistent inputs, outputs,
>>> diagnostics and documentation.
>>> This is a great way to gain an in-depth understanding of clustering
>>> algorithms, familiarize yourself with Hadoop, Mahout clustering and good
>>> software engineering principles.
>>>
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>
>


-- 
------
Yexi Jiang,
ECS 251,  [email protected]
School of Computer and Information Science,
Florida International University
Homepage: http://users.cis.fiu.edu/~yjian004/
