[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2014-02-01 Thread Nilesh Chakraborty (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888789#comment-13888789
 ] 

Nilesh Chakraborty commented on MAHOUT-1177:


This is a pretty important issue, and I think it would be awesome to implement; 
I'm quite keen on it myself. Was any code written for this? Has any progress been made?

> GSOC 2013: Reform and simplify the clustering APIs
> --
>
> Key: MAHOUT-1177
> URL: https://issues.apache.org/jira/browse/MAHOUT-1177
> Project: Mahout
> Issue Type: Improvement
> Reporter: Dan Filimon
> Labels: gsoc2013, mentor
> Fix For: Backlog
>
>
> Clustering is one of the most used features in Mahout and has many 
> applications [http://en.wikipedia.org/wiki/Cluster_analysis#Applications].
> We have lots of clustering algorithms. There's:
> - basic k-means
> - canopy clustering
> - Dirichlet clustering
> - Fuzzy k-means
> - Spectral k-means
> - Streaming k-means [coming soon]
> We want to make them easier to use by updating the APIs and making sure they 
> all work in the same way and have consistent inputs, outputs, diagnostics, and 
> documentation.
> This is a great way to gain an in-depth understanding of clustering 
> algorithms and to familiarize yourself with Hadoop, Mahout clustering, and good 
> software engineering principles.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: [jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-23 Thread 姜页希
Shannon,

Do you mean that we need to give a specific plan right now? Or wait until
you finish your work?





-- 
--
Yexi Jiang,
ECS 251,  yjian...@cs.fiu.edu
School of Computer and Information Science,
Florida International University
Homepage: http://users.cis.fiu.edu/~yjian004/


[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-23 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665203#comment-13665203
 ] 

Shannon Quinn commented on MAHOUT-1177:
---

Yu Lee and Yexi: For the time being, I'd be on board with shelving the addition 
of any new clustering algorithms, and instead focusing on improving 
documentation and unifying the APIs for the existing ones. I think that would 
help scope your work a little more effectively, while still providing an 
extremely valuable body of work. Plus, it would greatly aid the development of 
new algorithms to have a specific interface to build into. Beyond that, I think 
your ideas are good and would encourage you to start laying out your specific 
plans.

Ravi: I would suggest browsing the open JIRAs for Mahout and submitting a patch 
for one you think you can tackle. Please feel free to ping our email list if 
you have specific questions; please also send general questions to the 
list rather than posting them on JIRA.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-22 Thread Ravi Mummulla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664874#comment-13664874
 ] 

Ravi Mummulla commented on MAHOUT-1177:
---

Hi Folks,
I am new to this project, but have experience with Hadoop. I work in the 
Seattle area in the Big Data space, and I am also working on my second graduate 
degree (in Math and Stat). My intent is not GSoC participation; I just want to 
contribute to the Mahout project. Please let me know how I can help.

Thanks.



Re: [jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-06 Thread 姜页希
Hi, All,

We have been waiting for comments for a couple of days, and we are not sure
how to move on to the next step. Can anyone advise?

If this idea is not good enough, what else can we do to contribute to this
community?

Regards,
Yexi



Re: [jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-03 Thread yu lee
Co-ask.

Shannon: we'd be happy if you are going to help us!

Ted: what do you think about our (Yexi's and my) ideas? Shall we move on to
the proposal?



Re: [jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-03 Thread 姜页希
Are there any other comments on this issue?




Re: [jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-02 Thread Shannon Quinn
This sounds excellent. I'd be happy to assist in unifying the interfaces 
of the spectral methods in particular.






[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-05-02 Thread Yu Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647841#comment-13647841
 ] 

Yu Lee commented on MAHOUT-1177:


Hello Robin Anil, Jeff Eastman, Dan Filimon, and Ted Dunning,

Yexi and I (Yu Lee) are new to this Mahout community. We want to contribute to 
the improvement of Mahout by reforming and simplifying the clustering APIs per 
the following link:
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644120#comment-13644120

We have gone through the code of Mahout clustering. Now we have some ideas 
about improving it:

=
Addressing the problems in the current interface:

Test cases are missing. For example, in spectral k-means clustering, the run 
methods of SpectralKmeansDriver and EigencutsDriver are not tested.

Documentation is missing for some methods. For example, in the run method of 
DirichletDriver, the description of the parameter 'numModels' is missing; in the 
run method of SpectralKmeansDriver, the descriptions of some arguments are 
missing.

Some methods do not describe their arguments specifically enough. 
For example, in the run method of FuzzyKmeansDriver, the description of the 
argument "m" (fuzzification factor) is missing. Although a wiki link 
regarding "Clustering Analysis" is given, it is not clear enough.
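
As general background on what such a description would need to cover (this is the standard fuzzy c-means formulation, stated here as an assumption rather than taken from Mahout's code): for a point x_i, cluster centers c_1..c_k, and fuzzification factor m > 1, the degree of membership u_ij of x_i in cluster j is usually written as:

```latex
u_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_l \rVert} \right)^{\frac{2}{m-1}}}, \qquad m > 1
```

As m approaches 1 the memberships approach a hard k-means assignment, while larger values of m spread each point's membership more evenly across clusters.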

-

Implementing some new clustering algorithms

Agglomerative hierarchical clustering, which clusters the data points into 
a dendrogram, so that users can choose whatever number of clusters they 
want. (http://en.wikipedia.org/wiki/Hierarchical_clustering)

DBSCAN, a density-based clustering method that can identify 
clusters of arbitrary shape and is useful in spatial clustering. 
(http://en.wikipedia.org/wiki/DBSCAN)
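
To make the density-based idea concrete, here is a minimal in-memory sketch of DBSCAN in plain Java. It is only an illustration under stated assumptions: the class name DbscanSketch, the Euclidean distance, and the brute-force neighbor search are all hypothetical choices, and nothing here is Mahout code or a distributed implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical single-machine DBSCAN sketch (not part of Mahout). */
final class DbscanSketch {
    static final int NOISE = -1;
    private static final int UNVISITED = -2;

    /** Returns one label per point: a cluster id (0, 1, ...) or NOISE. */
    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNVISITED);
        int clusterId = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) continue;
            List<Integer> neighbors = regionQuery(points, i, eps);
            if (neighbors.size() < minPts) {
                labels[i] = NOISE; // may later be relabeled as a border point
                continue;
            }
            // Point i is a core point: start a new cluster and expand it.
            labels[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(neighbors);
            while (!queue.isEmpty()) {
                int j = queue.poll();
                if (labels[j] == NOISE) labels[j] = clusterId; // border point
                if (labels[j] != UNVISITED) continue;
                labels[j] = clusterId;
                List<Integer> reachable = regionQuery(points, j, eps);
                // Only core points propagate the cluster further.
                if (reachable.size() >= minPts) queue.addAll(reachable);
            }
            clusterId++;
        }
        return labels;
    }

    /** Indices of all points within eps of points[index], including itself. */
    private static List<Integer> regionQuery(double[][] points, int index, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            if (distance(points[index], points[j]) <= eps) result.add(j);
        }
        return result;
    }

    /** Euclidean distance between two points of equal dimension. */
    private static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(sum);
    }
}
```

A call such as DbscanSketch.cluster(points, 1.5, 3) needs no cluster count up front; the number of clusters falls out of eps and minPts, which is the property that distinguishes it from the k-means variants listed in the issue.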

-

Providing a new unified interface

Currently, each clustering algorithm has its own implementing class with 
a different interface (i.e., the run methods in different Drivers have different 
argument lists). It would be better to have a unified interface to execute 
all available clustering methods. An example interface is as follows:

Clustering-run(input, output, methodClass, clusteringConfig)

Here, the "methodClass" indicates a specific clustering method, while 
"clusteringConfig" indicates the configuration for this specific clustering 
method.
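
One way the proposed entry point could look in Java is sketched below. Everything in it is a hypothetical illustration of the idea, not existing Mahout API: the names ClusteringConfig, ClusteringMethod, and RoundRobinMethod are invented for this sketch, input and output are in-memory lists rather than HDFS paths, and the toy method only stands in for a real algorithm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-method configuration: a simple bag of named parameters. */
final class ClusteringConfig {
    private final Map<String, String> params = new HashMap<>();

    ClusteringConfig set(String key, String value) {
        params.put(key, value);
        return this;
    }

    int getInt(String key, int defaultValue) {
        String value = params.get(key);
        return value == null ? defaultValue : Integer.parseInt(value);
    }
}

/** Hypothetical contract every clustering method would implement. */
interface ClusteringMethod {
    /** Returns one cluster id per input point, in input order. */
    List<Integer> cluster(List<double[]> points, ClusteringConfig config);
}

/** Toy stand-in for a real algorithm: assigns points round-robin to k clusters. */
final class RoundRobinMethod implements ClusteringMethod {
    @Override
    public List<Integer> cluster(List<double[]> points, ClusteringConfig config) {
        int k = config.getInt("k", 2);
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < points.size(); i++) ids.add(i % k);
        return ids;
    }
}

/** Single entry point in the spirit of run(input, output, methodClass, clusteringConfig). */
final class Clustering {
    static void run(List<double[]> input, List<Integer> output,
                    Class<? extends ClusteringMethod> methodClass,
                    ClusteringConfig config) {
        ClusteringMethod method;
        try {
            // Assumes each method class has a no-argument constructor.
            method = methodClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + methodClass, e);
        }
        output.addAll(method.cluster(input, config));
    }
}
```

With this shape, switching algorithms would mean changing only the methodClass argument and the keys set on the config, for example Clustering.run(points, result, RoundRobinMethod.class, new ClusteringConfig().set("k", "2")). The trade-off of a string-keyed config is that parameter names are not checked at compile time, so each method would need to document its keys.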

=

Could you please let us know what you think about our ideas?





[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-04-28 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644120#comment-13644120
 ] 

Ted Dunning commented on MAHOUT-1177:
-

Before you do much planning you should organize two groups of people:

a) your co-contributors

b) the existing clustering stake-holders.  These include Robin Anil and Jeff 
Eastman because of their authorship of the previous interface as well as Dan 
Filimon and myself because of our authorship of the new streaming k-means 
clustering code.

The first step should be to listen to what people think before you decide on 
what to do.

The things you should deliver first include:

1) a survey of the current interface and the problems with it

2) a survey of the algorithms to be supported in the new unified interface

This should spark considerable discussion on the mailing list.

Only then should you consider making a proposal for changes.





[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-04-28 Thread Yexi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644044#comment-13644044
 ] 

Yexi commented on MAHOUT-1177:
--

Hi, Ted,

What is the next step? Do I need to first give a brief idea about what I plan 
to do for this project?

Regards,
Yexi



[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-04-27 Thread Zhivko Lazarov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643718#comment-13643718
 ] 

Zhivko Lazarov commented on MAHOUT-1177:


Hello,

I am an undergraduate student who is very interested in this project. I am 
already familiar with clustering algorithms, since I've taken classes in Data 
Mining, Pattern Recognition, Machine Learning and Artificial Intelligence. My 
closest experience to this project is a news aggregation project in which I 
implemented different clustering algorithms (K-Means, Hierarchical 
Clustering) and made some improvements of my own for the aggregated news 
database (NoSQL, MongoDB). I competed in ACM ICPC 2011 and 2012 and have 
been a laboratory assistant for Algorithms and Data Structures at my faculty. 
I would like to be involved in this project, but as part of GSOC if that is 
possible.

Regards, Zivko.



[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-04-26 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642986#comment-13642986
 ] 

Ted Dunning commented on MAHOUT-1177:
-

It is a grand idea for you guys to work together.




[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-04-26 Thread Yu Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642881#comment-13642881
 ] 

Yu Lee commented on MAHOUT-1177:


Hello Guys,

I am also a graduate student with research interests in data mining and big 
data analytics. 

I am familiar with programming in Hadoop/Mahout to address large-scale data 
analysis problems. I think I can collaborate with Yexi on this project.

The only issue is that I cannot apply for CPT for GSOC either.

If it is OK for me to work with Yexi, what would be our next step?

Looking forward to your reply. Thank you!

Best,




[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-04-25 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641904#comment-13641904
 ] 

Ted Dunning commented on MAHOUT-1177:
-

Yexi,

We would love to have you contribute with or without GSOC participation.



[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-04-25 Thread Yexi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641893#comment-13641893
 ] 

Yexi commented on MAHOUT-1177:
--

Hi, 

I am a graduate student majoring in data mining, and I am very interested in 
this project.
I have some experience with distributed data mining using Hadoop, so I 
believe I can handle this project.

In order to work on this project, is it necessary for me to join the GSOC 
program?
GSOC requires international students studying in the US to apply for CPT, 
and I have almost used up my CPT on previous internships, so I am not able 
to apply for CPT for GSOC.



[jira] [Commented] (MAHOUT-1177) GSOC 2013: Reform and simplify the clustering APIs

2013-04-19 Thread Franklin Oliveira da Veiga (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636363#comment-13636363
 ] 

Franklin Oliveira da Veiga commented on MAHOUT-1177:


Hi all,
I'm a student from Brazil. If I want to participate in GSoC with this project, 
how do I find a mentor, and where can I find examples of the APIs being used?

Thanks.
