Hi Aditya,

Welcome to the MADlib community!

Gaussian mixture models are extremely useful and we would heartily welcome
a contribution for them. The SQLEM paper might be oversimplifying the
capabilities of the database (e.g. its assumption that there is no array
type is unnecessary for PostgreSQL). You could speed things up (both dev
time and execution time) by writing some of the functions in C++. The
existing k-means module is an example of how clustering is implemented.
IMO, assuming the same covariance matrix for all clusters is reasonable to
start with. We could extend the capabilities after the initial
implementation is complete.
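For concreteness, a rough NumPy sketch of one EM iteration under that
shared-covariance assumption might look like the following. This is just my
illustration (function and variable names are mine, not any MADlib API); a
real MADlib module would push the aggregations into SQL/C++ rather than do
them in a single Python function.

import numpy as np

def em_step_shared_cov(X, weights, means, cov):
    """One EM iteration for a GMM with a single covariance shared by all clusters.
    X: (n, d) data; weights: (k,) mixture weights; means: (k, d); cov: (d, d)."""
    n, d = X.shape
    k = means.shape[0]

    # E-step: responsibilities r[i, j] = P(cluster j | x_i), computed in log space
    cov_inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    log_r = np.empty((n, k))
    for j in range(k):
        diff = X - means[j]                                   # (n, d)
        mahal = np.einsum('nd,de,ne->n', diff, cov_inv, diff) # squared Mahalanobis distance
        log_r[:, j] = np.log(weights[j]) - 0.5 * (mahal + logdet + d * np.log(2 * np.pi))
    log_r -= log_r.max(axis=1, keepdims=True)                 # stabilize before exp
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: update mixture weights, means, and the single pooled covariance
    nk = r.sum(axis=0)                                        # effective cluster sizes
    new_weights = nk / n
    new_means = (r.T @ X) / nk[:, None]
    new_cov = np.zeros((d, d))
    for j in range(k):
        diff = X - new_means[j]
        new_cov += (r[:, j, None] * diff).T @ diff
    new_cov /= n                                              # pooled over all clusters
    return new_weights, new_means, new_cov

The shared-covariance assumption shows up only in the last few lines: the
scatter matrices of all clusters are summed into one pooled estimate instead
of being kept per cluster, which is also what makes it a natural later
extension point.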

There was some work started a long time ago that built perceptrons using
the convex framework (link <https://github.com/iyerr3/madlib/tree/mlp>).
There are still some bugs in that code, since the trained network isn't
converging. You could start there or build a new module - either way, an
MLP module is frequently requested by the data science community.
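On the distributed perceptron question in your mail: the core idea in the
McDonald et al. paper, as I read it, is iterative parameter mixing - each
worker runs the usual perceptron updates on its own data shard and the
weight vectors are averaged between epochs. A toy sketch (illustrative only,
not a proposal for the MADlib API) would be:

import numpy as np

def perceptron_epoch(X, y, w):
    """One pass of the classic perceptron update on a single shard (labels in {-1, +1})."""
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:      # misclassified (or on the boundary)
            w = w + yi * xi
    return w

def distributed_perceptron(shards, d, epochs=10):
    """shards: list of (X, y) pairs, one per worker; d: feature dimension."""
    w = np.zeros(d)
    for _ in range(epochs):
        # each worker starts from the current mixed weights and trains locally
        local = [perceptron_epoch(X, y, w.copy()) for X, y in shards]
        w = np.mean(local, axis=0)  # uniform parameter mixing across workers
    return w

In MADlib the per-shard pass would map naturally onto a segment-local
transition function with the mixing done in the merge/final step, which is
roughly how the convex framework structures its iterations.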

I would suggest starting with Gaussian mixtures and then moving to
perceptrons once the GMM work is completed.

Feel free to ask questions on this forum. Looking forward to collaborating
with you.

Best,
Rahul

On Thu, Mar 17, 2016 at 2:08 PM, Aditya Nain <adityana...@gmail.com> wrote:

> Hi,
>
> My name is Aditya Nain, and I am a graduate student at University of
> Florida.
> I have been learning MADLib for a while and want to contribute to MADLib.
> I went through some of the open stories in JIRA and started working on
> MADLIB-410  :
>
> https://issues.apache.org/jira/browse/MADLIB-410?jql=project%20%3D%20MADLIB
>
> which is about implementing Gaussian Mixture Model using Expectation
> Maximization (EM) algorithm.
>
> I came across the following paper while searching for distributed EM
> algorithm which can be implemented in MADLib.
>
> Carlos Ordonez, Paul Cereghini "SQLEM: fast clustering in SQL using the EM
> algorithm" ACM SIGMOD Record, Volume 29 Issue 2, June 2000 Pages 559-570.
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.7564
>
> I thought of implementing the approach discussed in the paper, but the
> paper makes an assumption that the covariance matrix is the same for all
> the clusters (i.e., the covariance matrix is the same for all the Gaussian
> distributions). So, I wanted to know the opinion of the community if it's
> fine to go with the assumption made in the paper and implement it in
> MADLib.
>
> Also, currently MADLib doesn't have an implementation of a perceptron, nor
> did I find any open story related to it in JIRA. I came across the
> following paper, which talks about a distributed algorithm for perceptron :
>
> Ryan McDonald, Keith Hall, Gideon Mann "Distributed training strategies for
> the structured perceptron"
> http://dl.acm.org/citation.cfm?id=1858068
>
> Would it be useful to have a distributed implementation of a perceptron in
> MADlib?
>
> Thanks,
> Aditya
>