Thank you Ram and Joseph.
I am also hoping to contribute to MLlib once my Scala gets up to snuff; this
is the guidance I needed for how to proceed when ready.
Best wishes,
Trevor
On Wed, May 20, 2015 at 1:55 PM, Joseph Bradley jos...@databricks.com
wrote:
Hi Trevor,
I may be repeating what
Hey Ram,
I'm not speaking to Tarek's package specifically but to the spirit of
MLlib. There are a number of methods/algorithms for PCA; I'm not sure by
what criterion the current one is considered 'standard'.
It is rare to find ANY machine learning algo that is 'clearly better' than
any other.
Hi Trevor
Good point, I didn't mean that some algorithm has to be clearly better than
another in every scenario to be included in MLlib. However, even if someone
is willing to be the maintainer of a piece of code, it does not make sense
to accept every possible algorithm into the core library.
Hi Trevor
I'm attaching the MLlib contribution guidelines here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
It speaks to widely known and accepted algorithms but not to whether an
algorithm has to be better than
Hi Trevor,
I may be repeating what Ram said, but to second it, a few points:
We do want MLlib to become an extensive and rich ML library; as you said,
scikit-learn is a great example. To make that happen, we of course need to
include important algorithms. Important is hazy, but roughly means
There are most likely advantages and disadvantages to Tarek's algorithm
against the current implementation, and different scenarios where each is
more appropriate.
Would we not offer multiple PCA algorithms and let the user choose?
Trevor
Trevor Grant
Data Scientist
*Fortunate is he, who is
Hi Tarek,
Thanks for your interest and for checking the guidelines first! On 2 points:
Algorithm: PCA is of course a critical algorithm. The main question is how
your algorithm/implementation differs from the current PCA. If it's
different and potentially better, I'd recommend opening up a JIRA
Hi,
I would like to contribute an algorithm to the MLlib project. I have
implemented a scalable PCA algorithm on Spark. It is scalable for both tall
and fat matrices, and the paper describing it has been accepted for publication
at the SIGMOD 2015 conference. I looked at the guidelines at the following link: