Hey Ram,

I'm not speaking to Tarek's package specifically, but to the spirit of
MLlib.  There are a number of methods/algorithms for PCA, and I'm not sure
by what criterion the current one is considered 'standard'.

It is rare to find ANY machine learning algorithm that is 'clearly better'
than any other.  They are all tools; they have their place and time.  I
agree that it makes sense to field new algorithms as packages and then
integrate them into MLlib once they are 'proven' (in terms of
stability/performance/whether anyone cares).  That being said, if MLlib
takes the stance that 'what we have is good enough unless something is
*clearly* better', then it will never grow into a suite with the depth and
richness of sklearn.  From a practitioner's standpoint, it's nice to have
everything I could ever want ready in an 'off-the-shelf' form.
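As a concrete illustration of the 'off-the-shelf' point, here is a minimal
sketch of using the PCA currently in MLlib (this assumes the Spark 1.x
mllib RowMatrix API running in local mode; the sample data is made up for
illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object PcaSketch {
  // Computes the top-2 principal components of a toy matrix and
  // returns the dimensions of the resulting components matrix.
  def run(): (Int, Int) = {
    val conf = new SparkConf().setAppName("pca-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy data: three rows in R^3 (made-up numbers, illustration only).
      val rows = sc.parallelize(Seq(
        Vectors.dense(1.0, 2.0, 3.0),
        Vectors.dense(2.0, 4.1, 6.0),
        Vectors.dense(3.1, 6.0, 9.0)
      ))
      val mat = new RowMatrix(rows)
      // Top-2 principal components, returned as a local (numCols x 2) matrix.
      val pc = mat.computePrincipalComponents(2)
      // Project the distributed rows into the 2-D principal-component space.
      val projected = mat.multiply(pc)
      (pc.numRows, pc.numCols)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (r, c) = run()
    println(s"principal components: $r x $c")
  }
}
```

The point being: a practitioner gets PCA in a few lines without caring
which decomposition runs underneath, which is exactly why depth of
coverage matters.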

'Better than the existing implementation in a large number of use cases'
shouldn't be the criterion for selecting what to include in MLlib.  The
important question should be: 'Are you willing to take on responsibility
for maintaining this, because you may be the only person on earth who
understands both the mechanics AND how to code it?'  Obviously we don't
want any random junk algorithm included.  But trying to say 'this way of
doing PCA is better than that way in a large class of cases' is like trying
to say 'geometry is more important than calculus in a large class of
cases': maybe it's true, but geometry won't help you if you are in a case
where you need calculus.

This all relies on the assumption that MLlib is destined to be a rich data
science/machine learning package.  It may be that the goal is to make the
project as lightweight and parsimonious as possible; if so, excuse me for
speaking out of turn.


On Tue, May 19, 2015 at 10:41 AM, Ram Sriharsha <sriharsha....@gmail.com>
wrote:

> Hi Trevor, Tarek
>
> You can make non-standard algorithms (PCA or otherwise) available to users
> of Spark as Spark Packages:
> http://spark-packages.org
> https://databricks.com/blog/2014/12/22/announcing-spark-packages.html
>
> With the availability of Spark Packages, adding powerful experimental /
> alternative machine learning algorithms to the pipeline has never been
> easier. I would suggest that route in scenarios where a machine learning
> algorithm is not clearly better, in the common cases, than an existing
> implementation in MLlib.
>
> If your algorithm is better than the existing PCA implementation for a
> large class of use cases, then we should open a JIRA and discuss the
> relative strengths/weaknesses (perhaps with some benchmarks) so we can
> better understand whether it makes sense to switch out the existing PCA
> implementation and make yours the default.
>
> Ram
>
> On Tue, May 19, 2015 at 6:56 AM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
>
>>  There are most likely advantages and disadvantages to Tarek's algorithm
>> against the current implementation, and different scenarios where each is
>> more appropriate.
>>
>> Would we not offer multiple PCA algorithms and let the user choose?
>>
>> Trevor
>>
>> Trevor Grant
>> Data Scientist
>>
>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>>
>>
>> On Mon, May 18, 2015 at 4:18 PM, Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> Hi Tarek,
>>>
>>> Thanks for your interest & for checking the guidelines first!  Two
>>> points:
>>>
>>> Algorithm: PCA is of course a critical algorithm.  The main question is
>>> how your algorithm/implementation differs from the current PCA.  If it's
>>> different and potentially better, I'd recommend opening a JIRA to
>>> explain and discuss it.
>>>
>>> Java/Scala: We really do require that algorithms be in Scala, for the
>>> sake of maintainability.  The conversion should be doable if you're
>>> willing, since Scala is a pretty friendly language.  If you create the
>>> JIRA, you could also ask for help there to see if someone can
>>> collaborate with you to convert the code to Scala.
>>>
>>> Thanks!
>>> Joseph
>>>
>>> On Mon, May 18, 2015 at 3:13 AM, Tarek Elgamal <tarek.elga...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to contribute an algorithm to the MLlib project. I have
>>>> implemented a scalable PCA algorithm on Spark. It is scalable for both
>>>> tall and fat matrices, and the paper describing it has been accepted
>>>> for publication at the SIGMOD 2015 conference. I looked at the
>>>> guidelines in the following link:
>>>>
>>>>
>>>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>>>
>>>> I believe that most of the guidelines apply in my case; however, the
>>>> code is written in Java, and it was not clear from the guidelines
>>>> whether the MLlib project accepts Java code.
>>>> My algorithm can be found under this repository:
>>>> https://github.com/Qatar-Computing-Research-Institute/sPCA
>>>>
>>>> Any help on how to make it suitable for the MLlib project would be
>>>> greatly appreciated.
>>>>
>>>> Best Regards,
>>>> Tarek Elgamal
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
