Hi Trevor

I'm attaching the MLlib contribution guideline here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines

It speaks to widely known and accepted algorithms, but not to whether an 
algorithm has to be better than another in every scenario, etc.

I think the guideline explains what a good contribution to the core library 
should look like better than I initially attempted to!

Sent from my iPhone

> On May 20, 2015, at 9:31 AM, Ram Sriharsha <sriharsha....@gmail.com> wrote:
> 
> Hi Trevor
> 
> Good point; I didn't mean that an algorithm has to be clearly better than 
> another in every scenario to be included in MLlib. However, even if someone 
> is willing to be the maintainer of a piece of code, it does not make sense 
> to accept every possible algorithm into the core library.
> 
> That said, specific algorithms should be discussed in JIRA: as you point 
> out, there is no clear-cut way to decide which algorithms to include and 
> which not to. Mature algorithms that serve a wide variety of scenarios are 
> usually easier to argue for, but nothing prevents anyone from opening a 
> ticket to discuss any specific machine learning algorithm.
> 
> My suggestion was simply that, for the purpose of making experimental or 
> newer algorithms available to Spark users, they don't necessarily have to 
> be in the core library. Spark packages are good enough in this respect.
> 
> Isn't it better for newer algorithms to take this route and prove 
> themselves before we bring them into the core library, especially given 
> that the barrier to using Spark packages is very low?
> 
> Ram
> 
> 
> 
>> On Wed, May 20, 2015 at 9:05 AM, Trevor Grant <trevor.d.gr...@gmail.com> 
>> wrote:
>> Hey Ram,
>> 
>> I'm not speaking to Tarek's package specifically but to the spirit of 
>> MLlib.  There are a number of methods/algorithms for PCA, and I'm not sure 
>> by what criterion the current one is considered 'standard'.
>> 
>> It is rare to find ANY machine learning algo that is 'clearly better' than 
>> any other.  They are all tools; each has its place and time.  I agree that 
>> it makes sense to field new algorithms as packages and then integrate them 
>> into MLlib once they are 'proven' (in terms of stability, performance, and 
>> whether anyone cares).  That being said, if MLlib takes the stance that 
>> 'what we have is good enough unless something is clearly better', then it 
>> will never grow into a suite with the depth and richness of sklearn.  From 
>> a practitioner's standpoint, it's nice to have everything I could ever 
>> want ready in 'off-the-shelf' form.
>> 
>> 'Better than the existing implementation for a large number of use cases' 
>> shouldn't be the criterion when selecting what to include in MLlib.  The 
>> important question should be, 'Are you willing to take on responsibility 
>> for maintaining this, because you may be the only person on earth who 
>> understands the mechanics AND how to code it?'  Obviously we don't want 
>> any random junk algo included.  But trying to say 'this way of doing PCA 
>> is better than that way in a large class of cases' is like trying to say 
>> 'geometry is more important than calculus in a large class of cases'; 
>> maybe it's true, but geometry won't help you if you are in a case where 
>> you need calculus.
>> 
>> This all relies on the assumption that MLlib is destined to be a rich data 
>> science/machine learning package.  It may be that the goal is to make the 
>> project as lightweight and parsimonious as possible; if so, excuse me for 
>> speaking out of turn.
>>   
>> 
>>> On Tue, May 19, 2015 at 10:41 AM, Ram Sriharsha <sriharsha....@gmail.com> 
>>> wrote:
>>> Hi Trevor, Tarek
>>> 
>>> You can make non-standard algorithms (PCA or otherwise) available to 
>>> users of Spark as Spark Packages:
>>> http://spark-packages.org
>>> https://databricks.com/blog/2014/12/22/announcing-spark-packages.html
>>> 
>>> With the availability of Spark Packages, adding powerful experimental or 
>>> alternative machine learning algorithms to the pipeline has never been 
>>> easier. I would suggest that route in scenarios where a machine learning 
>>> algorithm is not clearly better in the common cases than an existing 
>>> implementation in MLlib.
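>>> 
>>> For example, a user can pull a package straight into spark-shell with no 
>>> build changes at all (the package coordinates and import below are 
>>> hypothetical, just to illustrate the mechanics):
>>> 
>>>   $SPARK_HOME/bin/spark-shell --packages com.example:spca_2.10:0.1.0
>>> 
>>>   scala> import com.example.spca.SPCA   // hypothetical entry point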
>>> 
>>> If your algorithm is better than the existing PCA implementation for a 
>>> large class of use cases, then we should open a JIRA and discuss the 
>>> relative strengths/weaknesses (perhaps with some benchmarks) so we can 
>>> better understand whether it makes sense to switch out the existing PCA 
>>> implementation and make yours the default.
>>> 
>>> Ram
>>> 
>>>> On Tue, May 19, 2015 at 6:56 AM, Trevor Grant <trevor.d.gr...@gmail.com> 
>>>> wrote:
>>>> There are most likely advantages and disadvantages to Tarek's algorithm 
>>>> compared with the current implementation, and different scenarios where 
>>>> each is more appropriate.
>>>> 
>>>> Would we not offer multiple PCA algorithms and let the user choose?
>>>> 
>>>> Trevor
>>>> 
>>>> Trevor Grant
>>>> Data Scientist
>>>> 
>>>> "Fortunate is he, who is able to know the causes of things."  -Virgil
>>>> 
>>>> 
>>>>> On Mon, May 18, 2015 at 4:18 PM, Joseph Bradley <jos...@databricks.com> 
>>>>> wrote:
>>>>> Hi Tarek,
>>>>> 
>>>>> Thanks for your interest & for checking the guidelines first!  On 2 
>>>>> points:
>>>>> 
>>>>> Algorithm: PCA is of course a critical algorithm.  The main question is 
>>>>> how your algorithm/implementation differs from the current PCA.  If 
>>>>> it's different and potentially better, I'd recommend opening a JIRA to 
>>>>> explain & discuss it.
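>>>>> 
>>>>> For reference, the current PCA lives on RowMatrix.  Here is a minimal 
>>>>> sketch of how it's used today from spark-shell (sc is predefined there; 
>>>>> the toy data is only for illustration):
>>>>> 
>>>>>   import org.apache.spark.mllib.linalg.Vectors
>>>>>   import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>>>> 
>>>>>   val rows = sc.parallelize(Seq(
>>>>>     Vectors.dense(1.0, 2.0, 3.0),
>>>>>     Vectors.dense(4.0, 5.0, 6.0),
>>>>>     Vectors.dense(7.0, 8.0, 9.0)))
>>>>>   val mat = new RowMatrix(rows)
>>>>>   // top-2 principal components; if I remember right, this builds the 
>>>>>   // covariance matrix on the driver, which is what limits very wide 
>>>>>   // ("fat") matrices
>>>>>   val pc = mat.computePrincipalComponents(2)
>>>>>   val projected = mat.multiply(pc)  // project rows onto the components
>>>>> 
>>>>> Explaining in the JIRA where your implementation departs from this 
>>>>> would be a great starting point.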
>>>>> 
>>>>> Java/Scala: We really do require that algorithms be in Scala, for the 
>>>>> sake of maintainability.  The conversion should be doable if you're 
>>>>> willing, since Scala is a pretty friendly language.  If you create the 
>>>>> JIRA, you could also ask for help there to see if someone can 
>>>>> collaborate with you to convert the code to Scala.
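>>>>> 
>>>>> Much of the translation is mechanical.  As a toy illustration (not from 
>>>>> your code), a Java loop like 
>>>>> "for (int i = 0; i < a.length; i++) s += a[i] * b[i];" becomes:
>>>>> 
>>>>>   // dot product of two arrays, written the idiomatic Scala way
>>>>>   def dot(a: Array[Double], b: Array[Double]): Double =
>>>>>     a.zip(b).map { case (x, y) => x * y }.sum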
>>>>> 
>>>>> Thanks!
>>>>> Joseph
>>>>> 
>>>>>> On Mon, May 18, 2015 at 3:13 AM, Tarek Elgamal <tarek.elga...@gmail.com> 
>>>>>> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I would like to contribute an algorithm to the MLlib project. I have 
>>>>>> implemented a scalable PCA algorithm on Spark. It is scalable for both 
>>>>>> tall and fat matrices, and the paper describing it has been accepted 
>>>>>> for publication at the SIGMOD 2015 conference. I looked at the 
>>>>>> guidelines at the following link:
>>>>>> 
>>>>>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>>>>> 
>>>>>> I believe that most of the guidelines apply in my case; however, the 
>>>>>> code is written in Java, and it was not clear from the guidelines 
>>>>>> whether the MLlib project accepts Java code or not.
>>>>>> My algorithm can be found under this repository:
>>>>>> https://github.com/Qatar-Computing-Research-Institute/sPCA
>>>>>> 
>>>>>> Any help on how to make it suitable for the MLlib project would be 
>>>>>> greatly appreciated.
>>>>>> 
>>>>>> Best Regards,
>>>>>> Tarek Elgamal
> 
