[ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368109#comment-16368109
 ] 

Seth Hendrickson commented on SPARK-23437:
------------------------------------------

TBH, this seems like a pretty reasonable request. While I agree we do seem to 
tell people that the "standard" practice is to implement as a third party 
package and then integrate later, I don't see this happen in practice. I don't 
know that we've even validated that the "implement as a third-party package, 
then integrate into Spark later" approach really works. Perhaps an even stronger reason 
for resisting new algorithms is just lack of reviewer/developer support on 
Spark ML. It's hard to predict if there will be anyone to review the PR within 
a reasonable amount of time, even if the code is well-designed. AFAIK, we 
haven't added any major algos since GeneralizedLinearRegression, which has to 
have been a couple years ago. 

That said, I think this is something to at least consider. We can start by 
discussing what algorithms exist, and why we'd choose a particular one. Strong 
arguments for why we need GPs in Spark ML are also beneficial. The fact that 
there isn't a non-parametric regression algo in Spark has some merit, but we 
don't write new algorithms just for the sake of filling in gaps - there needs 
to be user demand (which, unfortunately, is often hard to prove). It also helps 
to point to a package that already implements the algo you're proposing, but, 
for example, I don't believe scikit-learn implements the linear-time version, so 
we can't really leverage their experience. Providing more information on any/all 
of these categories will help make a stronger case, and I do think GPs can be a 
useful addition. Thanks for leading the discussion!

> [ML] Distributed Gaussian Process Regression for MLlib
> ------------------------------------------------------
>
>                 Key: SPARK-23437
>                 URL: https://issues.apache.org/jira/browse/SPARK-23437
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>    Affects Versions: 2.2.1
>            Reporter: Valeriy Avanesov
>            Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more recent 
> techniques (Sparse GP) reduce this to linear complexity. The field 
> continues to attract the interest of researchers: several papers devoted to 
> GP were presented at NIPS 2017. 
> Unfortunately, the non-parametric regression techniques that come with MLlib 
> are restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of the so-called robust Bayesian Committee Machine proposed and 
> investigated in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> Machine Learning - Volume 37_ (ICML'15), Francis Bach and David Blei (Eds.), 
> Vol. 37. JMLR.org, 1481-1490.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
