There are several high bars to getting a new algorithm adopted.

* The MLlib committers/shepherds need to deem it widely useful to the community. Algorithms offered by larger companies, after having demonstrated usefulness at scale for use cases likely to be encountered by many other companies, stand a better chance.
* There is quite limited bandwidth for considering new algorithms: there has been a dearth of new ones accepted since early 2015, so prioritization is a challenge.
* The code must meet high quality standards, especially with respect to testability, maintainability, computational performance, and scalability.
* The chosen algorithms and options should be well documented and include comparisons/tradeoffs with the state of the art described in relevant papers. These questions will typically be asked during design/code reviews, e.g. "did you consider the approach shown *here*?"
* There is also luck and timing involved. The review process might start in a given month, but then reviewers become busy or higher priorities intervene, and it is unclear when the reviewing will continue.
* Once all of the above is complete, there are still intricacies in integrating with a particular Spark release.

On Mon, Aug 5, 2019 at 05:58, chagas <cha...@gta.ufrj.br> wrote:

> Hi,
>
> After searching the machine learning library for streaming algorithms, I
> found two that fit the criteria: Streaming Linear Regression
> (https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression)
> and Streaming K-Means
> (https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means).
>
> However, both use the RDD-based API MLlib instead of the DataFrame-based
> API ML; are there any plans for bringing them both to ML?
>
> Also, is there any technical reason why there are so few incremental
> algorithms in the machine learning library? There's only 1 algorithm for
> regression and clustering each, with nothing for classification,
> dimensionality reduction or feature extraction.
>
> If there is a reason, how were those two algorithms implemented? If
> there isn't, what is the general consensus on adding new online machine
> learning algorithms?
>
> Regards,
> Lucas Chagas