zhengruifeng created SPARK-30286:
------------------------------------

             Summary: Some thoughts on new features for MLLIB
                 Key: SPARK-30286
                 URL: https://issues.apache.org/jira/browse/SPARK-30286
             Project: Spark
          Issue Type: Wish
          Components: ML
    Affects Versions: 3.0.0
            Reporter: zhengruifeng


Some thoughts on new features for ML:


1, clustering: *mini-batch KMeans*: KMeans is probably one of the most widely used 
algorithms in MLLIB. Mini-batch KMeans is much faster than KMeans, with [comparable 
results|https://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#sphx-glr-auto-examples-cluster-plot-mini-batch-kmeans-py];
 in SKLearn it is a separate estimator, while in MLLIB we could add it as one or two 
params on the existing KMeans.
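
Below is a minimal, standalone sketch of the mini-batch update rule, just to illustrate 
the idea; the object and method names are assumptions for this ticket, not Spark ML API.

{code:scala}
import scala.util.Random

// Standalone illustration of the mini-batch KMeans update; not Spark code.
object MiniBatchKMeansSketch {
  type Point = Array[Double]

  private def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def run(data: Array[Point], k: Int, batchSize: Int, iters: Int,
          rng: Random = new Random(0)): Array[Point] = {
    // start from k randomly chosen points
    val centers = rng.shuffle(data.toSeq).take(k).map(_.clone).toArray
    val counts = Array.fill(k)(0L)

    for (_ <- 0 until iters) {
      // sample a mini-batch instead of scanning the full dataset
      val batch = Array.fill(batchSize)(data(rng.nextInt(data.length)))
      for (x <- batch) {
        val c = centers.indices.minBy(i => sqDist(x, centers(i)))
        counts(c) += 1
        val eta = 1.0 / counts(c)            // per-center learning rate
        for (d <- centers(c).indices)        // move the center toward the sample
          centers(c)(d) = (1 - eta) * centers(c)(d) + eta * x(d)
      }
    }
    centers
  }
}
{code}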

2, classification & regression:
 2.1 ExtraTrees (Extremely Randomized Trees): an even more randomized variant of 
tree ensembles; it has lower variance than RandomForest, and it seems that 
ExtraTrees is used more and more in online contests. It looks like it could be 
implemented easily on top of the existing ensemble implementations;
 2.2 Categorical Naive Bayes: a new NB variant just released in SKLearn 0.22; it 
should be easy to implement as a new modelType in MLLIB's NB;
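
For 2.2, a hypothetical usage sketch of what the new modelType value could look like 
on the existing NaiveBayes estimator; the "categorical" value below does not exist 
today and is only the proposal.

{code:scala}
import org.apache.spark.ml.classification.NaiveBayes

// Existing estimator; in 3.0 modelType already supports "multinomial",
// "bernoulli", "complement" and "gaussian".
val nb = new NaiveBayes()
  .setSmoothing(1.0)
  .setModelType("multinomial")
// proposed new value for CategoricalNB (does not exist yet):
// nb.setModelType("categorical")
{code}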

3, features:
 3.1 *vector validator*: a new UnaryTransformer that checks whether a vector 
column meets some requirements, like non-NaN, non-negative, positive, all 
values binary/int, all vectors dense/sparse, expected numFeatures. Currently some 
implementations deal with invalid values, but most do not. For example, we first 
scale the input with MinMaxScaler; however, MinMaxScaler ignores NaN in training 
and keeps the NaN in transformation, then the scaled dataset is fed into 
LinearRegression, and at the end I obtain a LinearRegressionModel with NaN 
coefficients. In the whole pipeline, no exception is thrown. With this 
validator, the pipeline can fail early (see the sketch after this list);
 3.2 inverse transform for models/transformers: we could add a new bool param 
HasInverseTransform;
 3.3 non-linear transformations: quantile transforms and power transforms 
(including the famous Box-Cox method), which map data from any distribution to be 
as close as possible to another distribution (mostly Gaussian); _I am working on 
this, since I recently needed this feature_;
 3.4 similarity search: in my experience, Approximate Nearest Neighbors based 
on KMeans provides more accurate results than LSH; can we follow some famous 
libraries like Facebook FAISS to implement a new ANN?
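
For 3.1, a minimal sketch of what such a validator could look like as a UnaryTransformer, 
here only checking for NaN and passing the vector through unchanged; the class name and 
checks are assumptions, not an existing API.

{code:scala}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.DataType

// Pass-through transformer that fails fast on NaN values, so a pipeline
// breaks at this stage instead of silently producing a NaN model downstream.
class VectorValidator(override val uid: String)
  extends UnaryTransformer[Vector, Vector, VectorValidator] {

  def this() = this(Identifiable.randomUID("vectorValidator"))

  override protected def createTransformFunc: Vector => Vector = { v =>
    require(!v.toArray.exists(_.isNaN), s"Vector contains NaN: $v")
    v
  }

  override protected def outputDataType: DataType = SQLDataTypes.VectorType
}
{code}

It would be used like any other stage, e.g. 
new VectorValidator().setInputCol("features").setOutputCol("checkedFeatures") placed 
in the Pipeline right after the scaler.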

4, warm start: initialize the model from a previous model; ONLY the 
coefficients are used (the params of the previous model are ignored). 
Maybe a new string param HasInitialModelPath can be added at first.
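
A hypothetical usage sketch of that param on LogisticRegression; setInitialModelPath 
does not exist today, it is only the proposed API.

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(20)
  .setRegParam(0.01)
// proposed (does not exist yet): read ONLY the coefficients of a previously
// saved model at this path as the starting point; that model's own params
// are ignored.
// lr.setInitialModelPath("hdfs:///models/lr-previous-run")
{code}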

5, linalg: *Vectors should support more methods, like* *iterator,* *activeIterator, 
nonZeroIterator*, so that we can implement methods based on Iterator[(Int, 
Double)] instead of ml.Vector/mllib.Vector, and reuse them on both sides without 
vector conversions.
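
A rough sketch of the idea: the numeric kernel is written once against 
Iterator[(Int, Double)], and thin adapters feed it from either vector type. In a real 
implementation the iterators would live on the Vector classes themselves and walk 
sparse storage directly instead of going through toArray; all names here are assumptions.

{code:scala}
import org.apache.spark.ml.{linalg => newlinalg}
import org.apache.spark.mllib.{linalg => oldlinalg}

object VectorIteratorsSketch {
  // shared kernel, independent of the concrete vector class
  def sumOfSquares(it: Iterator[(Int, Double)]): Double =
    it.map { case (_, v) => v * v }.sum

  // naive adapters for illustration only; a real impl would iterate the
  // active entries of sparse vectors without densifying
  def nonZeroIterator(v: newlinalg.Vector): Iterator[(Int, Double)] =
    v.toArray.iterator.zipWithIndex.collect { case (x, i) if x != 0.0 => (i, x) }

  def nonZeroIterator(v: oldlinalg.Vector): Iterator[(Int, Double)] =
    v.toArray.iterator.zipWithIndex.collect { case (x, i) if x != 0.0 => (i, x) }
}
{code}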

6, parameter server: there have been several tickets for it. It should be super 
useful and would provide efficient gradient-based solvers for many algorithms. I 
also know there have been some efforts to implement it atop Spark, like 
[Tencent-Angel|https://github.com/Angel-ML/angel] & Glint.


