Liangcai Li created SPARK-29327:
-----------------------------------

             Summary: Support specifying features via multiple columns in Predictor and PredictionModel
                 Key: SPARK-29327
                 URL: https://issues.apache.org/jira/browse/SPARK-29327
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 3.0.0
            Reporter: Liangcai Li


A classification or regression task almost always involves more than one 
feature. However, the current API for specifying the features column in the 
Predictor of Spark MLlib supports only a single column, so users must first 
assemble their feature columns into an "org.apache.spark.ml.linalg.Vector" 
before fitting a Spark ML pipeline.
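
For reference, the current workflow described above looks roughly like the following. This is a minimal sketch: the column names "f1", "f2" and "label", the toy data, and the choice of LogisticRegression are illustrative assumptions and are not taken from this issue.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object AssembleThenFit {
  // Packs the separate feature columns into one Vector column named "features".
  // The column names here are illustrative assumptions.
  def assemble(df: DataFrame): DataFrame =
    new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
      .transform(df)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("assemble-then-fit")
      .getOrCreate()
    import spark.implicits._

    // Toy dataset with the features held in two separate columns.
    val df = Seq(
      (1.0, 0.5, 1.2),
      (0.0, 1.5, 0.3),
      (1.0, 0.7, 1.1),
      (0.0, 1.4, 0.2)
    ).toDF("label", "f1", "f2")

    // The Predictor accepts exactly one features column today.
    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")

    val assembled = assemble(df)
    val model = lr.fit(assembled)
    model.transform(assembled).select("label", "prediction").show()

    spark.stop()
  }
}
```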

This improvement lets users specify the features columns directly, without 
vectorization. To support this, we can introduce two new APIs in both 
"Predictor" and "PredictionModel", plus a new parameter named "featuresCols" 
that stores the names of the features columns as an Array. (A PR is ready here: 
[https://github.com/apache/spark/pull/25983])

*APIs:*
{{def setFeaturesCol(value: Array[String]): M = ...}}
{{protected def isSupportMultiColumnsForFeatures: Boolean = false}}

*Parameter:*
{{final val featuresCols: StringArrayParam = new StringArrayParam(this, "featuresCols", ...)}}

ML implementations can then read the features column names from the new 
"featuresCols" parameter and consume the raw feature values directly from the 
separate columns of the dataset.
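
The following Spark-free sketch illustrates one way the opt-in flag and the new setter could interact. It is not the code in the PR: the classes SketchPredictor, SingleColumnOnly and MultiColumnAware, the helper resolvedFeatureColumns, and the choice to reject the array setter when the flag is false are all hypothetical and exist only for illustration.

```scala
// Hypothetical stand-in for Predictor; none of these types are Spark classes.
abstract class SketchPredictor {
  // Mirrors the existing single-column parameter.
  protected var featuresCol: String = "features"
  // Mirrors the proposed "featuresCols" parameter.
  protected var featuresCols: Array[String] = Array.empty

  // Opt-in flag from the proposal; false unless an implementation overrides it.
  protected def isSupportMultiColumnsForFeatures: Boolean = false

  def setFeaturesCol(value: String): this.type = {
    featuresCol = value
    this
  }

  // Overload from the proposal. Rejecting the call when the flag is false
  // is an assumption of this sketch, not confirmed behavior.
  def setFeaturesCol(value: Array[String]): this.type = {
    require(isSupportMultiColumnsForFeatures,
      s"${getClass.getSimpleName} does not support multiple features columns")
    featuresCols = value
    this
  }

  // Hypothetical helper: the columns an implementation would read from.
  def resolvedFeatureColumns: Seq[String] =
    if (featuresCols.nonEmpty) featuresCols.toSeq else Seq(featuresCol)
}

// Keeps the default flag, so only the single-column setter is usable.
class SingleColumnOnly extends SketchPredictor

// Opts in to reading features from several columns.
class MultiColumnAware extends SketchPredictor {
  override protected def isSupportMultiColumnsForFeatures: Boolean = true
}

object SketchDemo {
  def main(args: Array[String]): Unit = {
    val cols = new MultiColumnAware()
      .setFeaturesCol(Array("f1", "f2"))
      .resolvedFeatureColumns
    println(cols.mkString(", "))
  }
}
```

Defaulting the flag to false would leave existing Predictor implementations unchanged, and each implementation would opt in only once it can read features from separate columns.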


