Hi, I was looking at the `Spark 1.5` DataFrame/Row API <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row> and the implementation of logistic regression <https://github.com/apache/spark/blob/branch-1.5/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala>. As I understand it, the /train/ method there first converts the /DataFrame/ to an /RDD[LabeledPoint]/, as:
    override protected def train(dataset: DataFrame): LogisticRegressionModel = {
      // Extract columns from data. If dataset is persisted, do not persist oldDataset.
      val instances = extractLabeledPoints(dataset).map {
        case LabeledPoint(label: Double, features: Vector) => (label, features)
      }
      ...

and then proceeds to feature standardization, etc.

What I am confused about is this: a /DataFrame/ is of type /RDD[Row]/, and a /Row/ is allowed to hold any value types; e.g. /(1, true, "a string", null)/ seems to be a valid row of a DataFrame. If that is so, what does /extractLabeledPoints/ above actually do? It seems to select only /Array[Double]/ as the feature values in the /Vector/ (dense or sparse). What happens if a column in the DataFrame is a set of /strings/? Also, what happens to integer categorical values?

Thanks in advance,
Nikhil

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-ml-LogisticRegression-assumes-only-Double-valued-features-tp24575.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
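P.S. To make concrete what I think that pattern match assumes, here is a simplified, self-contained Scala sketch. The /Vector/ and /LabeledPoint/ below are stand-in case classes of my own, not the real Spark classes; the point is only that the match shape requires a /Double/ label and a purely numeric feature vector:

```scala
// Stand-in types (NOT org.apache.spark.mllib.*), just to mirror the shape.
case class Vector(values: Array[Double])                 // numeric features only
case class LabeledPoint(label: Double, features: Vector) // Double label only

object ExtractSketch {
  // Rows already conforming to (Double label, numeric feature vector):
  val points: Seq[LabeledPoint] = Seq(
    LabeledPoint(1.0, Vector(Array(0.5, 2.0))),
    LabeledPoint(0.0, Vector(Array(1.5, -1.0)))
  )

  // The same shape of pattern match as in train():
  val instances: Seq[(Double, Vector)] =
    points.map { case LabeledPoint(label, features) => (label, features) }
}
```

As far as I can tell, nothing in this shape leaves room for a /String/ column or an untouched categorical integer, which is exactly what prompts my question.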