Hi, I was looking at the `Spark 1.5` DataFrame/Row API <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row> and the implementation of logistic regression <https://github.com/apache/spark/blob/branch-1.5/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala>. As I understand it, the /train/ method there first converts the /DataFrame/ to an /RDD[LabeledPoint]/, as:
    override protected def train(dataset: DataFrame): LogisticRegressionModel = {
      // Extract columns from data. If dataset is persisted, do not persist oldDataset.
      val instances = extractLabeledPoints(dataset).map {
        case LabeledPoint(label: Double, features: Vector) => (label, features)
      }
      ...

and then proceeds to feature standardization, etc.

What I am confused about is this: a /DataFrame/ is of type /RDD[Row]/, and a /Row/ is allowed to hold any value types; e.g. /(1, true, "a string", null)/ seems to be a valid row of a DataFrame. If that is so, what does /extractLabeledPoints/ above actually do? It seems to select only /Array[Double]/ as the feature values in the /Vector/ (dense or sparse). What happens if a column in the DataFrame is a set of /strings/? Also, what happens to integer categorical values?

Thanks in advance,
Nikhil

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-ml-LogisticRegression-assumes-only-Double-valued-features-tp24575.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
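P.S. To make concrete what I think that pattern match assumes, here is a simplified, self-contained Scala sketch. The /Vector/ and /LabeledPoint/ below are stand-in case classes of my own, not the real Spark classes; the point is only that the match shape requires a /Double/ label and a purely numeric feature vector:

```scala
// Stand-in types (NOT org.apache.spark.mllib.*), just to mirror the shape.
case class Vector(values: Array[Double])                 // numeric features only
case class LabeledPoint(label: Double, features: Vector) // Double label only

object ExtractSketch {
  // Rows already conforming to (Double label, numeric feature vector):
  val points: Seq[LabeledPoint] = Seq(
    LabeledPoint(1.0, Vector(Array(0.5, 2.0))),
    LabeledPoint(0.0, Vector(Array(1.5, -1.0)))
  )

  // The same shape of pattern match as in train():
  val instances: Seq[(Double, Vector)] =
    points.map { case LabeledPoint(label, features) => (label, features) }
}
```

As far as I can tell, nothing in this shape leaves room for a /String/ column or an untouched categorical integer, which is exactly what prompts my question.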