Hey Sandy,

The work should be done by a VectorAssembler, which combines multiple columns (double/int/vector) into a single vector column that becomes the features column for regression. We are going to create JIRAs for each of these standard feature transformers. It would be great if you could help implement some of them.
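For context, here is a minimal plain-Scala sketch of what such an assembler does conceptually. This is not the Spark ML API: it runs without Spark, and `AssembledRow`, `assemble`, and the `Map`-based rows are hypothetical stand-ins for a DataFrame with named numeric columns.

```scala
// Hypothetical model of a VectorAssembler: pick one column as the label
// and concatenate the named input columns into a features vector.
case class AssembledRow(label: Double, features: Array[Double])

def assemble(rows: Seq[Map[String, Double]],
             labelCol: String,
             inputCols: Seq[String]): Seq[AssembledRow] =
  rows.map { r =>
    // A Map is a String => Double function, so mapping the column names
    // through it looks up each input column in order.
    AssembledRow(r(labelCol), inputCols.map(r).toArray)
  }

// Rows mirroring Michael's example: "a" is the label, "b" and "c" features.
val rows = Seq(
  Map("a" -> 1.0, "b" -> 2.3, "c" -> 2.4),
  Map("a" -> 1.2, "b" -> 3.4, "c" -> 1.2),
  Map("a" -> 1.2, "b" -> 2.3, "c" -> 1.2)
)
val out = assemble(rows, "a", Seq("b", "c"))
```

The real transformer would do the same per-row work against DataFrame columns, producing a vector-typed features column.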
Best,
Xiangrui

On Wed, Feb 11, 2015 at 7:55 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> I think there is a minor error here in that the first example needs a
> "tail" after the seq:
>
> df.map { row =>
>   (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))
> }.toDataFrame("label", "features")
>
> On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust
> <mich...@databricks.com> wrote:
>> It sounds like you probably want to do a standard Spark map that results in
>> a tuple with the structure you are looking for. You can then just assign
>> names to turn it back into a DataFrame.
>>
>> Assuming the first column is your label and the rest are features, you can
>> do something like this:
>>
>> val df = sc.parallelize(
>>   (1.0, 2.3, 2.4) ::
>>   (1.2, 3.4, 1.2) ::
>>   (1.2, 2.3, 1.2) :: Nil).toDataFrame("a", "b", "c")
>>
>> df.map { row =>
>>   (row.getDouble(0), row.toSeq.map(_.asInstanceOf[Double]))
>> }.toDataFrame("label", "features")
>>
>> df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>]
>>
>> If you'd prefer to stick closer to SQL, you can define a UDF:
>>
>> val createArray = udf((a: Double, b: Double) => Seq(a, b))
>> df.select('a as 'label, createArray('b, 'c) as 'features)
>>
>> df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>]
>>
>> We'll add createArray as a first-class member of the DSL.
>>
>> Michael
>>
>> On Wed, Feb 11, 2015 at 6:37 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>
>>> Hey All,
>>>
>>> I've been playing around with the new DataFrame and ML pipelines APIs and
>>> am having trouble accomplishing what seems like it should be a fairly
>>> basic task.
>>>
>>> I have a DataFrame where each column is a Double. I'd like to turn this
>>> into a DataFrame with a features column and a label column that I can
>>> feed into a regression.
>>>
>>> So far all the paths I've gone down have led me to internal APIs or
>>> convoluted casting in and out of RDD[Row] and DataFrame.
>>> Is there a simple way of accomplishing this?
>>>
>>> Any assistance (lookin' at you Xiangrui) much appreciated,
>>> Sandy

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
---------------------------------------------------------------------