It sounds like you probably want to do a standard Spark map that produces a tuple with the structure you are looking for. You can then assign names to turn it back into a DataFrame.
Assuming the first column is your label and the rest are features, you can do something like this:

val df = sc.parallelize(
  (1.0, 2.3, 2.4) ::
  (1.2, 3.4, 1.2) ::
  (1.2, 2.3, 1.2) :: Nil).toDataFrame("a", "b", "c")

df.map { row =>
  (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))
}.toDataFrame("label", "features")

df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>]

If you'd prefer to stick closer to SQL, you can define a UDF:

val createArray = udf((a: Double, b: Double) => Seq(a, b))
df.select('a as 'label, createArray('b, 'c) as 'features)

df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>]

We'll add createArray as a first-class member of the DSL.

Michael

On Wed, Feb 11, 2015 at 6:37 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> Hey All,
>
> I've been playing around with the new DataFrame and ML pipelines APIs and
> am having trouble accomplishing what seems like should be a fairly basic
> task.
>
> I have a DataFrame where each column is a Double. I'd like to turn this
> into a DataFrame with a features column and a label column that I can feed
> into a regression.
>
> So far, all the paths I've gone down have led me to internal APIs or
> convoluted casting in and out of RDD[Row] and DataFrame. Is there a simple
> way of accomplishing this?
>
> Any assistance (lookin' at you, Xiangrui) much appreciated,
> Sandy
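For what it's worth, the row-to-(label, features) split above doesn't need Spark to illustrate: here is a minimal sketch of the same transformation over plain Scala collections, where each row is a Seq[Double], the first element is the label, and the rest are the features. The object and method names are just for illustration, not part of any Spark API.

```scala
// Hypothetical helper mirroring the map in the email: head is the label,
// tail is the feature vector. No Spark required.
object LabelFeaturesSplit {
  def split(rows: Seq[Seq[Double]]): Seq[(Double, Seq[Double])] =
    rows.map(row => (row.head, row.tail))

  def main(args: Array[String]): Unit = {
    val rows = Seq(Seq(1.0, 2.3, 2.4), Seq(1.2, 3.4, 1.2))
    println(LabelFeaturesSplit.split(rows))
  }
}
```

The Spark version is the same idea, except the rows come from a DataFrame and the result is renamed back into (label, features) columns.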