It sounds like you probably want to do a standard Spark map that produces
a tuple with the structure you are looking for.  You can then just
assign names to turn it back into a DataFrame.

Assuming the first column is your label and the rest are features you can
do something like this:

val df = sc.parallelize(
  (1.0, 2.3, 2.4) ::
  (1.2, 3.4, 1.2) ::
  (1.2, 2.3, 1.2) :: Nil).toDataFrame("a", "b", "c")

df.map { row =>
  // drop(1) skips the label column so it doesn't end up in the features
  (row.getDouble(0), row.toSeq.drop(1).map(_.asInstanceOf[Double]))
}.toDataFrame("label", "features")

df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>]
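Since the end goal is a regression, here's a sketch of going all the way to
MLlib's LabeledPoint (this assumes you're feeding the spark.mllib regression
API, which wants an RDD of LabeledPoint rather than a DataFrame; drop(1)
again skips the label column):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// One LabeledPoint per row: the first column is the label,
// the remaining columns become the dense feature vector.
val points = df.map { row =>
  LabeledPoint(
    row.getDouble(0),
    Vectors.dense(row.toSeq.drop(1).map(_.asInstanceOf[Double]).toArray))
}

You can pass the resulting RDD straight to something like LinearRegressionWithSGD.train.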

If you'd prefer to stick closer to SQL you can define a UDF:

val createArray = udf((a: Double, b: Double) => Seq(a, b))
df.select('a as 'label, createArray('b,'c) as 'features)

df: org.apache.spark.sql.DataFrame = [label: double, features: array<double>]
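If you'd rather write SQL strings, you should also be able to register the
same function by name (sketch, assuming a SQLContext called sqlContext and a
temp table name of your choosing):

// Register the UDF under a name visible to SQL queries
sqlContext.udf.register("createArray", (a: Double, b: Double) => Seq(a, b))
df.registerTempTable("points")
sqlContext.sql("SELECT a AS label, createArray(b, c) AS features FROM points")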

We'll add createArray as a first-class member of the DSL.

Michael

On Wed, Feb 11, 2015 at 6:37 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hey All,
>
> I've been playing around with the new DataFrame and ML pipelines APIs and
> am having trouble accomplishing what seems like should be a fairly basic
> task.
>
> I have a DataFrame where each column is a Double.  I'd like to turn this
> into a DataFrame with a features column and a label column that I can feed
> into a regression.
>
> So far all the paths I've gone down have led me to internal APIs or
> convoluted casting in and out of RDD[Row] and DataFrame.  Is there a simple
> way of accomplishing this?
>
> any assistance (lookin' at you Xiangrui) much appreciated,
> Sandy
>