Hey Sandy,

The work should be done by a VectorAssembler, which combines multiple
columns (double/int/vector) into a single vector column that then
becomes the features column for regression. We are going to create
JIRAs for each of these standard feature transformers. It would be
great if you could help implement some of them.
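
Roughly, the usage would look something like the sketch below (the
transformer doesn't exist yet, so treat the class and method names as
illustrative), assuming a DataFrame df with label column "a" and
feature columns "b" and "c":

import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical API: combine the numeric columns into one vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("b", "c"))
  .setOutputCol("features")

// Produces a DataFrame with the original columns plus "features";
// rename "a" so the regression sees the expected label column.
val dataset = assembler.transform(df).withColumnRenamed("a", "label")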

Best,
Xiangrui

On Wed, Feb 11, 2015 at 7:55 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> I think there is a minor error here in that the first example needs a
> "tail" after the seq:
>
> df.map { row =>
>   (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))
> }.toDataFrame("label", "features")
>
> On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust
> <mich...@databricks.com> wrote:
>> It sounds like you probably want to do a standard Spark map that results in
>> a tuple with the structure you are looking for.  You can then just assign
>> names to turn it back into a DataFrame.
>>
>> Assuming the first column is your label and the rest are features, you can
>> do something like this:
>>
>> val df = sc.parallelize(
>>   (1.0, 2.3, 2.4) ::
>>   (1.2, 3.4, 1.2) ::
>>   (1.2, 2.3, 1.2) :: Nil).toDataFrame("a", "b", "c")
>>
>> df.map { row =>
>>   (row.getDouble(0), row.toSeq.map(_.asInstanceOf[Double]))
>> }.toDataFrame("label", "features")
>>
>> df: org.apache.spark.sql.DataFrame = [label: double, features:
>> array<double>]
>>
>> If you'd prefer to stick closer to SQL you can define a UDF:
>>
>> val createArray = udf((a: Double, b: Double) => Seq(a, b))
>> df.select('a as 'label, createArray('b,'c) as 'features)
>>
>> df: org.apache.spark.sql.DataFrame = [label: double, features:
>> array<double>]
>>
>> We'll add createArray as a first class member of the DSL.
>>
>> Michael
>>
>> On Wed, Feb 11, 2015 at 6:37 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>
>>> Hey All,
>>>
>>> I've been playing around with the new DataFrame and ML pipelines APIs and
>>> am having trouble accomplishing what seems like should be a fairly basic
>>> task.
>>>
>>> I have a DataFrame where each column is a Double.  I'd like to turn this
>>> into a DataFrame with a features column and a label column that I can feed
>>> into a regression.
>>>
>>> So far all the paths I've gone down have led me to internal APIs or
>>> convoluted casting in and out of RDD[Row] and DataFrame.  Is there a simple
>>> way of accomplishing this?
>>>
>>> any assistance (lookin' at you Xiangrui) much appreciated,
>>> Sandy
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
