Sorry.  Small typo.  That last part should be:

// Note: substring positions in Spark SQL are 1-based (a pos of 0 behaves like 1 here).
val modifiedRows = rows
  .select(
    substring('age, 0, 2) as "age",
    when('gender === 1, "male")
      .otherwise(when('gender === 2, "female").otherwise("unknown")) as "gender"
  )
modifiedRows.show

+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+
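
As an aside, when(...) calls can also be chained rather than nested, which
reads a bit flatter once there are more than two cases. An equivalent sketch:

val modifiedRows2 = rows
  .select(
    substring('age, 0, 2) as "age",
    when('gender === 1, "male")
      .when('gender === 2, "female")
      .otherwise("unknown") as "gender"
  )
modifiedRows2.show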

On Thu, Nov 17, 2016 at 8:57 AM, Stuart White <stuart.whi...@gmail.com> wrote:
> import org.apache.spark.sql.functions._
>
> val rows = Seq(("90s", 1), ("80s", 2), ("80s", 3)).toDF("age", "gender")
> rows.show
>
> +---+------+
> |age|gender|
> +---+------+
> |90s|     1|
> |80s|     2|
> |80s|     3|
> +---+------+
>
> val modifiedRows
>   .select(
>     substring('age, 0, 2) as "age",
>     when('gender === 1, "male").otherwise(when('gender === 2,
> "female").otherwise("unknown")) as "gender"
>   )
> modifiedRows.show
>
> +---+-------+
> |age| gender|
> +---+-------+
> | 90|   male|
> | 80| female|
> | 80|unknown|
> +---+-------+
>
> On Thu, Nov 17, 2016 at 3:37 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>> Could you give me an example of how to use a Column function?
>> Thanks very much.
>>
>> On Thu, Nov 17, 2016 at 12:23 PM, Divya Gehlot <divya.htco...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> You can use the Column functions provided by the Spark API:
>>>
>>>
>>> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
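>>>
>>> For instance, combining two functions from that page (an illustrative
>>> sketch only; df, name, and price are made-up names):
>>>
>>>     import org.apache.spark.sql.functions._
>>>     df.select(upper($"name") as "name", round($"price", 2) as "price")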
>>>
>>> Hope this helps.
>>>
>>> Thanks,
>>> Divya
>>>
>>>
>>> On 17 November 2016 at 12:08, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>> I have a sample, like:
>>>> +---+------+--------------------+
>>>> |age|gender|             city_id|
>>>> +---+------+--------------------+
>>>> |   |     1|1042015:city_2044...|
>>>> |90s|     2|1042015:city_2035...|
>>>> |80s|     2|1042015:city_2061...|
>>>> +---+------+--------------------+
>>>>
>>>> and expectation is:
>>>> "age":  90s -> 90, 80s -> 80
>>>> "gender": 1 -> "male", 2 -> "female"
>>>>
>>>> I have two solutions:
>>>> 1. Handle each column separately, and then join all by index:
>>>>     val age = input.select("age").map(...)
>>>>     val gender = input.select("gender").map(...)
>>>>     val result = ...
>>>>
>>>> 2. Write a UDF for each column, and then use them together (the UDFs are sketched below):
>>>>      val result = input.select(ageUDF($"age"), genderUDF($"gender"))
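>>>>
>>>> where the UDFs might look like this (a rough sketch; the mappings are the
>>>> ones above, the function bodies are illustrative):
>>>>
>>>>     import org.apache.spark.sql.functions.udf
>>>>     // "90s" -> "90": strip the trailing "s"
>>>>     val ageUDF = udf((age: String) => age.stripSuffix("s"))
>>>>     // 1 -> "male", 2 -> "female", anything else -> "unknown"
>>>>     val genderUDF = udf((gender: Int) => gender match {
>>>>       case 1 => "male"
>>>>       case 2 => "female"
>>>>       case _ => "unknown"
>>>>     })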
>>>>
>>>> However, both feel awkward.
>>>>
>>>> Does anyone have a better workflow?
>>>> Should I write some custom Transformers and use an ML Pipeline?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>
>>
