import org.apache.spark.sql.functions._

val rows = Seq(("90s", 1), ("80s", 2), ("80s", 3)).toDF("age", "gender")
rows.show
+---+------+
|age|gender|
+---+------+
|90s|     1|
|80s|     2|
|80s|     3|
+---+------+

val modifiedRows = rows.select(
  substring('age, 0, 2) as "age",
  when('gender === 1, "male")
    .otherwise(when('gender === 2, "female")
    .otherwise("unknown")) as "gender"
)
modifiedRows.show

+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+

On Thu, Nov 17, 2016 at 3:37 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
> Could you give me an example of how to use Column functions?
> Thanks very much.
>
> On Thu, Nov 17, 2016 at 12:23 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> You can use the Column functions provided by the Spark API:
>>
>> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
>>
>> Hope this helps.
>>
>> Thanks,
>> Divya
>>
>> On 17 November 2016 at 12:08, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>>>
>>> Hi,
>>> I have a sample like:
>>> +---+------+--------------------+
>>> |age|gender|             city_id|
>>> +---+------+--------------------+
>>> |   |     1|1042015:city_2044...|
>>> |90s|     2|1042015:city_2035...|
>>> |80s|     2|1042015:city_2061...|
>>> +---+------+--------------------+
>>>
>>> and the expectation is:
>>> "age": 90s -> 90, 80s -> 80
>>> "gender": 1 -> "male", 2 -> "female"
>>>
>>> I have two solutions:
>>> 1. Handle each column separately, and then join all by index:
>>> val age = input.select("age").map(...)
>>> val gender = input.select("gender").map(...)
>>> val result = ...
>>>
>>> 2. Write a UDF for each column, and then use them together:
>>> val result = input.select(ageUDF($"age"), genderUDF($"gender"))
>>>
>>> However, both are awkward.
>>>
>>> Does anyone have a better workflow?
>>> Write some custom Transformers and use a Pipeline?
>>>
>>> Thanks.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
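For the UDF approach mentioned in the quoted thread, the per-column logic can also be kept as plain Scala functions and wrapped with udf(...). This is only a minimal sketch: the names parseAge and parseGender are illustrative (the thread's ageUDF/genderUDF bodies were never shown), and the SparkSession setup is omitted.

// Pure mapping functions, testable without a Spark cluster.
// parseAge keeps the first two characters ("90s" -> "90"),
// parseGender maps the integer code to a label.
def parseAge(age: String): String =
  if (age != null && age.length >= 2) age.substring(0, 2) else ""

def parseGender(gender: Int): String = gender match {
  case 1 => "male"
  case 2 => "female"
  case _ => "unknown"
}

// In a Spark session these could then be registered and applied, e.g.:
// import org.apache.spark.sql.functions.udf
// val ageUDF = udf(parseAge _)
// val genderUDF = udf(parseGender _)
// val result = input.select(ageUDF($"age") as "age",
//                           genderUDF($"gender") as "gender")

Keeping the logic in ordinary functions makes it easy to unit-test the transforms before wiring them into a DataFrame, which the built-in Column expressions above do not offer.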