Re: Best practice for preprocessing feature with DataFrame

2016-11-22 Thread Yan Facai
Thanks, White.

On Thu, Nov 17, 2016 at 11:15 PM, Stuart White 
wrote:

> Sorry.  Small typo.  That last part should be:
>
> val modifiedRows = rows
>   .select(
> substring('age, 0, 2) as "age",
> when('gender === 1, "male").otherwise(when('gender === 2,
> "female").otherwise("unknown")) as "gender"
>   )
> modifiedRows.show
>
> +---+-------+
> |age| gender|
> +---+-------+
> | 90|   male|
> | 80| female|
> | 80|unknown|
> +---+-------+
>
> On Thu, Nov 17, 2016 at 8:57 AM, Stuart White 
> wrote:
> > import org.apache.spark.sql.functions._
> >
> > val rows = Seq(("90s", 1), ("80s", 2), ("80s", 3)).toDF("age", "gender")
> > rows.show
> >
> > +---+------+
> > |age|gender|
> > +---+------+
> > |90s|     1|
> > |80s|     2|
> > |80s|     3|
> > +---+------+
> >
> > val modifiedRows
> >   .select(
> > substring('age, 0, 2) as "age",
> > when('gender === 1, "male").otherwise(when('gender === 2,
> > "female").otherwise("unknown")) as "gender"
> >   )
> > modifiedRows.show
> >
> > +---+-------+
> > |age| gender|
> > +---+-------+
> > | 90|   male|
> > | 80| female|
> > | 80|unknown|
> > +---+-------+
> >
> > On Thu, Nov 17, 2016 at 3:37 AM, 颜发才(Yan Facai) 
> wrote:
> >> Could you give me an example of how to use the Column functions?
> >> Thanks very much.
> >>
> >> On Thu, Nov 17, 2016 at 12:23 PM, Divya Gehlot
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> You can use the Column functions provided by the Spark API
> >>>
> >>>
> >>> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
> >>>
> >>> Hope this helps.
> >>>
> >>> Thanks,
> >>> Divya
> >>>
> >>>
> >>> On 17 November 2016 at 12:08, 颜发才(Yan Facai)  wrote:
> 
>  Hi,
>  I have a sample like:
>  +---+------+--------------------+
>  |age|gender|             city_id|
>  +---+------+--------------------+
>  |   |     1|1042015:city_2044...|
>  |90s|     2|1042015:city_2035...|
>  |80s|     2|1042015:city_2061...|
>  +---+------+--------------------+
>
>  and the expectation is:
>  "age":  90s -> 90, 80s -> 80
>  "gender": 1 -> "male", 2 -> "female"
>
>  I have two solutions:
>  1. Handle each column separately, and then join them all by index.
>  val age = input.select("age").map(...)
>  val gender = input.select("gender").map(...)
>  val result = ...
>
>  2. Write a UDF for each column, and then use them together:
>   val result = input.select(ageUDF($"age"), genderUDF($"gender"))
>
>  However, both are awkward.
>
>  Does anyone have a better workflow?
>  Write some custom Transformers and use a Pipeline?
>
>  Thanks.
> 
> 
> 
> >>>
> >>
>


Re: Best practice for preprocessing feature with DataFrame

2016-11-17 Thread Stuart White
Sorry.  Small typo.  That last part should be:

val modifiedRows = rows
  .select(
substring('age, 0, 2) as "age",
when('gender === 1, "male").otherwise(when('gender === 2,
"female").otherwise("unknown")) as "gender"
  )
modifiedRows.show

+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+
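
For completeness, the same select can also be packaged as a custom Transformer so
it runs inside an ML Pipeline, which is what the original question asked about.
This is only a rough sketch against the Spark 2.x ml API (on 1.6 the transform
method takes a DataFrame instead of a Dataset[_]), and the AgeGenderCleaner class
name is made up:

import org.apache.spark.ml.{Pipeline, PipelineStage, Transformer}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Wraps the column expressions above so they can be one stage in a Pipeline.
class AgeGenderCleaner(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("ageGenderCleaner"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.select(
      substring(col("age"), 0, 2) as "age",
      when(col("gender") === 1, "male")
        .when(col("gender") === 2, "female")
        .otherwise("unknown") as "gender")

  // Both output columns come back as strings.
  override def transformSchema(schema: StructType): StructType =
    StructType(Seq(StructField("age", StringType), StructField("gender", StringType)))

  override def copy(extra: ParamMap): AgeGenderCleaner = defaultCopy(extra)
}

// Usage sketch: any further transformers/estimators could follow it as later stages.
val pipeline = new Pipeline().setStages(Array[PipelineStage](new AgeGenderCleaner()))
val cleaned = pipeline.fit(rows).transform(rows)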

On Thu, Nov 17, 2016 at 8:57 AM, Stuart White  wrote:
> import org.apache.spark.sql.functions._
>
> val rows = Seq(("90s", 1), ("80s", 2), ("80s", 3)).toDF("age", "gender")
> rows.show
>
> +---+------+
> |age|gender|
> +---+------+
> |90s|     1|
> |80s|     2|
> |80s|     3|
> +---+------+
>
> val modifiedRows
>   .select(
> substring('age, 0, 2) as "age",
> when('gender === 1, "male").otherwise(when('gender === 2,
> "female").otherwise("unknown")) as "gender"
>   )
> modifiedRows.show
>
> +---+-------+
> |age| gender|
> +---+-------+
> | 90|   male|
> | 80| female|
> | 80|unknown|
> +---+-------+
>
> On Thu, Nov 17, 2016 at 3:37 AM, 颜发才(Yan Facai)  wrote:
>> Could you give me an example of how to use the Column functions?
>> Thanks very much.
>>
>> On Thu, Nov 17, 2016 at 12:23 PM, Divya Gehlot 
>> wrote:
>>>
>>> Hi,
>>>
>>> You can use the Column functions provided by the Spark API
>>>
>>>
>>> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
>>>
>>> Hope this helps.
>>>
>>> Thanks,
>>> Divya
>>>
>>>
>>> On 17 November 2016 at 12:08, 颜发才(Yan Facai)  wrote:

 Hi,
 I have a sample like:
 +---+------+--------------------+
 |age|gender|             city_id|
 +---+------+--------------------+
 |   |     1|1042015:city_2044...|
 |90s|     2|1042015:city_2035...|
 |80s|     2|1042015:city_2061...|
 +---+------+--------------------+

 and the expectation is:
 "age":  90s -> 90, 80s -> 80
 "gender": 1 -> "male", 2 -> "female"

 I have two solutions:
 1. Handle each column separately, and then join them all by index.
 val age = input.select("age").map(...)
 val gender = input.select("gender").map(...)
 val result = ...

 2. Write a UDF for each column, and then use them together:
  val result = input.select(ageUDF($"age"), genderUDF($"gender"))

 However, both are awkward.

 Does anyone have a better workflow?
 Write some custom Transformers and use a Pipeline?

 Thanks.



>>>
>>




Re: Best practice for preprocessing feature with DataFrame

2016-11-17 Thread Stuart White
import org.apache.spark.sql.functions._

val rows = Seq(("90s", 1), ("80s", 2), ("80s", 3)).toDF("age", "gender")
rows.show

+---+------+
|age|gender|
+---+------+
|90s|     1|
|80s|     2|
|80s|     3|
+---+------+

val modifiedRows
  .select(
substring('age, 0, 2) as "age",
when('gender === 1, "male").otherwise(when('gender === 2,
"female").otherwise("unknown")) as "gender"
  )
modifiedRows.show

+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+

On Thu, Nov 17, 2016 at 3:37 AM, 颜发才(Yan Facai)  wrote:
> Could you give me an example of how to use the Column functions?
> Thanks very much.
>
> On Thu, Nov 17, 2016 at 12:23 PM, Divya Gehlot 
> wrote:
>>
>> Hi,
>>
>> You can use the Column functions provided by the Spark API
>>
>>
>> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
>>
>> Hope this helps.
>>
>> Thanks,
>> Divya
>>
>>
>> On 17 November 2016 at 12:08, 颜发才(Yan Facai)  wrote:
>>>
>>> Hi,
>>> I have a sample like:
>>> +---+------+--------------------+
>>> |age|gender|             city_id|
>>> +---+------+--------------------+
>>> |   |     1|1042015:city_2044...|
>>> |90s|     2|1042015:city_2035...|
>>> |80s|     2|1042015:city_2061...|
>>> +---+------+--------------------+
>>>
>>> and the expectation is:
>>> "age":  90s -> 90, 80s -> 80
>>> "gender": 1 -> "male", 2 -> "female"
>>>
>>> I have two solutions:
>>> 1. Handle each column separately, and then join them all by index.
>>> val age = input.select("age").map(...)
>>> val gender = input.select("gender").map(...)
>>> val result = ...
>>>
>>> 2. Write a UDF for each column, and then use them together:
>>>  val result = input.select(ageUDF($"age"), genderUDF($"gender"))
>>>
>>> However, both are awkward.
>>>
>>> Does anyone have a better workflow?
>>> Write some custom Transformers and use a Pipeline?
>>>
>>> Thanks.
>>>
>>>
>>>
>>
>




Re: Best practice for preprocessing feature with DataFrame

2016-11-17 Thread Yan Facai
Could you give me an example of how to use the Column functions?
Thanks very much.

On Thu, Nov 17, 2016 at 12:23 PM, Divya Gehlot 
wrote:

> Hi,
>
> You can use the Column functions provided by the Spark API
>
> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
>
> Hope this helps.
>
> Thanks,
> Divya
>
>
> On 17 November 2016 at 12:08, 颜发才(Yan Facai)  wrote:
>
>> Hi,
>> I have a sample like:
>> +---+------+--------------------+
>> |age|gender|             city_id|
>> +---+------+--------------------+
>> |   |     1|1042015:city_2044...|
>> |90s|     2|1042015:city_2035...|
>> |80s|     2|1042015:city_2061...|
>> +---+------+--------------------+
>>
>> and the expectation is:
>> "age":  90s -> 90, 80s -> 80
>> "gender": 1 -> "male", 2 -> "female"
>>
>> I have two solutions:
>> 1. Handle each column separately, and then join them all by index.
>> val age = input.select("age").map(...)
>> val gender = input.select("gender").map(...)
>> val result = ...
>>
>> 2. Write a UDF for each column, and then use them together:
>>  val result = input.select(ageUDF($"age"), genderUDF($"gender"))
>>
>> However, both are awkward.
>>
>> Does anyone have a better workflow?
>> Write some custom Transformers and use a Pipeline?
>>
>> Thanks.
>>
>>
>>
>>
>


Re: Best practice for preprocessing feature with DataFrame

2016-11-16 Thread Divya Gehlot
Hi,

You can use the Column functions provided by the Spark API

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
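
A minimal sketch of what that looks like, assuming input is the DataFrame from
the question and sqlContext.implicits._ (or spark.implicits._ on 2.x) is
imported for the $"..." column syntax:

import org.apache.spark.sql.functions.{substring, when}

// Trim "90s" -> "90" and map the gender code to a label, in a single select.
val cleaned = input.select(
  substring($"age", 0, 2) as "age",
  when($"gender" === 1, "male")
    .when($"gender" === 2, "female")
    .otherwise("unknown") as "gender")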

Hope this helps.

Thanks,
Divya


On 17 November 2016 at 12:08, 颜发才(Yan Facai)  wrote:

> Hi,
> I have a sample like:
> +---+------+--------------------+
> |age|gender|             city_id|
> +---+------+--------------------+
> |   |     1|1042015:city_2044...|
> |90s|     2|1042015:city_2035...|
> |80s|     2|1042015:city_2061...|
> +---+------+--------------------+
>
> and the expectation is:
> "age":  90s -> 90, 80s -> 80
> "gender": 1 -> "male", 2 -> "female"
>
> I have two solutions:
> 1. Handle each column separately, and then join them all by index.
> val age = input.select("age").map(...)
> val gender = input.select("gender").map(...)
> val result = ...
>
> 2. Write a UDF for each column, and then use them together:
>  val result = input.select(ageUDF($"age"), genderUDF($"gender"))
>
> However, both are awkward.
>
> Does anyone have a better workflow?
> Write some custom Transformers and use a Pipeline?
>
> Thanks.
>
>
>
>


Best practice for preprocessing feature with DataFrame

2016-11-16 Thread Yan Facai
Hi,
I have a sample like:
+---+------+--------------------+
|age|gender|             city_id|
+---+------+--------------------+
|   |     1|1042015:city_2044...|
|90s|     2|1042015:city_2035...|
|80s|     2|1042015:city_2061...|
+---+------+--------------------+

and the expectation is:
"age":  90s -> 90, 80s -> 80
"gender": 1 -> "male", 2 -> "female"

I have two solutions:
1. Handle each column separately, and then join them all by index.
val age = input.select("age").map(...)
val gender = input.select("gender").map(...)
val result = ...

2. Write a UDF for each column, and then use them together:
 val result = input.select(ageUDF($"age"), genderUDF($"gender"))

However, both are awkward.

Does anyone have a better workflow?
Write some custom Transformers and use a Pipeline?

Thanks.
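
For reference, a minimal sketch of option 2 (one small UDF per column, combined
in a single select). ageUDF and genderUDF are the hypothetical names used above,
implemented here only from the sample values shown; option 1 tends to be awkward
precisely because a DataFrame has no built-in row index to join on.

import org.apache.spark.sql.functions.udf

// $"..." assumes sqlContext.implicits._ (or spark.implicits._) is in scope.
val ageUDF = udf((age: String) => age.stripSuffix("s"))      // "90s" -> "90"
val genderUDF = udf((gender: Int) => gender match {
  case 1 => "male"
  case 2 => "female"
  case _ => "unknown"
})

val result = input.select(ageUDF($"age") as "age", genderUDF($"gender") as "gender")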