Re: Best practice for preprocessing feature with DataFrame
Thanks, White.

On Thu, Nov 17, 2016 at 11:15 PM, Stuart White wrote:
> Sorry. Small typo. That last part should be:
>
> val modifiedRows = rows
>   .select(
>     substring('age, 0, 2) as "age",
>     when('gender === 1, "male").otherwise(when('gender === 2, "female").otherwise("unknown")) as "gender"
>   )
> modifiedRows.show
>
> +---+-------+
> |age| gender|
> +---+-------+
> | 90|   male|
> | 80| female|
> | 80|unknown|
> +---+-------+
Re: Best practice for preprocessing feature with DataFrame
Sorry. Small typo. That last part should be:

val modifiedRows = rows
  .select(
    substring('age, 0, 2) as "age",
    when('gender === 1, "male").otherwise(when('gender === 2, "female").otherwise("unknown")) as "gender"
  )
modifiedRows.show

+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Best practice for preprocessing feature with DataFrame
import org.apache.spark.sql.functions._

val rows = Seq(("90s", 1), ("80s", 2), ("80s", 3)).toDF("age", "gender")
rows.show

+---+------+
|age|gender|
+---+------+
|90s|     1|
|80s|     2|
|80s|     3|
+---+------+

val modifiedRows
  .select(
    substring('age, 0, 2) as "age",
    when('gender === 1, "male").otherwise(when('gender === 2, "female").otherwise("unknown")) as "gender"
  )
modifiedRows.show

+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+

On Thu, Nov 17, 2016 at 3:37 AM, 颜发才(Yan Facai) wrote:
> Could you give me an example of how to use a Column function?
> Thanks very much.
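The `substring` and `when`/`otherwise` expressions above encode two simple per-column mappings. As a rough sketch of that logic in plain Scala (hypothetical helper names `parseAge` and `decodeGender`, not from this thread, and no Spark dependency):

```scala
// Per-column mapping logic, mirroring substring('age, 0, 2)
// and the when/otherwise chain on 'gender.
def parseAge(age: String): Option[Int] =
  Option(age)
    .map(_.take(2))                                // "90s" -> "90"
    .filter(s => s.length == 2 && s.forall(_.isDigit))
    .map(_.toInt)

def decodeGender(code: Int): String = code match {
  case 1 => "male"
  case 2 => "female"
  case _ => "unknown"                              // mirrors .otherwise("unknown")
}
```

Inside a DataFrame the same logic stays in Column expressions (as above), which lets Catalyst optimize the plan; pulling it out into plain functions is mainly useful for unit-testing the mappings in isolation.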
Re: Best practice for preprocessing feature with DataFrame
Could you give me an example of how to use a Column function?
Thanks very much.

On Thu, Nov 17, 2016 at 12:23 PM, Divya Gehlot wrote:
> Hi,
>
> You can use the Column functions provided by the Spark API:
>
> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
>
> Hope this helps.
>
> Thanks,
> Divya
Re: Best practice for preprocessing feature with DataFrame
Hi,

You can use the Column functions provided by the Spark API:

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html

Hope this helps.

Thanks,
Divya
Best practice for preprocessing feature with DataFrame
Hi,
I have a sample, like:

+---+------+--------------------+
|age|gender|             city_id|
+---+------+--------------------+
|   |     1|1042015:city_2044...|
|90s|     2|1042015:city_2035...|
|80s|     2|1042015:city_2061...|
+---+------+--------------------+

and the expectation is:
"age": 90s -> 90, 80s -> 80
"gender": 1 -> "male", 2 -> "female"

I have two solutions:

1. Handle each column separately, and then join all by index.
   val age = input.select("age").map(...)
   val gender = input.select("gender").map(...)
   val result = ...

2. Write a udf function for each column, and then use them together:
   val result = input.select(ageUDF($"age"), genderUDF($"gender"))

However, both are awkward.

Does anyone have a better work flow?
Write some custom Transformers and use a pipeline?

Thanks.
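The transformation the question asks for can be sketched in one pass, with a plain Seq of tuples standing in for the DataFrame (purely illustrative; the `input`/`result` names match the question but no Spark is involved here):

```scala
// Both column mappings applied together over stand-in rows.
val input = Seq(("90s", 1), ("80s", 2), ("80s", 3))

val result = input.map { case (age, gender) =>
  val cleanAge = age.take(2)          // "90s" -> "90"
  val genderLabel = gender match {    // 1 -> "male", 2 -> "female"
    case 1 => "male"
    case 2 => "female"
    case _ => "unknown"
  }
  (cleanAge, genderLabel)
}
```

The replies later in the thread show the equivalent expressed with Spark Column functions (`substring`, `when`/`otherwise`) in a single `select`, which avoids both the join-by-index and the UDF boilerplate.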