Re: Complex transformation on a dataframe column

2015-10-17 Thread Raghavendra Pandey
Here is a quick code sample I came up with:

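import org.apache.spark.sql.functions.udf
import sqlContext.implicits._  // needed for .toDF(); usually already in scope in spark-shell

case class Input(ID: String, Name: String, PhoneNumber: String, Address: String)

val df = sc.parallelize(Seq(
  Input("1", "raghav", "0123456789", "houseNo:StreetNo:City:State:Zip")
)).toDF()

// Wrap the formatting logic in a UDF so it can be applied to a column.
val formatAddress = udf { (s: String) => s.split(":").mkString("-") }

// Add the UDF's output as a new column next to the existing ones.
val outputDF = df.withColumn("FormattedAddress", formatAddress(df("Address")))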
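
One caveat with the sample above: the Address column is nullable, and calling split on a null string throws a NullPointerException inside the UDF. Below is a minimal sketch of a null-safe variant, assuming a spark-shell session where sc and sqlContext are already defined. The body of formatAddress is only a stand-in for the real formatting logic, and the names formatAddressOrNull and formatAddressUdf are made up for illustration.

```scala
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._  // assumes sqlContext exists, as in spark-shell

case class Input(ID: String, Name: String, PhoneNumber: String, Address: String)

// Stand-in for the real formatting logic; only the signature comes from the question.
def formatAddress(address: String): String =
  address.split(":").mkString("-")

// Hypothetical wrapper: Option(null) is None, so a null address stays null
// instead of throwing inside the UDF.
def formatAddressOrNull(address: String): String =
  Option(address).map(formatAddress).orNull

// Turn the plain Scala function into a column-level UDF.
val formatAddressUdf = udf(formatAddressOrNull _)

val df = sc.parallelize(Seq(
  Input("1", "raghav", "0123456789", "houseNo:StreetNo:City:State:Zip"),
  Input("2", "hao", "0123456789", null)  // row with a missing address
)).toDF()

// The second row gets a null FormattedAddress rather than failing the job.
val outputDF = df.withColumn("FormattedAddress", formatAddressUdf(df("Address")))
```

Keeping the formatting logic in a plain function and wrapping it separately means the function can be unit tested without a Spark context.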


-Raghav

On Thu, Oct 15, 2015 at 10:34 PM, Hao Wang <billhao.l...@gmail.com> wrote:

> Hi,
>
> I have searched around but could not find a satisfactory answer to this
> question: what is the best way to apply a complex transformation to a
> DataFrame column?
>
> For example, I have a DataFrame with the following schema and a function
> with fairly complex logic for formatting addresses. I would like to apply
> the function to each address and store the output as an additional
> column in the DataFrame. What is the best way to do this? Use DataFrame.map?
> Define a UDF? A code example would be appreciated.
>
> Input dataframe:
> root
>  |-- ID: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- PhoneNumber: string (nullable = true)
>  |-- Address: string (nullable = true)
>
> Output dataframe:
> root
>  |-- ID: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- PhoneNumber: string (nullable = true)
>  |-- Address: string (nullable = true)
>  |-- FormattedAddress: string (nullable = true)
>
> The function that formats addresses:
> def formatAddress(address: String): String
>
>
> Best regards,
> Hao Wang
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Complex transformation on a dataframe column

2015-10-15 Thread Hao Wang
Hi,

I have searched around but could not find a satisfactory answer to this question:
what is the best way to apply a complex transformation to a DataFrame column?

For example, I have a DataFrame with the following schema and a function with
fairly complex logic for formatting addresses. I would like to apply the function
to each address and store the output as an additional column in the
DataFrame. What is the best way to do this? Use DataFrame.map? Define a UDF? A
code example would be appreciated.

Input dataframe:
root
 |-- ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- PhoneNumber: string (nullable = true)
 |-- Address: string (nullable = true)

Output dataframe:
root
 |-- ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- PhoneNumber: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- FormattedAddress: string (nullable = true)

The function that formats addresses:
def formatAddress(address: String): String


Best regards,
Hao Wang
