Re: What is the difference for the following UDFs?

2019-05-14 Thread Qian He
Hi Jacek,

Thanks for your reply. The case you provided is actually the same as the
second option in my original email. What I'm wondering about is the difference
between the two in terms of query performance or efficiency.

On Tue, May 14, 2019 at 3:51 PM Jacek Laskowski wrote:

> Hi,
>
> For this particular case I'd use Column.substr (
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column),
> e.g.
>
> val ns = Seq(("hello world", 1, 5)).toDF("w", "b", "e")
> scala> ns.select($"w".substr($"b", $"e" - $"b" + 1) as "demo").show
> +-----+
> | demo|
> +-----+
> |hello|
> +-----+
>
> Regards,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, May 14, 2019 at 5:08 PM Qian He wrote:
>
>> For example, I have a dataframe with 3 columns: URL, START, END. For each
>> URL in the URL column, I want to extract the substring that starts at START
>> and ends at END.
>> +--------------+-----+---+
>> |URL           |START|END|
>> +--------------+-----+---+
>> |www.amazon.com|4    |14 |
>> |www.yahoo.com |4    |13 |
>> |www.amazon.com|4    |14 |
>> |www.google.com|4    |14 |
>> +--------------+-----+---+
>>
>> I have UDF1:
>>
>> def getSubString = (input: String, start: Int, end: Int) => {
>>   input.substring(start, end)
>> }
>> val udf1 = udf(getSubString)
>>
>> and another UDF2:
>>
>> def getColSubString()(c1: Column, c2: Column, c3: Column): Column = {
>>   c1.substr(c2, c3 - c2)
>> }
>>
>> Let's assume they both generate the result I want. From a performance
>> perspective, is there any difference between these two UDFs?
>>
>>
>>


Re: What is the difference for the following UDFs?

2019-05-14 Thread Jacek Laskowski
Hi,

For this particular case I'd use Column.substr (
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column),
e.g.

val ns = Seq(("hello world", 1, 5)).toDF("w", "b", "e")
scala> ns.select($"w".substr($"b", $"e" - $"b" + 1) as "demo").show
+-----+
| demo|
+-----+
|hello|
+-----+
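
On the performance side of the question, here is a minimal sketch (not taken
from the original messages) that puts the two variants next to each other. It
assumes a local SparkSession. The object name SubstrComparison and the method
compare are hypothetical. The Column variant uses START + 1 because
Column.substr is 1-based and takes a length, while String.substring is 0-based
with an exclusive end. That shift is an adjustment to the code in the thread so
that both variants return the same strings.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.udf

// Hypothetical helper object, not part of the thread.
object SubstrComparison {

  // Returns the substrings produced by the UDF variant and by the Column variant.
  def compare(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._

    val urls = Seq(("www.amazon.com", 4, 14), ("www.yahoo.com", 4, 13))
      .toDF("URL", "START", "END")

    // UDF1 from the thread: String.substring is 0-based with an exclusive end.
    val udf1 = udf((input: String, start: Int, end: Int) => input.substring(start, end))

    // Column-based variant: Column.substr is 1-based and takes (position, length),
    // so START is shifted by one to match udf1 (an adjustment, not in the thread).
    def getColSubString(c1: Column, c2: Column, c3: Column): Column =
      c1.substr(c2 + 1, c3 - c2)

    val viaUdf = urls.select(udf1($"URL", $"START", $"END") as "sub")
    val viaColumn = urls.select(getColSubString($"URL", $"START", $"END") as "sub")

    // Print the extended plans to inspect how each expression is represented.
    viaUdf.explain(true)
    viaColumn.explain(true)

    (viaUdf.as[String].collect().toSeq, viaColumn.as[String].collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; in spark-shell, pass the existing `spark`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("substr-comparison")
      .getOrCreate()
    try {
      val (fromUdf, fromColumn) = compare(spark)
      println(s"UDF: $fromUdf, Column: $fromColumn")
    } finally spark.stop()
  }
}
```

As a general rule, a Scala UDF is opaque to the Catalyst optimizer and its
arguments have to be converted to JVM objects for every row, whereas
Column.substr stays a built-in expression that Spark can optimize. The built-in
is therefore usually the safer default, but measuring on real data is the only
way to confirm the difference for a given workload.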

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski


On Tue, May 14, 2019 at 5:08 PM Qian He wrote:

> For example, I have a dataframe with 3 columns: URL, START, END. For each
> URL in the URL column, I want to extract the substring that starts at START
> and ends at END.
> +--------------+-----+---+
> |URL           |START|END|
> +--------------+-----+---+
> |www.amazon.com|4    |14 |
> |www.yahoo.com |4    |13 |
> |www.amazon.com|4    |14 |
> |www.google.com|4    |14 |
> +--------------+-----+---+
>
> I have UDF1:
>
> def getSubString = (input: String, start: Int, end: Int) => {
>   input.substring(start, end)
> }
> val udf1 = udf(getSubString)
>
> and another UDF2:
>
> def getColSubString()(c1: Column, c2: Column, c3: Column): Column = {
>   c1.substr(c2, c3 - c2)
> }
>
> Let's assume they both generate the result I want. From a performance
> perspective, is there any difference between these two UDFs?
>
>
>