Hi Jacek,

Thanks for your reply. The case you provided is actually the same as the second
option in my original email. What I'm wondering about is the difference between
the two options in terms of query performance or efficiency.
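
To make the comparison concrete, here is a minimal sketch of how the two
variants could be put side by side by printing their physical plans with
explain(). The object name SubstrComparison, the helper names, and the 1-based
inclusive convention for the start/end columns (taken from your example) are
assumptions for illustration only. As far as I understand, a Scala UDF is
opaque to the Catalyst optimizer, while Column.substr resolves to a built-in
expression, so the plans are where I would expect any difference to show up.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object SubstrComparison {
  // Assumption: start and end are 1-based and inclusive, as in the
  // "hello world", 1, 5 example from the thread.
  val oneBasedSubstring: (String, Int, Int) => String =
    (input, start, end) => input.substring(start - 1, end)

  // Variant 1: a Scala UDF wrapping plain JVM string code.
  val substringUdf = udf(oneBasedSubstring)

  // Variant 2: a helper that only composes built-in Column expressions.
  def substringCol(c: Column, start: Column, end: Column): Column =
    c.substr(start, end - start + 1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("substr-comparison")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("hello world", 1, 5)).toDF("w", "b", "e")

    val viaUdf = df.select(substringUdf(col("w"), col("b"), col("e")).as("demo"))
    val viaCol = df.select(substringCol(col("w"), col("b"), col("e")).as("demo"))

    // Print both physical plans so they can be compared by eye.
    viaUdf.explain()
    viaCol.explain()

    spark.stop()
  }
}
```

Running main prints one plan per variant; timing both on a realistic data
volume would still be needed to quantify any difference.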

On Tue, May 14, 2019 at 3:51 PM Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> For this particular case I'd use Column.substr (
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column),
> e.g.
>
> val ns = Seq(("hello world", 1, 5)).toDF("w", "b", "e")
> scala> ns.select($"w".substr($"b", $"e" - $"b" + 1) as "demo").show
> +-----+
> | demo|
> +-----+
> |hello|
> +-----+
>
> Regards,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, May 14, 2019 at 5:08 PM Qian He <hq.ja...@gmail.com> wrote:
>
>> For example, I have a dataframe with 3 columns: URL, START, END. For each
>> url in the URL column, I want to fetch the substring that starts at START
>> and ends at END.
>> +--------------+-----+---+
>> |URL           |START|END|
>> +--------------+-----+---+
>> |www.amazon.com|4    |14 |
>> |www.yahoo.com |4    |13 |
>> |www.amazon.com|4    |14 |
>> |www.google.com|4    |14 |
>> +--------------+-----+---+
>>
>> I have UDF1:
>>
>> def getSubString = (input: String, start: Int, end: Int) => {
>>    input.substring(start, end)
>> }
>> val udf1 = udf(getSubString)
>>
>> and another UDF2:
>>
>> def getColSubString()(c1: Column, c2: Column, c3: Column): Column = {
>>    c1.substr(c2, c3-c2)
>> }
>>
>> Let's assume they can both generate the result I want. But, from a performance
>> perspective, is there any difference between those two UDFs?
>>
>>
>>
