Actually, good question; I'm not sure. I don't think Spark would
vectorize these operations over rows.
In a pandas UDF, by contrast, you're handed whole batches of rows as pandas
Series, so you can apply operations like sin to thousands of values at once
in native code via numpy. It's trivially vectorizable, and I've seen good
wins over, at least, a single-row Python UDF.
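
For illustration, a minimal sketch of such a vectorized pandas UDF (the
column names lon1/lat1/lon2/lat2 and the haversine formulation are
assumptions, not from this thread):

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def haversine_km(lon1: pd.Series, lat1: pd.Series,
                 lon2: pd.Series, lat2: pd.Series) -> pd.Series:
    # numpy applies each operation to the whole batch at once, in native code
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# usage (hypothetical column names):
# df = df.withColumn("dist_km", haversine_km("lon1", "lat1", "lon2", "lat2"))

Spark feeds the UDF batches of rows as pandas Series via Arrow, so the
per-row Python overhead of a plain UDF largely disappears.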

On Fri, Apr 9, 2021 at 9:14 AM ayan guha <guha.a...@gmail.com> wrote:

> Hi Sean - absolutely open to suggestions.
>
> My impression was that using Spark's native functions should provide
> performance similar to the Scala ones, because the serialization penalty
> should not be there, unlike with native Python UDFs.
>
> Is that understanding wrong?
>
>
>
> On Fri, 9 Apr 2021 at 10:55 pm, Rao Bandaru <rao.m...@outlook.com> wrote:
>
>> Hi All,
>>
>>
>> Yes, I need to add the below scenario-based code to a running Spark
>> job. While executing it, the job took a long time to complete. Please
>> suggest the best way to implement the below requirement without using a UDF.
>>
>>
>> Thanks,
>>
>> Ankamma Rao B
>> ------------------------------
>> *From:* Sean Owen <sro...@gmail.com>
>> *Sent:* Friday, April 9, 2021 6:11 PM
>> *To:* ayan guha <guha.a...@gmail.com>
>> *Cc:* Rao Bandaru <rao.m...@outlook.com>; User <user@spark.apache.org>
>> *Subject:* Re: [Spark SQL]: to calculate distance between four
>> coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark
>> dataframe
>>
>> Note that this can be significantly faster with a pandas UDF, because you
>> can vectorize the operations across many rows at once.
>>
>> On Fri, Apr 9, 2021, 7:32 AM ayan guha <guha.a...@gmail.com> wrote:
>>
>> Hi
>>
>> We are using a haversine-style great-circle distance function for this,
>> composed from Spark's native column functions.
>>
>> from pyspark.sql.functions import acos, cos, sin, lit, toRadians
>>
>> # Great-circle distance in km (spherical law of cosines), built entirely
>> # from Spark's native column functions:
>> def haversine_distance(long_x, lat_x, long_y, lat_y):
>>     return acos(
>>         sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
>>         cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
>>             cos(toRadians(long_x) - toRadians(long_y))
>>     ) * lit(6371.0)
>>
>> # It returns a Column expression, so apply it directly to columns; no
>> # udf() wrapper is needed (the inner functions expect Columns, not Python
>> # floats, so the udf form would fail at runtime):
>> # df = df.withColumn("dist_km",
>> #                    haversine_distance("long1", "lat1", "long2", "lat2"))
>>
>> In case you want to use just Spark SQL, you can still use the same
>> functions shown above and implement it in SQL.
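>>
>> For example (a sketch; the view name "points" and the column names are
>> assumptions), the same formula can be run as a SQL expression:
>>
>> df.createOrReplaceTempView("points")
>> dist = spark.sql("""
>>     SELECT *,
>>            acos(
>>                sin(radians(lat1)) * sin(radians(lat2)) +
>>                cos(radians(lat1)) * cos(radians(lat2)) *
>>                cos(radians(lon1) - radians(lon2))
>>            ) * 6371.0 AS dist_km
>>     FROM points
>> """)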
>>
>> Any reason you do not want to use a UDF?
>>
>> Credit
>> <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>
>>
>>
>> On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru <rao.m...@outlook.com> wrote:
>>
>> Hi All,
>>
>>
>>
>> I have a requirement to calculate the distance between two points given
>> four coordinate columns (Latitude1, Longitude1, Latitude2, Longitude2) in a
>> *pyspark dataframe*, with the help of from *geopy* import *distance*,
>> without using a *UDF* (user-defined function). Please help with how to
>> achieve this scenario.
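>>
>> For reference, the per-row geopy computation in question looks roughly
>> like the sketch below (variable names are illustrative, not from the job):
>>
>> from geopy.distance import distance
>>
>> # geodesic distance between two (lat, lon) points; evaluated one row at
>> # a time in Python, which is what makes it slow at scale
>> dist_km = distance((lat1, lon1), (lat2, lon2)).km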
>>
>>
>>
>> Thanks,
>>
>> Ankamma Rao B
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>> --
> Best Regards,
> Ayan Guha
>
