This can be significantly faster with a pandas UDF, since pandas UDFs let you
vectorize the operations over whole batches instead of calling Python once per row.
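
A minimal sketch of what that means: the vectorized haversine math a pandas UDF body would run, shown here on plain pandas Series so it can be followed without a Spark session (the column names and the sample frame are hypothetical):

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # Earth's mean radius

def haversine_km(lon1: pd.Series, lat1: pd.Series,
                 lon2: pd.Series, lat2: pd.Series) -> pd.Series:
    # Vectorized haversine: operates on whole Series at once, which is
    # what makes a pandas UDF faster than a row-at-a-time Python UDF.
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return pd.Series(2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a)))

# In Spark, this same function would be registered with
# pyspark.sql.functions.pandas_udf("double") and applied to the four
# coordinate columns; here we just call it on pandas data directly.
df = pd.DataFrame({
    "lon1": [0.0], "lat1": [0.0],
    "lon2": [1.0], "lat2": [0.0],
})
dist = haversine_km(df["lon1"], df["lat1"], df["lon2"], df["lat2"])
# one degree of longitude at the equator is roughly 111.19 km
```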

On Fri, Apr 9, 2021, 7:32 AM ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> We are using a haversine distance function for this. Because it is built
> entirely from Spark's native column functions, it returns a Column
> expression and can be applied directly, with no Python UDF wrapper needed.
>
> from pyspark.sql.functions import acos, col, cos, lit, sin, toRadians
>
> def haversine_distance(long_x, lat_x, long_y, lat_y):
>     # great-circle distance in km (6371.0 = Earth's mean radius)
>     return acos(
>         sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
>         cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
>             cos(toRadians(long_x) - toRadians(long_y))
>     ) * lit(6371.0)
>
> # apply it directly to the coordinate columns:
> # df = df.withColumn("dist_km",
> #     haversine_distance(col("Longitude1"), col("Latitude1"),
> #                        col("Longitude2"), col("Latitude2")))
>
> If you prefer plain Spark SQL, the same trigonometric functions are
> available there, so you can implement the formula directly in a SQL
> expression.
>
> Any reason you do not want to use a UDF?
>
> Credit
> <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>
>
>
> On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru <rao.m...@outlook.com> wrote:
>
>> Hi All,
>>
>>
>>
>> I have a requirement to calculate the distance between two coordinate
>> pairs (Latitude1, Longitude1, Latitude2, Longitude2) in a *pyspark
>> dataframe* with the help of *geopy*'s *distance* module, without using a
>> *UDF* (user defined function). Please help me achieve this.
>>
>>
>>
>> Thanks,
>>
>> Ankamma Rao B
>>
>
>
> --
> Best Regards,
> Ayan Guha
>