Note that this can be significantly faster with a pandas UDF, because you can vectorize the operations.
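For example, a minimal sketch of the vectorized version (Spark 3.x type-hinted pandas UDF; the column names long1/lat1/long2/lat2 are illustrative):

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def haversine_km(long_x: pd.Series, lat_x: pd.Series,
                 long_y: pd.Series, lat_y: pd.Series) -> pd.Series:
    # Whole-batch NumPy math instead of one Python call per row
    lat_x, lat_y = np.radians(lat_x), np.radians(lat_y)
    cos_d = (np.sin(lat_x) * np.sin(lat_y)
             + np.cos(lat_x) * np.cos(lat_y)
             * np.cos(np.radians(long_y - long_x)))
    # Clip to [-1, 1] so rounding error cannot push arccos out of domain
    return 6371.0 * np.arccos(np.clip(cos_d, -1.0, 1.0))

df = df.withColumn("dist_km", haversine_km("long1", "lat1", "long2", "lat2"))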
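And if the goal is to avoid a UDF entirely (as the original question asks), the haversine_distance function quoted below already builds a native column expression, so it can be applied to columns directly without the udf wrapper, again with illustrative column names:

from pyspark.sql.functions import acos, cos, sin, lit, toRadians

def haversine_distance(long_x, lat_x, long_y, lat_y):
    return acos(
        sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
        cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
        cos(toRadians(long_x) - toRadians(long_y))
    ) * lit(6371.0)

# Evaluates as built-in Spark SQL expressions end to end; no UDF involved
df = df.withColumn("dist_km",
                   haversine_distance(df.long1, df.lat1, df.long2, df.lat2))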
On Fri, Apr 9, 2021, 7:32 AM ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> We are using a haversine distance function for this, and wrapping it in
> udf.
>
> from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
> from pyspark.sql.types import *
>
> def haversine_distance(long_x, lat_x, long_y, lat_y):
>     return acos(
>         sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
>         cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
>         cos(toRadians(long_x) - toRadians(long_y))
>     ) * lit(6371.0)
>
> distudf = udf(haversine_distance, FloatType())
>
> In case you just want to use Spark SQL, you can still utilize the
> functions shown above to implement it in SQL.
>
> Any reason you do not want to use a UDF?
>
> Credit:
> <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>
>
> On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru <rao.m...@outlook.com> wrote:
>
>> Hi All,
>>
>> I have a requirement to calculate the distance between four
>> coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in a PySpark
>> dataframe, with the help of "from geopy import distance", without using
>> a UDF (user-defined function). Please help with how to achieve this.
>>
>> Thanks,
>> Ankamma Rao B
>
> --
> Best Regards,
> Ayan Guha