Actually, good question; I'm not sure. I don't think Spark vectorizes these operations across rows, whereas a pandas UDF is handed a whole batch of rows as pandas data, so you can apply operations like sin to thousands of values at once in native code via numpy. The computation is trivially 'vectorizable', and I've seen good wins over, at least, a row-at-a-time UDF.
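
For illustration, here is a rough sketch of what the vectorized version could look like. It uses the same formula and Earth radius as the function quoted further down in this thread, rewritten with numpy so that it works on whole pandas Series. The names great_circle_km and make_great_circle_udf are made up for this example, and the pyspark import is kept inside the wrapper so the numeric part can be tried without a Spark session. It assumes Spark 3.x with pyarrow installed; I have not benchmarked it.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # same constant as in the quoted function


def great_circle_km(long_x: pd.Series, lat_x: pd.Series,
                    long_y: pd.Series, lat_y: pd.Series) -> pd.Series:
    """Distance in km for a whole batch of coordinate pairs given in degrees."""
    long_x, lat_x, long_y, lat_y = (
        np.radians(s) for s in (long_x, lat_x, long_y, lat_y)
    )
    cos_angle = (
        np.sin(lat_x) * np.sin(lat_y)
        + np.cos(lat_x) * np.cos(lat_y) * np.cos(long_x - long_y)
    )
    # Rounding can push the value slightly outside [-1, 1], where arccos is NaN.
    return np.arccos(cos_angle.clip(-1.0, 1.0)) * EARTH_RADIUS_KM


def make_great_circle_udf():
    """Wrap the function as a Series-to-Series pandas UDF (hypothetical helper)."""
    # Imported lazily so the function above can be tested without pyspark.
    from pyspark.sql.functions import pandas_udf
    return pandas_udf(great_circle_km, returnType="double")
```

With a DataFrame df that has, say, columns lon1, lat1, lon2, lat2 (hypothetical names), this would be applied as df.withColumn("dist_km", make_great_circle_udf()("lon1", "lat1", "lon2", "lat2")).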

On Fri, Apr 9, 2021 at 9:14 AM ayan guha <guha.a...@gmail.com> wrote:

> Hi Sean - absolutely open to suggestions.
>
> My impression was using spark native functions should provide similar perf
> as scala ones because serialization penalty should not be there, unlike
> native python udfs.
>
> Is it wrong understanding?
>
> On Fri, 9 Apr 2021 at 10:55 pm, Rao Bandaru <rao.m...@outlook.com> wrote:
>
>> Hi All,
>>
>> yes ,i need to add the below scenario based code to the executing spark
>> job,while executing this it took lot of time to complete,please suggest
>> best way to get below requirement without using UDF
>>
>> Thanks,
>>
>> Ankamma Rao B
>> ------------------------------
>> *From:* Sean Owen <sro...@gmail.com>
>> *Sent:* Friday, April 9, 2021 6:11 PM
>> *To:* ayan guha <guha.a...@gmail.com>
>> *Cc:* Rao Bandaru <rao.m...@outlook.com>; User <user@spark.apache.org>
>> *Subject:* Re: [Spark SQL]:to calculate distance between four
>> coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the pysaprk
>> dataframe
>>
>> This can be significantly faster with a pandas UDF, note, because you can
>> vectorize the operations.
>>
>> On Fri, Apr 9, 2021, 7:32 AM ayan guha <guha.a...@gmail.com> wrote:
>>
>> Hi
>>
>> We are using a haversine distance function for this, and wrapping it in
>> udf.
>>
>> from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
>> from pyspark.sql.types import *
>>
>> def haversine_distance(long_x, lat_x, long_y, lat_y):
>>     return acos(
>>         sin(toRadians(lat_x)) * sin(toRadians(lat_y)) +
>>         cos(toRadians(lat_x)) * cos(toRadians(lat_y)) *
>>         cos(toRadians(long_x) - toRadians(long_y))
>>     ) * lit(6371.0)
>>
>> distudf = udf(haversine_distance, FloatType())
>>
>> in case you just want to use just Spark SQL, you can still utilize the
>> functions shown above to implement in SQL.
>>
>> Any reason you do not want to use UDF?
>>
>> Credit
>> <https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark>
>>
>> On Fri, Apr 9, 2021 at 10:19 PM Rao Bandaru <rao.m...@outlook.com> wrote:
>>
>> Hi All,
>>
>> I have a requirement to calculate distance between four
>> coordinates(Latitude1, Longtitude1, Latitude2, Longtitude2) in the *pysaprk
>> dataframe *with the help of from *geopy* import *distance *without using
>> *UDF* (user defined function)*,*Please help how to achieve this scenario
>> and do the needful.
>>
>> Thanks,
>>
>> Ankamma Rao B
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
> --
> Best Regards,
> Ayan Guha