to find Difference of locations in Spark Dataframe rows

2022-06-07 Thread Chetan Khatri
Hi Dear Spark Users,

It has been many years since I last worked on Spark, so please help me out.
Thanks very much.

I have different cities and their coordinates in a DataFrame[Row]. I want to
find the distance in km and then show only those records/cities that are
within a 10 km range.

I have created a function that can find the distance in km given two
coordinates, but I don't know how to apply it across rows (one to many) to
calculate the distance.

Here is some code that I wrote; sorry, it is quite basic.

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

object HouseMatching {
  def main(args: Array[String]): Unit = {

    val search_property_id = args(0)

    // list of columns where the condition should be an exact match
    val groupOneCriteria = List(
      "occupied_by_tenant",
      "water_index",
      "electricity_index",
      "elevator_index",
      "heating_index",
      "nb_bathtubs",
      "nb_showers",
      "nb_wc",
      "nb_rooms",
      "nb_kitchens"
    )
    // list of columns where the condition should be an 80% match
    val groupTwoCriteria = List(
      "area",
      "home_condition",
      "building_age"
    )
    // list of columns where the condition should be found using Euclidean distance
    val groupThreeCriteria = List(
      "postal_code"
    )

    val region_or_city = "region"

    // haversine distance between two coordinate pairs, returned in kilometres
    def haversineDistance(destination_latitude: Column, destination_longitude: Column,
                          origin_latitude: Column, origin_longitude: Column): Column = {
      val a = pow(sin(radians(destination_latitude - origin_latitude) / 2), 2) +
        cos(radians(origin_latitude)) * cos(radians(destination_latitude)) *
          pow(sin(radians(destination_longitude - origin_longitude) / 2), 2)
      val distance = atan2(sqrt(a), sqrt(-a + 1)) * 2 * 6371
      distance
    }

    val spark = SparkSession.builder().appName("real-estate-property-matcher")
      .getOrCreate()

    // read with a header so that columns such as `ref_id` keep their names
    val housingDataDF = spark.read
      .option("header", "true")
      .csv("~/Downloads/real-estate-sample-data.csv")

    // searching for the property by `ref_id`
    val searchPropertyDF = housingDataDF.filter(col("ref_id") === search_property_id)

    // Similar house in the same city (same postal code) and group one condition
    val similarHouseAndSameCity = housingDataDF.join(searchPropertyDF,
      groupThreeCriteria ++ groupOneCriteria,
      "inner")

    // Similar house not in the same city but within a 10 km range
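    // A rough sketch (untested) of one way to do the one-to-many distance check:
    // cross join the searched property's coordinates against every other house and
    // filter on the haversine distance. The column names `latitude` and `longitude`
    // are assumptions; adjust them to the real schema of the CSV.
    val searchCoords = searchPropertyDF.select(
      col("latitude").cast("double").as("search_latitude"),
      col("longitude").cast("double").as("search_longitude")
    )

    val similarHouseWithinTenKm = housingDataDF
      .filter(col("ref_id") =!= search_property_id)   // leave out the searched property itself
      .crossJoin(searchCoords)                        // one row of coordinates against all houses
      .withColumn("distance_km",
        haversineDistance(
          col("latitude").cast("double"), col("longitude").cast("double"),
          col("search_latitude"), col("search_longitude")))
      .filter(col("distance_km") <= 10)               // keep only houses within 10 km

    similarHouseWithinTenKm.show()

    spark.stop()
  }
}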


Re: How the data is distributed

2022-06-07 Thread Sid
Thank you for the information.


On Tue, 7 Jun 2022, 03:21 Sean Owen,  wrote:

> Data is not distributed to executors by anything. If you are processing
> data with Spark, Spark spawns tasks on executors to read chunks of data
> from wherever they are (S3, HDFS, etc.).
>
>
> On Mon, Jun 6, 2022 at 4:07 PM Sid  wrote:
>
>> Hi experts,
>>
>>
>> When we load any file, I know that, based on the information in the Spark
>> session about the executors' locations, statuses, etc., the data is
>> distributed among the worker nodes and executors.
>>
>> But I have one doubt: is the data initially loaded on the driver and then
>> distributed, or is it distributed directly amongst the workers?
>>
>> Thanks,
>> Sid
>>
>
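
A small sketch to illustrate the point above (the Parquet path and app name are
placeholders, nothing from this thread): the driver only builds a lazy plan and
schedules tasks; the rows themselves are read by tasks running on the executors,
each reading its own chunk directly from the source.

import org.apache.spark.sql.SparkSession

object WhereIsDataRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("where-is-data-read")
      .getOrCreate()

    // No data is read here: the driver only records a lazy scan of the file
    // (the path below is a placeholder).
    val df = spark.read.parquet("s3a://some-bucket/some-table/")

    // The driver can report how many partitions (and therefore read tasks)
    // the scan will produce without pulling any rows to itself.
    println(s"planned partitions: ${df.rdd.getNumPartitions}")

    // Only when an action runs does Spark launch one task per partition on the
    // executors; each task reads its own chunk directly from S3/HDFS/etc.
    println(s"row count: ${df.count()}")

    spark.stop()
  }
}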