Feng Zhang created SEDONA-648:
---------------------------------
Summary: Implement Distributed K Nearest Neighbor Join
Key: SEDONA-648
URL: https://issues.apache.org/jira/browse/SEDONA-648
Project: Apache Sedona
Issue Type: New Feature
Reporter: Feng Zhang
A geospatial k-nearest-neighbor (kNN) join is a kNN join over geospatial
data. For a given spatial point or region, it identifies the k nearest
neighbors by geographic proximity, using spatial coordinates and a suitable
distance metric such as Euclidean or great-circle distance.
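To make the two metrics concrete, here is a minimal Python sketch. It is an illustration only, not Sedona's implementation: the function names and the `EARTH_RADIUS_M` constant are hypothetical, and the great-circle distance uses the haversine formula on a spherical Earth, which only approximates a true spheroid distance.

```python
import math

# Assumption: a spherical Earth with the mean radius in metres.
EARTH_RADIUS_M = 6_371_008.8


def euclidean(p, q):
    """Planar distance between two (x, y) points, in the units of the projection."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def great_circle(p, q, radius=EARTH_RADIUS_M):
    """Haversine distance between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))
```

Which function applies depends on the coordinate system of the data: longitude/latitude pairs call for the great-circle variant, while projected coordinates can use the planar one.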
A kNN join operation involves two datasets. For each record in the first
dataset, it finds the k-nearest neighbors from the second dataset based on a
given distance metric. In a distributed environment, this process involves
several challenges:
* **Data Partitioning**: Data needs to be partitioned across different nodes
in a way that minimizes the inter-node communication and balances the load
among nodes.
* **Efficient Search**: Implementing efficient algorithms that can quickly
find the k-nearest neighbors among potentially billions of data points.
* **Data Locality**: Keeping data as close as possible to where it is
processed to reduce network transfers and latency.
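As a reference for the semantics a distributed implementation has to reproduce, the kNN join described above can be sketched as a single-machine brute-force loop. This is a hedged illustration, not the proposed algorithm: `knn_join` and the sample rows are hypothetical, the distance defaults to Euclidean, and the cost is one distance evaluation per pair of records, which is exactly what partitioning and indexing aim to avoid.

```python
import heapq
import math


def knn_join(left, right, k, dist=math.dist):
    """Map each left id to the ids of its k nearest right rows.

    `left` and `right` are lists of (id, point) tuples (hypothetical layout).
    Brute force: O(len(left) * len(right)) distance evaluations.
    """
    result = {}
    for left_id, left_pt in left:
        # nsmallest returns the k closest rows, ordered nearest first.
        nearest = heapq.nsmallest(k, right, key=lambda row: dist(left_pt, row[1]))
        result[left_id] = [right_id for right_id, _ in nearest]
    return result


queries = [("q1", (0.0, 0.0)), ("q2", (10.0, 10.0))]
objects = [("a", (1.0, 0.0)), ("b", (0.0, 2.0)),
           ("c", (9.0, 9.0)), ("d", (20.0, 20.0))]
print(knn_join(queries, objects, k=2))
```

Each left record yields at most k matches, so the join output has at most k times as many rows as the left dataset, regardless of the size of the right dataset.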
- Syntax Definition
```sql
SELECT <column_list>
FROM <tableR>
JOIN <tableS> ON ST_KNN(<tableR.column>, <tableS.column>, <k>, <use_spheroid>)
```
- **Parameters:**
  - **`<column_list>`**: The list of columns to be selected from both tables.
  - **`<tableR>`**: The left table in the join.
  - **`<tableS>`**: The right table in the join.
  - **`<tableR.column>`**: The column from the left table containing
    geometric data.
  - **`<tableS.column>`**: The column from the right table containing
    geometric data.
  - **`<k>`**: The number of nearest neighbors from the right table to match
    to each record of the left table.
  - **`<use_spheroid>`**: Whether the distance calculation is based on a
    spherical coordinate system (e.g., WGS 84 Long Lat, SRID=4326). Set it to
    false to use a projected coordinate system (e.g., Mercator EPSG:3785).
- Example Usage:
```sql
SELECT R.id, S.id, R.location, S.location
FROM TableR R
JOIN TableS S ON ST_KNN(R.location, S.location, 5, true)
```
In this example, each row of **`TableR`** is joined with its 5 approximate
nearest neighbors in **`TableS`**, measured between their **`location`**
columns using the great-circle distance (GCD) metric.