Spark SQL doesn't provide spatial features. Large-scale KNN is usually
combined with locality-sensitive hashing (LSH). This Spark package may
be helpful: http://spark-packages.org/package/mrsqueeze/spark-hash.
-Xiangrui
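
To make the LSH suggestion concrete, here is a minimal single-machine sketch of random-hyperplane LSH in plain Python. It is an illustration only, not the API of the spark-hash package: every function name and parameter below is hypothetical. Random-hyperplane signatures approximate angular similarity, and the final ranking by Euclidean distance is an assumption about the metric you want.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(point, hyperplanes):
    """One bit per hyperplane: which side of the plane the point falls on."""
    return tuple(
        1 if sum(p * h for p, h in zip(point, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(points, hyperplanes):
    """Group point indices by signature, so similar points tend to share a bucket."""
    buckets = {}
    for idx, point in enumerate(points):
        buckets.setdefault(signature(point, hyperplanes), []).append(idx)
    return buckets


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def query(point, points, buckets, hyperplanes, k):
    """Approximate KNN: rank only the candidates in the query's own bucket."""
    candidates = buckets.get(signature(point, hyperplanes), [])
    ranked = sorted(candidates, key=lambda i: squared_distance(point, points[i]))
    return ranked[:k]
```

A query compares the new point only against the points in its bucket instead of all 10 million, so the result is approximate and can contain fewer than K points. Practical setups usually keep several independent hash tables and merge their candidates to improve recall. Adding a point only appends its index to one bucket, so the index can grow without a rebuild.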

On Sat, May 9, 2015 at 9:25 PM, Dong Li <lid...@lidong.net.cn> wrote:
> Hello experts,
>
> I’m new to Spark and want to find the K nearest neighbors in a very large,
> high-dimensional point dataset in a very short time.
>
> The scenario: the dataset contains more than 10 million points, each with
> 200 dimensions. I’m building a web service that receives one new point per
> request and returns the K nearest points in that dataset, and I need to
> keep the response time low. I have a cluster with several high-memory
> nodes for this service.
>
> Currently I only have these ideas:
> 1. Create several ball-tree instances on each node when the service
> initializes. This is fast, but it does not scale well as the data grows: I
> cannot insert new nodes into the ball trees unless I restart the services and
> rebuild them.
> 2. Use a SQL-based solution. Some databases, such as PostgreSQL and SQL Server,
> have spatial search features. But these databases may not perform well in a
> big data environment. (Does Spark SQL have spatial features or a spatial index?)
>
> Based on your experience, can I achieve this in Spark SQL? Or do you
> know of other projects in the Spark stack that work well for this?
> Any ideas are appreciated. Thanks very much.
>
> Regards,
> Dong
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
