Spark SQL doesn't provide spatial features. Large-scale KNN is usually made tractable by combining it with locality-sensitive hashing (LSH), which returns approximate rather than exact neighbors. This Spark package may be helpful: http://spark-packages.org/package/mrsqueeze/spark-hash

-Xiangrui
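To make the LSH suggestion concrete, here is a minimal single-process sketch in plain Python. It is not the spark-hash API and it is not distributed. It assumes cosine similarity is an acceptable metric (the original question does not name one) and uses random-hyperplane hashing, which is one common LSH family for that metric. The class name HyperplaneLSH and the parameters num_tables and bits_per_table are made up for illustration.

```python
import heapq
import math
import random
from collections import defaultdict


def cosine_distance(a, b):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


class HyperplaneLSH:
    """Toy random-hyperplane LSH index (hypothetical, for illustration only)."""

    def __init__(self, dim, num_tables=8, bits_per_table=12, seed=0):
        rng = random.Random(seed)
        # Each hash table gets its own set of random hyperplane normals.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(bits_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _key(self, table_idx, point):
        # One bit per hyperplane: which side of the plane the point lies on.
        return tuple(
            sum(w * x for w, x in zip(plane, point)) >= 0.0
            for plane in self.planes[table_idx]
        )

    def add(self, point):
        """Insert a point without rebuilding the index; returns its id."""
        idx = len(self.points)
        self.points.append(point)
        for t, table in enumerate(self.tables):
            table[self._key(t, point)].append(idx)
        return idx

    def query(self, point, k):
        """Return ids of up to k approximate nearest neighbors."""
        # Candidates are points sharing a bucket with the query in any table.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, point), ()))
        # Re-rank only the candidates by exact cosine distance.
        return heapq.nsmallest(
            k, candidates,
            key=lambda i: cosine_distance(point, self.points[i]),
        )
```

Because add() only appends to hash buckets, new points can be inserted while the service is running, which is the limitation raised about ball trees in the quoted message. More tables raise recall at the cost of memory, and more bits per table shrink the buckets. Results are approximate: a true neighbor that shares no bucket with the query is missed.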
On Sat, May 9, 2015 at 9:25 PM, Dong Li <lid...@lidong.net.cn> wrote:
> Hello experts,
>
> I’m new to Spark, and I want to find the K nearest neighbors in a very
> large, high-dimensional point dataset in a very short time.
>
> The scenario is: the dataset contains more than 10 million points, each
> with 200 dimensions. I’m building a web service that receives one new point
> per request and returns the K nearest points in that dataset, and I need to
> keep the time cost low. I have a cluster with several high-memory nodes for
> this service.
>
> Currently I only have these ideas:
> 1. Create several ball-tree instances on each node when the service
> initializes. This is fast, but it does not scale well as the data grows: I
> cannot insert new nodes into the ball trees unless I restart the services
> and rebuild them.
> 2. Use a SQL-based solution. Some databases, such as PostgreSQL and SQL
> Server, have spatial search features, but these databases may not perform
> well in a big data environment. (Does Spark SQL have spatial features or a
> spatial index?)
>
> Based on your experience, can I achieve this scenario in Spark SQL? Or do
> you know of other projects in the Spark stack that work well for this?
> Any ideas are appreciated, thanks very much.
>
> Regards,
> Dong
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org