The batch version of this is part of rowSimilarities (JIRA 4823). If your query points can fit in memory, there is a broadcast version which we are experimenting with internally. We are using brute-force KNN right now in the PR. Based on the FLANN paper, LSH did not work well, but before you go to approximate KNN you have to make sure your top-k precision/recall is not degrading compared to brute force in your CV flow.
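To make the precision/recall check concrete, here is a minimal sketch in plain Python. It is not the code from the PR; it only illustrates the idea of comparing an approximate top-k list against the brute-force one. The function names, the Euclidean distance, and the `sampled_knn` stand-in (which just scans a random subset of candidates in place of a real approximate index such as LSH) are all assumptions made for illustration.

```python
import heapq
import math
import random


def brute_force_knn(points, query, k):
    """Exact top-k: indices of the k points closest to `query` (Euclidean)."""
    scored = ((math.dist(p, query), i) for i, p in enumerate(points))
    return [i for _, i in heapq.nsmallest(k, scored)]


def topk_precision_recall(exact_ids, approx_ids):
    """Precision and recall of an approximate top-k list against the exact one."""
    exact, approx = set(exact_ids), set(approx_ids)
    hits = len(exact & approx)
    precision = hits / len(approx) if approx else 0.0
    recall = hits / len(exact) if exact else 0.0
    return precision, recall


def sampled_knn(points, query, k, candidate_ids):
    """Hypothetical approximate search: scans only `candidate_ids`.

    This stands in for whatever approximate index is being evaluated.
    """
    scored = ((math.dist(points[i], query), i) for i in candidate_ids)
    return [i for _, i in heapq.nsmallest(k, scored)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n, k = 20, 2000, 10
    points = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    query = [rng.gauss(0, 1) for _ in range(dim)]

    exact = brute_force_knn(points, query, k)
    # Approximate variant: look at only half of the points.
    candidates = rng.sample(range(n), n // 2)
    approx = sampled_knn(points, query, k, candidates)

    p, r = topk_precision_recall(exact, approx)
    print(f"precision@{k}={p:.2f} recall@{k}={r:.2f}")
```

In a CV flow you would average these two numbers over a held-out set of query points and only switch to the approximate method if they stay close to 1.0.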
I have not yet extracted the KNN model, but it will use the IndexedRowMatrix changes that we put in the PR.

On May 19, 2015 12:58 PM, "Xiangrui Meng" <men...@gmail.com> wrote:

> Spark SQL doesn't provide spatial features. Large-scale KNN is usually
> combined with locality-sensitive hashing (LSH). This Spark package may
> be helpful: http://spark-packages.org/package/mrsqueeze/spark-hash.
> -Xiangrui
>
> On Sat, May 9, 2015 at 9:25 PM, Dong Li <lid...@lidong.net.cn> wrote:
> > Hello experts,
> >
> > I’m new to Spark, and want to find the K nearest neighbors in a huge,
> > high-dimensional point dataset in a very short time.
> >
> > The scenario is: the dataset contains more than 10 million points, each
> > with 200 dimensions. I’m building a web service that receives one new
> > point per request and returns the K nearest points in that dataset, and
> > I also need to keep the time cost low. I have a cluster with several
> > high-memory nodes for this service.
> >
> > Currently I only have these ideas:
> > 1. Create several ball-tree instances on each node when the service
> > initializes. This is fast, but does not scale well with the data. I
> > cannot insert new nodes into the ball-trees unless I restart the
> > services and rebuild them.
> > 2. Use a SQL-based solution. Some databases like PostgreSQL and
> > SqlServer have spatial search features. But these databases may not
> > perform well in a big data environment. (Does SparkSQL have spatial
> > features or a spatial index?)
> >
> > Based on your experience, can I achieve this scenario in Spark SQL? Or
> > do you know other projects in the Spark stack that work well for this?
> > Any ideas are appreciated, thanks very much.
> >
> > Regards,
> > Dong