Hello experts, I’m new to Spark and want to find the K nearest neighbors of a query point in a very large, high-dimensional point dataset, with low latency.
The scenario is: the dataset contains more than 10 million points, each with 200 dimensions. I’m building a web service that receives one new point per request and returns the K nearest points from that dataset, and I need to keep the response time per request low. I have a cluster with several high-memory nodes for this service.

Currently I only have these ideas:

1. Create several ball-tree instances on each node when the service initializes. Queries are fast, but this does not scale well as the data grows: I cannot insert new points into the ball trees unless I restart the services and rebuild them.

2. Use a SQL-based solution. Some databases, such as PostgreSQL and SQL Server, have spatial search features, but these databases may not perform well in a big data environment. (Does Spark SQL have spatial features or a spatial index?)

Based on your experience, can I achieve this scenario in Spark SQL? Or do you know of other projects in the Spark stack that work well for this?

Any ideas are appreciated, thanks very much.

Regards,
Dong
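
P.S. To make idea 1 more concrete, here is a minimal sketch in plain Python of the query pattern I have in mind: the dataset is split into shards, each shard returns its own top K, and the partial results are merged into a global top K. This is only an illustration under my own assumptions. The names `Shard`, `top_k` and `knn` are hypothetical, and each shard does a brute-force scan instead of using a real ball tree, so it shows the partition-and-merge structure rather than the speed of a tree index.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance; it preserves nearest-neighbor ordering."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class Shard:
    """One in-memory partition of the dataset (stand-in for a per-node ball tree)."""

    def __init__(self) -> None:
        self.ids: List[int] = []
        self.points: List[Point] = []

    def insert(self, point_id: int, point: Point) -> None:
        # A plain list accepts new points without any rebuild;
        # a real ball tree would not allow this so easily.
        self.ids.append(point_id)
        self.points.append(point)

    def top_k(self, query: Point, k: int) -> List[Tuple[float, int]]:
        # Linear scan of this shard, keeping only the k smallest distances.
        candidates = (
            (squared_distance(query, p), i)
            for i, p in zip(self.ids, self.points)
        )
        return heapq.nsmallest(k, candidates)


def knn(shards: List[Shard], query: Point, k: int) -> List[Tuple[int, float]]:
    """Merge per-shard top-k lists into the global top-k as (id, distance) pairs."""
    partial_hits = (hit for shard in shards for hit in shard.top_k(query, k))
    merged = heapq.nsmallest(k, partial_hits)
    return [(point_id, math.sqrt(d2)) for d2, point_id in merged]
```

In the real service each shard would live on a different node and `knn` would run in the web front end, so only K candidates per shard cross the network. What I am unsure about is what should replace the brute-force scan so that queries stay fast while new points can still be inserted.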