[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882187#comment-15882187 ]

Nick Pentreath commented on SPARK-2336:
---------------------------------------

I think it's safe to say that this now lives in a Spark package (one that seems reasonably actively maintained, which is great), so anyone who wants this functionality should look there. I also think it's safe to say this is not going to be prioritised for MLlib, so shall we close this ticket?

> Approximate k-NN Models for MLLib
> ---------------------------------
>
>                 Key: SPARK-2336
>                 URL: https://issues.apache.org/jira/browse/SPARK-2336
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Brian Gawalt
>            Priority: Minor
>              Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to
> also offer approximate k-Nearest Neighbor. A promising approach would involve
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1=2
> This could offer a simple non-linear ML model that can label new data with
> much lower latency than the plain-vanilla kNN versions.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021527#comment-15021527 ]

Sen Fang commented on SPARK-2336:
---------------------------------

I finally took a crack at the hybrid spill tree for kNN, and the results so far appear promising. For anyone who is still interested, you can find it as a Spark package at: https://github.com/saurfang/spark-knn

The implementation is written for the ml API and scales well in terms of both the number of observations and the number of vector dimensions. The KNN itself is flexible, and the package comes with KNNClassifier and KNNRegression for (optionally weighted) classification and regression. There are a few implementation details I am still trying to iron out. Otherwise, I look forward to benchmarking it against other implementations such as kNN-join, KD-tree, and LSH.
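For readers unfamiliar with what "optionally weighted" means here, below is a minimal plain-Python sketch of distance-weighted kNN voting and averaging. This is not the spark-knn code itself; the function names and the 1/distance weighting scheme are illustrative assumptions, shown only to make the classification/regression distinction concrete.

```python
import math
from collections import defaultdict

def weighted_knn_classify(neighbors, weighted=True, eps=1e-9):
    """Vote among neighbors given as (distance, label) pairs.

    With weighted=True, each neighbor votes with weight 1/distance,
    so closer neighbors count more; otherwise all votes are equal.
    """
    votes = defaultdict(float)
    for dist, label in neighbors:
        votes[label] += 1.0 / (dist + eps) if weighted else 1.0
    return max(votes, key=votes.get)

def weighted_knn_regress(neighbors, weighted=True, eps=1e-9):
    """Predict the (optionally distance-weighted) mean of neighbor targets,
    with neighbors given as (distance, target) pairs."""
    if weighted:
        weights = [1.0 / (d + eps) for d, _ in neighbors]
    else:
        weights = [1.0] * len(neighbors)
    total = sum(w * y for w, (_, y) in zip(weights, neighbors))
    return total / sum(weights)
```

Note how the weighted and unweighted votes can disagree: with neighbors `[(0.1, "a"), (5.0, "b"), (6.0, "b")]`, the unweighted majority is "b", but the single very close "a" neighbor dominates the weighted vote.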
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592815#comment-14592815 ]

Qiyuan Qiu commented on SPARK-2336:
-----------------------------------

Hi [~saurfang] and [~datawlb],

May I ask why both of you based your implementations on the same paper, http://dx.doi.org/10.1109/WACV.2007.18? It seems to me that the kNN-join approach at http://ww2.cs.fsu.edu/~czhang/knnjedbt/ mentioned by the OP is reasonably good.
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583886#comment-14583886 ]

Debasish Das commented on SPARK-2336:
-------------------------------------

Very cool idea, Sen. Did you also look into FLANN for randomized KD-trees and k-means trees? We have a PR for rowSimilarities, which we will use to compare against the QoR of your PR as soon as you open up a stable version.
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526153#comment-14526153 ]

Sen Fang commented on SPARK-2336:
---------------------------------

Hey Longbao, great to hear from you. To the best of my understanding, in the paper I cited above, they distribute the input points by pushing them through the top tree (figure 4). Of course, for a precise result, this means we would need to backtrack, which isn't ideal. What they propose instead is to use a buffer boundary, like a spill tree. However, unlike in a spill tree, here you push an input target to BOTH children if it falls within the buffer zone, because the top tree is built as a metric tree (they explain that using a spill tree as the top tree carries a high memory penalty). So every input might end up in multiple subtrees, and you need a reduceByKey at the end to keep only the top k neighbors.

Is your implementation available somewhere? I'm having a hard time finding time to finish my implementation this month. It would be great if eventually we could compare our implementations, validate, and benchmark.
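The route-to-both-children-then-merge idea above can be sketched in plain, single-machine Python. This is a toy stand-in, not the paper's algorithm verbatim: it assumes 1-D points, a single top-tree split, and a brute-force local search, and the final merge step mimics what a reduceByKey over (query, candidate-list) pairs would do across partitions.

```python
import heapq

def route(x, split_point, tau):
    """Send a 1-D query to the left/right child of a top-tree node.

    If x falls within the buffer zone of width tau around the split,
    it is sent to BOTH children, so no backtracking is needed later.
    """
    children = []
    if x <= split_point + tau:
        children.append("left")
    if x > split_point - tau:
        children.append("right")
    return children

def local_knn(x, points, k):
    """Exact kNN within one subtree's points, as (distance, point) pairs."""
    return heapq.nsmallest(k, ((abs(p - x), p) for p in points))

def merge_top_k(candidate_lists, k):
    """reduceByKey-style merge: keep the k nearest across all subtrees."""
    merged = [c for lst in candidate_lists for c in lst]
    return heapq.nsmallest(k, merged)

# Query 4.9 against a top tree split at 5.0 with buffer tau = 0.5:
subtrees = {"left": [1.0, 4.0, 4.8], "right": [5.1, 6.0, 9.0]}
hits = [local_knn(4.9, subtrees[c], k=2) for c in route(4.9, 5.0, 0.5)]
print(merge_top_k(hits, k=2))  # the true nearest neighbors straddle the split
```

The example also illustrates Longbao's concern below: the two nearest neighbors of 4.9 live on opposite sides of the split, so a query routed to only one child would miss one of them; the buffer zone is what saves the search without backtracking.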
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523226#comment-14523226 ]

longbao wang commented on SPARK-2336:
-------------------------------------

I really agree with you, and I'm already implementing it, but I have run into a problem. After the tree is built successfully, you search for the target points' kNN by parallelizing the input target points and then searching. I think this raises a question, though: one point's kNN may lie in two or more partitions.
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14396306#comment-14396306 ]

Sen Fang commented on SPARK-2336:
---------------------------------

I'm tentatively going to give the hybrid spill tree implementation described in the OP's post a try. Specifically, I'm going to follow the implementation described in: http://dx.doi.org/10.1109/WACV.2007.18

According to the paper, the algorithm scales well in the number of observations, bounded by the available memory across the cluster (billions in the paper's example). The original hybrid spill tree paper claims the algorithm scales much better than the alternatives (LSH, metric tree) in the number of features (up to hundreds). Furthermore, random projection is often used to reduce the dimensionality of the feature space (in the paper, 4000 features are projected down to 100). Random projection is out of scope for my implementation and should probably be implemented as a separate dimension reduction technique.

The average runtime for training and prediction is generally O(log n). The training runtime will suffer if we turn up the buffer size: the search is only approximate in overlapping nodes, and the larger the buffer size, the more accurate the search. The prediction runtime will suffer when a backtracking search is needed in a non-overlapping node (which becomes more likely as the balance threshold increases).

More specifically, a metric tree promises an accurate but less performant search, while a spill tree creates overlapping nodes and uses the more efficient defeatist search, whose accuracy is governed by the buffer size *tau* (the larger it is, the more accurate the search, but the deeper the tree). A hybrid spill tree is constructed with both overlapping (spill tree) and non-overlapping (metric tree) nodes and uses a balance threshold *rho* to trade tree depth against search efficiency.

A high level overview of the algorithm is as follows:
1. Sample M data points (M = number of partitions)
2. Build the top level metric tree
3. Repartition the RDD by assigning each point to a leaf node of the above tree
4. Build a hybrid spill tree at each partition

This concludes the training phase of kNN. Prediction is achieved by identifying the subtree a point falls into and then running the search on that subtree.

Let me know if anyone has any thoughts, concerns or corrections on the things I said above.
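The four training steps above can be sketched in plain Python. This is a single-machine toy under loud assumptions: points are 1-D, the "top tree" is just a set of sorted split boundaries from a small sample, and a plain sorted list stands in for each partition's hybrid spill tree; the real algorithm uses metric-tree splits on high-dimensional vectors and Spark partitions instead of a dict.

```python
import random

def build_top_tree(sample, num_leaves):
    """Steps 1-2: sort a small sample and cut it into num_leaves ranges.

    Each boundary plays the role of a top-tree metric split; the
    boundaries decide which 'partition' (leaf) a point is routed to.
    """
    s = sorted(sample)
    step = max(1, len(s) // num_leaves)
    return [s[i] for i in range(step, len(s), step)][: num_leaves - 1]

def assign_leaf(x, boundaries):
    """Route a point down the top tree to its leaf index."""
    for i, b in enumerate(boundaries):
        if x < b:
            return i
    return len(boundaries)

def train(points, num_partitions, sample_size=8, seed=0):
    """Steps 3-4: repartition points by leaf, then build a local index
    per partition (here just a sorted list standing in for the
    per-partition hybrid spill tree)."""
    random.seed(seed)
    sample = random.sample(points, min(sample_size, len(points)))
    boundaries = build_top_tree(sample, num_partitions)
    partitions = {i: [] for i in range(num_partitions)}
    for p in points:
        partitions[assign_leaf(p, boundaries)].append(p)
    return boundaries, {i: sorted(ps) for i, ps in partitions.items()}

def predict(x, boundaries, partitions, k):
    """Prediction: route the query to one subtree, search only there."""
    local = partitions[assign_leaf(x, boundaries)]
    return sorted(local, key=lambda p: abs(p - x))[:k]
```

Note that `predict` here routes each query to exactly one leaf; it omits the buffer-zone routing to both children, which is what makes the full algorithm robust near partition boundaries.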
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335476#comment-14335476 ]

Xiangrui Meng commented on SPARK-2336:
--------------------------------------

[~Rusty] Could you provide a summary of your plan? For example:
1. Which paper are you going to follow? Any modifications to the algorithm proposed in the paper?
2. What is the complexity of the algorithm and the expected scalability?
3. What is the trade-off between the approximation error and the cost? How do you want to expose it to users?
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334880#comment-14334880 ]

Ashutosh Trivedi commented on SPARK-2336:
-----------------------------------------

Hi [~mengxr], we are going through the paper. We were already working on this problem, but using z-scores for the locality search. We would like to take up this task; the paper looks interesting.
_Ashu
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184848#comment-14184848 ]

Ashutosh Trivedi commented on SPARK-2336:
-----------------------------------------

Is anybody already working on it? I can take up this task. We could also implement kNN joins, which would be a nice utility for data mining. Here is the link for kNN-joins: http://ww2.cs.fsu.edu/~czhang/knnjedbt/