[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2017-02-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882187#comment-15882187
 ] 

Nick Pentreath commented on SPARK-2336:
---

I think it's safe to say that this now lives in a Spark package (which seems 
reasonably actively maintained, which is great), so if anyone wants this, that is 
where to look. I further think it's safe to say this is not going to be 
prioritised for MLlib, so shall we close this ticket?

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-11-22 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15021527#comment-15021527
 ] 

Sen Fang commented on SPARK-2336:
-

I finally took a crack at the hybrid spill tree for kNN, and the results so far 
appear promising. For anyone who is still interested, you can find it as a 
Spark package at: https://github.com/saurfang/spark-knn

The implementation is written for the ml API and scales well in terms of both 
the number of observations and the number of vector dimensions. The KNN itself 
is flexible, and the package comes with KNNClassifier and KNNRegression for 
(optionally weighted) classification and regression.
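
For anyone curious, a minimal usage sketch (assuming the estimator follows the 
standard spark.ml fit/transform conventions; the import path and the setter 
names shown here, setK and setTopTreeSize, are illustrative and may differ from 
the released API):

  // import path assumed; check the package README for the exact location
  import org.apache.spark.ml.classification.KNNClassifier

  val knn = new KNNClassifier()
    .setFeaturesCol("features")
    .setLabelCol("label")
    .setK(10)               // neighbors used for the (optionally weighted) vote
    .setTopTreeSize(1000)   // sample size used to build the top-level tree

  val model = knn.fit(trainingData)            // trainingData: DataFrame with label + features
  val predictions = model.transform(testData)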

There are a few implementation details I am still trying to iron out. Otherwise, 
I look forward to benchmarking it against other implementations such as KNN-join, 
KD-tree, and LSH.

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.






[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-06-18 Thread Qiyuan Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592815#comment-14592815
 ] 

Qiyuan Qiu commented on SPARK-2336:
---

Hi [~saurfang] and [~datawlb], 

May I ask why both of you based your implementations on the same paper, 
http://dx.doi.org/10.1109/WACV.2007.18? 
It seems to me that the kNN-join approach at http://ww2.cs.fsu.edu/~czhang/knnjedbt/ 
mentioned earlier in this thread is reasonably good. 



> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.






[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-06-12 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583886#comment-14583886
 ] 

Debasish Das commented on SPARK-2336:
-

Very cool idea, Sen. Did you also look into FLANN for its randomized KDTree and 
KMeansTree indexes? We have a PR for rowSimilarities which we will use to compare 
QoR against your PR as soon as you open up a stable version.


> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.






[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-05-03 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526153#comment-14526153
 ] 

Sen Fang commented on SPARK-2336:
-

Hey Longbao, great to hear from you. To the best of my understanding, in the paper 
I cited above they distribute the input points by pushing them through the top 
tree (figure 4). Of course, for a precise result this means we need to backtrack, 
which isn't ideal. What they propose is to use a buffer boundary like a spill 
tree. However, unlike a spill tree, here you push an input point to both children 
if it falls within the buffer zone, because the top tree was built as a metric 
tree (they explain that using a spill tree as the top tree has a high memory 
penalty). So every input point might end up in multiple subtrees, and you need a 
reduceByKey at the end to keep only the top k neighbors.
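
To make that last step concrete, a rough sketch of the merge (Neighbor, 
candidates, and k are illustrative names; candidates holds one partial result per 
(query point, subtree) pair, keyed by query id):

  case class Neighbor(pointId: Long, distance: Double)

  // Merge two candidate lists, keeping only the k closest neighbors.
  def mergeTopK(k: Int)(a: Seq[Neighbor], b: Seq[Neighbor]): Seq[Neighbor] =
    (a ++ b).sortBy(_.distance).take(k)

  // candidates: RDD[(Long, Seq[Neighbor])] -- partial kNN results per subtree
  val topK = candidates.reduceByKey(mergeTopK(k))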

Is your implementation available somewhere? I'm having a hard time finding time 
to finish my implementation this month. It would be great if eventually we could 
compare our implementations, validate, and benchmark.

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.






[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-05-01 Thread longbao wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523226#comment-14523226
 ] 

longbao wang commented on SPARK-2336:
-

I really agree with you, and I'm already implementing it, but I have run into a 
problem: after the tree is built successfully, you search for the target points' 
kNN by parallelizing the input target points and then searching. I think this 
raises some questions, since one point's kNN may lie in two or more partitions.

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.







[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-04-05 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396306#comment-14396306
 ] 

Sen Fang commented on SPARK-2336:
-

I'm tentatively going to give the hybrid spill tree implementation described in 
the OP's post a try. Specifically, I'm going to follow the implementation 
described in: http://dx.doi.org/10.1109/WACV.2007.18

According to the paper, the algorithm scales well in terms of the number of 
observations, bounded by the available memory across the cluster (billions of 
points in the paper's example). The original hybrid spill tree paper claims the 
algorithm scales much better than other alternatives (LSH, metric tree) when it 
comes to the number of features (up to hundreds). Furthermore, random projection 
is often used to reduce the dimensionality of the feature space. Random 
projection is out of scope for my implementation and should probably be 
implemented as a separate dimension-reduction technique. (In the paper, 4000 
features are projected down to 100.)

The average runtime for both training and prediction is generally O(log n). 
Training runtime suffers if we turn up the buffer size: the search is only 
approximate at overlapping nodes, and the larger the buffer size, the more 
accurate the search. Prediction runtime suffers when a backtracking search is 
needed at a non-overlapping node (which becomes more likely as the balance 
threshold increases). 

More specifically, a metric tree promises accurate but less performant search, 
while a spill tree creates trees with overlapping nodes and uses the more 
efficient defeatist search, whose accuracy depends on the buffer size *tau* (the 
larger it is, the more accurate the search, but the deeper the tree). A hybrid 
spill tree is constructed with both overlapping (spill tree) and non-overlapping 
(metric tree) nodes and uses a balance threshold *rho* to balance tree depth 
against search efficiency.


A high-level overview of the algorithm is as follows:
1. Sample M data points (M = number of partitions)
2. Build the top-level metric tree from the sample
3. Repartition the RDD by assigning each point to a leaf node of the above tree
4. Build a hybrid spill tree at each partition 
This concludes the training phase of kNN.
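
As a rough sketch of how these steps map onto Spark primitives (MetricTree and 
SpillTree are hypothetical helpers standing in for the actual tree code; the 
point is only to show where the driver/executor boundary falls):

  import scala.reflect.ClassTag
  import org.apache.spark.HashPartitioner
  import org.apache.spark.rdd.RDD

  def trainKNN[P: ClassTag](points: RDD[P], numPartitions: Int): RDD[SpillTree[P]] = {
    // 1. Sample M points to the driver (M = number of partitions).
    val sample = points.takeSample(withReplacement = false, num = numPartitions)
    // 2. Build the small top-level metric tree on the driver.
    val topTree = MetricTree.build(sample)
    // 3. Route each point to the top-tree leaf it falls into, one leaf per partition.
    val routed = points
      .map(p => (topTree.leafIndex(p), p))
      .partitionBy(new HashPartitioner(numPartitions))
      .values
    // 4. Build one hybrid spill tree per partition; together they form the model.
    routed.mapPartitions(iter => Iterator(SpillTree.build(iter.toSeq)))
  }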

Prediction is achieved by identifying the subtree a query point falls into and 
then running the search on that subtree.


Let me know if anyone has any thoughts, concerns or corrections on the things I 
said above.


> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.






[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-02-24 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335476#comment-14335476
 ] 

Xiangrui Meng commented on SPARK-2336:
--

[~Rusty] Could you provide a summary of your plan? For example,

1. Which paper are you going to follow? Any modification to the algorithm 
proposed in the paper?
2. What's the complexity of the algorithm and the expected scalability?
3. What is the trade-off between the approximation error and the cost? How do you 
want to expose it to users?

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.






[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-02-24 Thread Ashutosh Trivedi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334880#comment-14334880
 ] 

Ashutosh Trivedi commented on SPARK-2336:
-

Hi [~mengxr], we are going through the paper. We are already working on this 
problem, but we were using z-scores for the locality search. We would like to take 
up this task; the paper looks interesting.

_Ashu

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.






[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2014-10-26 Thread Ashutosh Trivedi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184848#comment-14184848
 ] 

Ashutosh Trivedi commented on SPARK-2336:
-

Is anybody already working on this? I can take up this task. We could also 
implement KNN joins, which would be a nice utility for data mining.

Here is the link for KNN-joins:
http://ww2.cs.fsu.edu/~czhang/knnjedbt/

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: features, newbie
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.


