[ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227632#comment-15227632
 ] 

Daniel Blazevski edited comment on FLINK-1934 at 4/6/16 3:05 AM:
-----------------------------------------------------------------

[~chiwanpark] [~till.rohrmann]

I have a Flink version -- still a bit preliminary -- of the approximate kNN up 
and running.  The exact kNN using a quadtree performs quite badly in 
moderate-to-high spatial dimensions (e.g. with 20,000 test and training points 
in 6D, the quadtree is slower; no worries, though -- I took care of this, and 
the exact version decides whether or not to use the quadtree).  

https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala

A preliminary test shows good scaling as the number of test + training points 
is increased.  

8,000 points in 6D (i.e. 8,000 test points and 8,000 training points):
Elapsed time (approx): 2s
Elapsed time (exact): 27s

64,000 points in 6D:  
Elapsed time (approx): 6s
(didn't run the exact version; we know it's O(N^2))

I will have to clean things up, handle edge cases, etc., which may slow down 
the run-time a bit, but will definitely not increase the complexity of the 
algorithm with respect to the number of test/training points.

This still uses a cross product, which I was hoping to avoid, but I'm not sure 
that's possible.  Any thoughts?  Basically, the idea is to hash the test/training 
sets to 1D (I use the z-value hash based on [1]). 
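For anyone not familiar with the z-value hash from [1,3]: it interleaves the 
bits of the integer coordinates, so points that are close in space tend to get 
nearby 1D values.  A minimal Python sketch of the idea (illustrative only -- 
not the Scala code in the branch; the function name is mine):

```python
def z_value(coords, bits=16):
    """Interleave the bits of integer coordinates into one Morton code (z-value).

    Bit `bit` of dimension `dim` lands at position bit * len(coords) + dim,
    so the dimensions' bit streams are woven together into a single integer.
    """
    z = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return z

# Sorting points by z-value tends to keep spatial neighbors adjacent in 1D:
print(z_value((0, 0)))  # 0
print(z_value((1, 1)))  # 3
print(z_value((3, 3)))  # 15
```

Once every point is mapped to a single integer like this, a sort plus a scan of 
nearby z-values replaces most of the pairwise distance computations.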

I still have not implemented the ideas in [1] in full.  The full solution is 
quite complex.  They do a bunch of load balancing that I'm still learning, and 
I'm not quite sure of the payoff.  One option could be that I clean up and 
optimize what I have now, since it's already performing well, and we open a new 
issue to implement all the steps in [1].  

There are still many things to clean up, but any cleaning/edge cases will not 
add computational complexity with respect to the number of test points.  
e.g. I currently convert the coordinates to integers and ignore the decimal 
part, so there are lots of collisions in the z-value hash; normalizing the data 
and using a fixed max number of bits to compute the z-value fixes this (it is 
described towards the end of [3]).
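The fix described above can be sketched like this (again illustrative Python, 
not the Scala code in the branch; the function name and parameters are mine): 
each coordinate is normalized by the data's min/max and then mapped onto a 
fixed number of bits, so the decimal part is no longer thrown away and distinct 
points rarely collide.

```python
def quantize(point, mins, maxs, bits=16):
    """Normalize each coordinate to [0, 1] and map it onto `bits` bits.

    The resulting integer tuple feeds the z-value hash; with enough bits,
    distinct input points map to distinct integer coordinates.
    """
    scale = (1 << bits) - 1  # largest value representable in `bits` bits
    return tuple(
        int((x - lo) / (hi - lo) * scale) if hi > lo else 0
        for x, lo, hi in zip(point, mins, maxs)
    )

print(quantize((0.25, 0.5), mins=(0.0, 0.0), maxs=(1.0, 1.0), bits=4))  # (3, 7)
```

With a fixed bit budget per dimension the z-value also has a fixed total width 
(bits * dimension), which keeps the hash cheap to compute and compare.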

Any thoughts?



> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-1934
>                 URL: https://issues.apache.org/jira/browse/FLINK-1934
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Daniel Blazevski
>              Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. 
> However, due to the computational costs of an exact implementation, it does 
> not scale well to large amounts of data. Therefore, it is worthwhile to also 
> add an approximative kNN implementation as proposed in [1,2].  Reference [3] 
> is cited a few times in [1], and gives necessary background on the z-value 
> approach.
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)