[
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227632#comment-15227632
]
Daniel Blazevski edited comment on FLINK-1934 at 4/6/16 3:05 AM:
-----------------------------------------------------------------
[~chiwanpark] [~till.rohrmann]
I have a Flink version -- still a bit preliminary -- of the approximate kNN up
and running. The exact kNN using a quadtree performs quite badly in
moderate-to-high spatial dimensions (e.g. with 20,000 test and training points
in 6D, the quadtree is worse, but no worries, I took care of this: the exact
kNN decides whether or not to use the quadtree).
https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala
A preliminary test shows good scaling as the number of test and training
points increases.
8,000 points in 6D (i.e. 8,000 test points and 8,000 training points):
Elapsed time, approximate: 2s
Elapsed time, exact: 27s
64,000 in 6D:
Elapsed time, approximate: 6s
(didn't run the exact version; we know it's O(N^2))
I will have to clean things up, add edge cases, etc., which may slow down the
run-time a bit, but will definitely not increase the complexity of the
algorithm with respect to the number of test/training points.
This still uses a cross product, which I was hoping to avoid, but I'm not sure
that's possible. Any thoughts? Basically, the idea is to hash the test/train
sets to 1D (I use the z-value hash based on [1]).
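For illustration, here is a minimal sketch of the z-value idea (bit-interleaved Morton code), assuming non-negative integer coordinates; the object and method names are hypothetical, not taken from the branch:

```scala
// Hypothetical sketch of a z-value (Morton code) hash as used in [1]/[3]:
// interleave the bits of each coordinate so that points close in space
// tend to get nearby 1D values.
object ZValue {
  // coords: non-negative integer coordinates, one per dimension.
  // bits: number of bits taken from each coordinate.
  // Bit i of dimension d lands at position (i * dims + d) of the result.
  def zValue(coords: Array[Int], bits: Int): Long = {
    val dims = coords.length
    var z = 0L
    for (i <- 0 until bits; d <- 0 until dims) {
      val bit = (coords(d) >> i) & 1
      z |= bit.toLong << (i * dims + d)
    }
    z
  }
}
```

Sorting points by z-value then gives an approximate 1D neighborhood structure, so candidate neighbors can be found in a window around each test point's position in the sorted order.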
I still have not implemented the ideas in [1] in full. The full solution is
quite complex: they do a bunch of load balancing that I'm still learning
about, and I'm not quite sure of the payoff. One option could be that I clean
up and optimize what I have now, since it's already performing well, and we
open a new issue to do all the steps in [1].
There are still many things to clean up, but any cleaning/edge-case handling
will not add computational complexity with respect to the number of test
points. e.g. I currently convert the coordinates to integers and ignore the
decimal part, so there are lots of collisions in the z-value hash; normalizing
the data and using a fixed max number of bits to compute the z-value (this is
described towards the end of [3]) should address that.
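A minimal sketch of that normalization step, assuming each coordinate is scaled into a known [min, max] range and quantized to a fixed number of bits before the z-value is computed; the name and signature here are hypothetical:

```scala
// Hypothetical helper: quantize a double coordinate to a fixed-bit
// non-negative integer, so the z-value is computed on normalized values
// instead of truncated raw coordinates (cf. the end of [3]).
object Quantize {
  def quantize(x: Double, min: Double, max: Double, bits: Int): Int = {
    val levels = (1 << bits) - 1                        // e.g. 255 for 8 bits
    val t = ((x - min) / (max - min)).max(0.0).min(1.0) // normalize, clamp to [0, 1]
    math.round(t * levels).toInt                        // integer in [0, levels]
  }
}
```

Using the full bit budget of each quantized coordinate avoids the collisions that come from simply dropping the decimal part.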
Any thoughts?
> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning
> library
> ----------------------------------------------------------------------------------
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Till Rohrmann
> Assignee: Daniel Blazevski
> Labels: ML
>
> kNN is still a widely used algorithm for classification and regression.
> However, due to the computational costs of an exact implementation, it does
> not scale well to large amounts of data. Therefore, it is worthwhile to also
> add an approximative kNN implementation as proposed in [1,2]. Reference [3]
> is cited a few times in [1], and gives necessary background on the z-value
> approach.
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf