[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15306122#comment-15306122
 ] 

ASF GitHub Bot commented on FLINK-1745:
---------------------------------------

GitHub user danielblazevski opened a pull request:

    https://github.com/apache/flink/pull/2048

    [Flink 1934] Add approximative k-nearest-neighbours (kNN) algorithm to 
machine learning library

    I added approximate knn algorithms.  In another PR, there are two exact 
methods, one basic algorithms using a prirority queue and another using a 
quadtree (see: https://github.com/apache/flink/pull/1220 ).  
    
    For this PR, I added z-value based knn and LSH (Locality Sensitive Hashing) 
based knn.  Z-values are good for low-to-moderate dimension.  For details, see 
the paper [2] someone put on the exact JIRA issue: 
    https://issues.apache.org/jira/browse/FLINK-1745
    https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
    
    As the z-values aren't applicable for larger dimension, so I used -- as the 
paper suggests -- a more standard LSH approach. 
    
    The paper describes a fairly sophisticated MapReduce (MR) design, which I 
did not use.  Using the same MR design pattern as the exact method, I found 
really good performance improvement!  In JIRA, I ran this by @tillrohrmann, and 
he was OK with a less optimized version for now. Here is a link for a talk I 
recently gave on this, which includes links for the video and slides:
    http://www.meetup.com/ny-scala/events/231163636/
    
    Because both the LSH and z-value use the same MR design pattern, I 
reformatted the codebase from the PR for exact version a bit to make it more 
modular.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/danielblazevski/flink FLINK-1934

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2048.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2048
    
----
commit c7e5056c6d273f6f0f841f77e0fdd91ca221602d
Author: Chiwan Park <chiwanp...@apache.org>
Date:   2015-06-30T08:41:25Z

    [FLINK-1745] [ml] Add exact k-nearest-neighbor join

commit 9d0c7942c09086324fadb29bdce749683a0d1a7e
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-15T21:49:05Z

    modified kNN test to familiarize with Flink and KNN.scala

commit 611248e57166dc549f86f805b590dd4e45cb3df5
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-15T21:49:17Z

    modified kNN test to familiarize with Flink and KNN.scala

commit 1fd8231ce194b52b5a1bd55bbc5e135b3fa5775b
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-16T01:26:57Z

    nightly commit, minor changes:  got the filter to work, working on mapping 
the training set to include box lables

commit 15d7d2cb308b23e24c43d103b85a76b0e665cbd3
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-22T02:02:51Z

    commit before incporporating quadtree

commit 8f2da8a66516565c59df8828de2715b45397cb7f
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-22T15:49:25Z

    did a basic import of QuadTree and Test; to-do:  modify QuadTree to allow 
KNN.scala to make use of

commit e1cef2c5aea65c6f204caeff6348e2778231f98d
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-22T21:03:04Z

    transfered ListBuffers for objects in leaf nodes to Vectors

commit c3387ef2ef59734727b56ea652fdb29af957d20b
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-23T00:41:29Z

    basic test on 2D unit box seems to work -- need to generalize, e.g. to 
include automated bounding box

commit 48294ff37a5f800e5111280da5a3c03f4375028d
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-23T15:03:06Z

    had to debug quadtree -- back to testing 2D

commit 6403ba14e240ed8d67a296ac789e7e00dece800d
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-23T15:22:46Z

    Testing 2D looks good, strong improvement in run time compared to 
brute-force method

commit 426466a40bc2625f390fe0d912f56a346e46c8f8
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-23T19:04:52Z

    added automated detection of bounding box based on min/max values of both 
training and test sets

commit c35543b828384aa4ce04d56dfcb3d73db46d1e6d
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-24T00:28:56Z

    added automated radius about test point to define localized neighborhood, 
result runs.  TO-DO:  Lots of tests

commit 8e2d2e78f8533d4192aebe9b4baa7efbfa5928a5
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-24T00:54:06Z

    Note for future:  previous commit passed test of Chiwan Park had in intial 
knn implementation

commit d6fd40cb88d6e198e52c368e829bf7d32d432081
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-24T01:56:38Z

    Note for future:  previous commit passed 3D version of the test that Chiwan 
Park had in the intial knn implementation

commit 0ec1f4866157ca073341672e7fe9a50871ac0b7c
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-24T14:27:20Z

    changed filename of QuadTreeTest to QuadTreeSuite, about to make test more 
comprehensive and similar format to other Flink tests

commit ac81561cad27b65d158ae08fd0fb15bdb51d1c8b
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-24T19:51:32Z

    refactored testing of QuadTree, and added more tests

commit b17f82d5ce0214617c8dbc4a387410057d6f3832
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-24T22:49:10Z

    added KNNBenchmark to check runtimes

commit 530565835d4b5934fcac9e0e51105bb669fec9be
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-24T22:49:17Z

    added KNNBenchmark to check runtimes

commit 1f946cb30450604e92bbd0f5959ce9a60eb4c41b
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-25T01:13:00Z

    fixed bug -- in find siblings, needed to search for minimal bounding boxes

commit 22e4eb7b57795ad1ca4392ca1c1a8bdae76afa8e
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-27T03:34:16Z

    added more thorough benchmark files; about to modify  bounding box to only 
bound training set and modify search for the testing set

commit 3723f6b09ec7d45f6444df70a5f699cbf998a4bb
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-27T03:34:21Z

    added more thorough benchmark files; about to modify  bounding box to only 
bound training set and modify search for the testing set

commit c41d3e1029bf81a37cf3594f202b904e2d99e3ac
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-28T03:53:12Z

    major simplification in choosing the radius to look at nearby neighbors

commit 7c77ea20fd9e8a0c4a33c81b83187c84b380d6b2
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-28T04:03:59Z

    cleaned up; commit before deleting previous sibling search

commit cf4aa5d75611db19040466040c3d29432cb0e5f7
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-09-28T14:02:51Z

    added new devel branch to push temp changes on github

commit ec6ddb0a57136075b4f77616e6e48eb5bcc50a11
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-10-01T00:17:33Z

    cleaned up; removed comments and a unused method

commit 7ed9926d8207b5f59b4ceb968d7ebd732029f5c3
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-10-01T20:28:31Z

    fixed bug in bFiltVect, and renamed to trainingFiltered

commit 4b3bb2ec92396bf754d0d207ffc6853406ce7c39
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-10-01T21:05:46Z

    fixed bug:  if there are fewer than maxPerBox total training points, do not 
do heap construction, just make siblingsQueue = root.objects

commit 1662b38822dfdfbbca272d377fa3b94f8246e9e6
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-10-01T22:03:42Z

    added metric to constructor, and added a flag to test whether it is 
Euclidean or SquaredEuclidean

commit f654b841cc8d91fa861f188469831404c288b227
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-10-01T22:48:51Z

    changed name of test that uses the Quadtree along with KNN -- modified from 
CHiwan Park's test to ensure flag to use Quadtree will pass

commit 7928798281dba5554eeb63df7e67400a42e7a381
Author: danielblazevski <daniel.blazev...@gmail.com>
Date:   2015-10-01T22:54:25Z

    fixed QuadTree test to conform to using a min-heap

----


> Add exact k-nearest-neighbours algorithm to machine learning library
> --------------------------------------------------------------------
>
>                 Key: FLINK-1745
>                 URL: https://issues.apache.org/jira/browse/FLINK-1745
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Daniel Blazevski
>              Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial 
> it is still used as a mean to classify data and to do regression. This issue 
> focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
> proposed in [2].
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to