[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15306122#comment-15306122 ]
ASF GitHub Bot commented on FLINK-1745: --------------------------------------- GitHub user danielblazevski opened a pull request: https://github.com/apache/flink/pull/2048 [Flink 1934] Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library I added approximate knn algorithms. In another PR, there are two exact methods, one basic algorithms using a prirority queue and another using a quadtree (see: https://github.com/apache/flink/pull/1220 ). For this PR, I added z-value based knn and LSH (Locality Sensitive Hashing) based knn. Z-values are good for low-to-moderate dimension. For details, see the paper [2] someone put on the exact JIRA issue: https://issues.apache.org/jira/browse/FLINK-1745 https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf As the z-values aren't applicable for larger dimension, so I used -- as the paper suggests -- a more standard LSH approach. The paper describes a fairly sophisticated MapReduce (MR) design, which I did not use. Using the same MR design pattern as the exact method, I found really good performance improvement! In JIRA, I ran this by @tillrohrmann, and he was OK with a less optimized version for now. Here is a link for a talk I recently gave on this, which includes links for the video and slides: http://www.meetup.com/ny-scala/events/231163636/ Because both the LSH and z-value use the same MR design pattern, I reformatted the codebase from the PR for exact version a bit to make it more modular. You can merge this pull request into a Git repository by running: $ git pull https://github.com/danielblazevski/flink FLINK-1934 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2048.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2048 ---- commit c7e5056c6d273f6f0f841f77e0fdd91ca221602d Author: Chiwan Park <chiwanp...@apache.org> Date: 2015-06-30T08:41:25Z [FLINK-1745] [ml] Add exact k-nearest-neighbor join commit 9d0c7942c09086324fadb29bdce749683a0d1a7e Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-15T21:49:05Z modified kNN test to familiarize with Flink and KNN.scala commit 611248e57166dc549f86f805b590dd4e45cb3df5 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-15T21:49:17Z modified kNN test to familiarize with Flink and KNN.scala commit 1fd8231ce194b52b5a1bd55bbc5e135b3fa5775b Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-16T01:26:57Z nightly commit, minor changes: got the filter to work, working on mapping the training set to include box lables commit 15d7d2cb308b23e24c43d103b85a76b0e665cbd3 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-22T02:02:51Z commit before incporporating quadtree commit 8f2da8a66516565c59df8828de2715b45397cb7f Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-22T15:49:25Z did a basic import of QuadTree and Test; to-do: modify QuadTree to allow KNN.scala to make use of commit e1cef2c5aea65c6f204caeff6348e2778231f98d Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-22T21:03:04Z transfered ListBuffers for objects in leaf nodes to Vectors commit c3387ef2ef59734727b56ea652fdb29af957d20b Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-23T00:41:29Z basic test on 2D unit box seems to work -- need to generalize, e.g. to include automated bounding box commit 48294ff37a5f800e5111280da5a3c03f4375028d Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-23T15:03:06Z had to debug quadtree -- back to testing 2D commit 6403ba14e240ed8d67a296ac789e7e00dece800d Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-23T15:22:46Z Testing 2D looks good, strong improvement in run time compared to brute-force method commit 426466a40bc2625f390fe0d912f56a346e46c8f8 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-23T19:04:52Z added automated detection of bounding box based on min/max values of both training and test sets commit c35543b828384aa4ce04d56dfcb3d73db46d1e6d Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-24T00:28:56Z added automated radius about test point to define localized neighborhood, result runs. TO-DO: Lots of tests commit 8e2d2e78f8533d4192aebe9b4baa7efbfa5928a5 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-24T00:54:06Z Note for future: previous commit passed test of Chiwan Park had in intial knn implementation commit d6fd40cb88d6e198e52c368e829bf7d32d432081 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-24T01:56:38Z Note for future: previous commit passed 3D version of the test that Chiwan Park had in the intial knn implementation commit 0ec1f4866157ca073341672e7fe9a50871ac0b7c Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-24T14:27:20Z changed filename of QuadTreeTest to QuadTreeSuite, about to make test more comprehensive and similar format to other Flink tests commit ac81561cad27b65d158ae08fd0fb15bdb51d1c8b Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-24T19:51:32Z refactored testing of QuadTree, and added more tests commit b17f82d5ce0214617c8dbc4a387410057d6f3832 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-24T22:49:10Z added KNNBenchmark to check runtimes commit 530565835d4b5934fcac9e0e51105bb669fec9be Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-24T22:49:17Z added KNNBenchmark to check runtimes commit 1f946cb30450604e92bbd0f5959ce9a60eb4c41b Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-25T01:13:00Z fixed bug -- in find siblings, needed to search for minimal bounding boxes commit 22e4eb7b57795ad1ca4392ca1c1a8bdae76afa8e Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-27T03:34:16Z added more thorough benchmark files; about to modify bounding box to only bound training set and modify search for the testing set commit 3723f6b09ec7d45f6444df70a5f699cbf998a4bb Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-27T03:34:21Z added more thorough benchmark files; about to modify bounding box to only bound training set and modify search for the testing set commit c41d3e1029bf81a37cf3594f202b904e2d99e3ac Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-28T03:53:12Z major simplification in choosing the radius to look at nearby neighbors commit 7c77ea20fd9e8a0c4a33c81b83187c84b380d6b2 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-28T04:03:59Z cleaned up; commit before deleting previous sibling search commit cf4aa5d75611db19040466040c3d29432cb0e5f7 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-09-28T14:02:51Z added new devel branch to push temp changes on github commit ec6ddb0a57136075b4f77616e6e48eb5bcc50a11 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-10-01T00:17:33Z cleaned up; removed comments and a unused method commit 7ed9926d8207b5f59b4ceb968d7ebd732029f5c3 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-10-01T20:28:31Z fixed bug in bFiltVect, and renamed to trainingFiltered commit 4b3bb2ec92396bf754d0d207ffc6853406ce7c39 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-10-01T21:05:46Z fixed bug: if there are fewer than maxPerBox total training points, do not do heap construction, just make siblingsQueue = root.objects commit 1662b38822dfdfbbca272d377fa3b94f8246e9e6 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-10-01T22:03:42Z added metric to constructor, and added a flag to test whether it is Euclidean or SquaredEuclidean commit f654b841cc8d91fa861f188469831404c288b227 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-10-01T22:48:51Z changed name of test that uses the Quadtree along with KNN -- modified from CHiwan Park's test to ensure flag to use Quadtree will pass commit 7928798281dba5554eeb63df7e67400a42e7a381 Author: danielblazevski <daniel.blazev...@gmail.com> Date: 2015-10-01T22:54:25Z fixed QuadTree test to conform to using a min-heap ---- > Add exact k-nearest-neighbours algorithm to machine learning library > -------------------------------------------------------------------- > > Key: FLINK-1745 > URL: https://issues.apache.org/jira/browse/FLINK-1745 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library > Reporter: Till Rohrmann > Assignee: Daniel Blazevski > Labels: ML, Starter > > Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial > it is still used as a mean to classify data and to do regression. This issue > focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as > proposed in [2]. > Could be a starter task. > Resources: > [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] > [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332)