[ 
https://issues.apache.org/jira/browse/MADLIB-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736405#comment-16736405
 ] 

Frank McQuillan edited comment on MADLIB-1061 at 1/7/19 9:50 PM:
-----------------------------------------------------------------

 [^Sheet3-KNN-tree-depth.pdf]  [^Sheet2-KNN-tree-construction.pdf]  
[^Sheet1-KNN-perf-num-features.pdf] 

Attached are performance results for KNN with kd-tree.  Some observations:

1) For a given tree depth, the speedup from the kd-tree diminishes as the 
number of features grows, until the kd-tree becomes slower than brute force.  
For tree_depth=3, the crossover happens at num_features=9.

2) For a given number of features, tree construction time grows exponentially 
with depth.  For example, with 1M points and num_features=2, it takes an hour 
to build a tree with tree_depth=15.

3) A deeper tree makes the k-NN query itself run faster, but that gain is 
offset by the higher tree construction cost, at least up to tree_depth=6.  For 
deeper trees, tree construction time dominates.
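
To make the role of tree_depth in the observations above concrete, here is a 
minimal single-machine sketch of a depth-limited kd-tree with brute force 
inside each leaf.  It is NOT the MADlib implementation (which runs in-database 
and may search differently); the names `build`, `knn` and `sq_dist` are 
hypothetical and chosen only for illustration.

```python
import heapq
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build(points, depth_left, axis=0):
    """Median-split kd-tree as nested dicts; stop after depth_left levels."""
    if depth_left == 0 or len(points) <= 1:
        return {"leaf": points}          # leaf: searched by brute force
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    nxt = (axis + 1) % len(pts[0])       # cycle through the features
    return {
        "axis": axis,
        "split": pts[mid][axis],
        "left": build(pts[:mid], depth_left - 1, nxt),    # coords <= split
        "right": build(pts[mid:], depth_left - 1, nxt),   # coords >= split
    }


def knn(node, q, k, heap=None):
    """Exact k-NN.  heap holds (-squared_distance, point) for the best k."""
    if heap is None:
        heap = []
    if "leaf" in node:
        for p in node["leaf"]:
            item = (-sq_dist(p, q), p)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:         # closer than the current worst
                heapq.heapreplace(heap, item)
        return heap
    diff = q[node["axis"]] - node["split"]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    knn(near, q, k, heap)
    # Visit the far side only if the splitting plane is within reach.
    if len(heap) < k or diff * diff <= -heap[0][0]:
        knn(far, q, k, heap)
    return heap


# Illustrative data only: 1000 random points with 2 features.
random.seed(1)
data = [(random.random(), random.random()) for _ in range(1000)]
tree = build(data, depth_left=3)
neighbors = sorted((-nd, p) for nd, p in knn(tree, (0.5, 0.5), k=3))
```

With a shallow tree each leaf still holds many points, so most of the work is 
brute force inside the leaves.  As the number of features grows, the pruning 
test on a single coordinate rejects fewer subtrees, which is consistent with 
observation 1.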

Also, the scikit-learn KDTree:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree
https://scikit-learn.org/stable/modules/neighbors.html (section 1.6.4.5 near 
bottom)
has a param `leaf_size`, which is defined as the `number of points at which to 
switch to brute-force.`
Do we need a similar `leaf_size` param, and how does it relate to `tree_depth`?
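
One way to think about that relationship, as a rough sketch: if every split is 
a balanced median split, a tree of depth d has 2^d leaves, so each leaf holds 
about N / 2^d points.  The helper names below are hypothetical, and the 
balanced-split assumption may not match the actual MADlib tree.

```python
import math


def approx_leaf_size(num_points, tree_depth):
    """Points per leaf, assuming balanced median splits at every level."""
    return math.ceil(num_points / 2 ** tree_depth)


def depth_for_leaf_size(num_points, leaf_size):
    """Smallest depth whose balanced leaves hold at most leaf_size points."""
    depth = 0
    while approx_leaf_size(num_points, depth) > leaf_size:
        depth += 1
    return depth


print(approx_leaf_size(1_000_000, 3))      # 125000
print(approx_leaf_size(1_000_000, 15))     # 31
print(depth_for_leaf_size(1_000_000, 40))  # 15
```

Under this assumption, tree_depth=15 on 1M points corresponds to leaves of 
about 31 points, which is close to the scikit-learn default of leaf_size=40.  
tree_depth=3 leaves 125000 points per leaf, so the search is still mostly 
brute force.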

Also:
How should we pick the default for tree_depth?






> Additional computation methods for k-NN
> ---------------------------------------
>
>                 Key: MADLIB-1061
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1061
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: k-NN
>            Reporter: Frank McQuillan
>            Assignee: Orhan Kislal
>            Priority: Major
>              Labels: starter
>             Fix For: v1.16
>
>         Attachments: Sheet1-KNN-perf-num-features.pdf, 
> Sheet2-KNN-tree-construction.pdf, Sheet3-KNN-tree-depth.pdf
>
>
> Follow on to
> https://issues.apache.org/jira/browse/MADLIB-927
> which uses brute force.
> Determine other k-NN algos to implement.  From 
> http://scikit-learn.org/stable/modules/neighbors.html
> candidates are:
> * K-D Tree
> * Ball Tree
> * Other?
> Look at how to implement in a distributed way.  Also may want to revisit 
> current brute force approach to see if there are improvements to make on 
> parallelism - testing is in serial currently.



