[GitHub] [madlib] reductionista opened a new pull request #571: Fast parallel-optimized DBSCAN

GitBox Tue, 06 Jul 2021 00:28:06 -0700


reductionista opened a new pull request #571:
URL: https://github.com/apache/madlib/pull/571



   This is a combination of over 92 commits that have been squashed, and while 
it works and the performance is great, it's still a "work in progress" in that, 
we need to add tests for it, clean up some things, and make sure the brute 
force option still functions properly, along with any other special cases (such 
as different names for id and point columns, etc.).  Also todo:  clean up and 
consolidate commit messages, which for now have just been aggregated and pruned 
a bit.
   
   Opening now so that a PR review can begin for this.
   
     This is a re-write / 2nd attempt at improving over the original
     brute force implementation. This seems to be working now, and
     should have roughly O(N/S log N) runtimes, where N is the size
     of the input dataset (number of points/rows) and S is the number
     of segments used (note:  the algorithm now decides on its own how
     many segments to use, as sometimes splitting things up further
     will only increase the runtime.  Therefore S is not necessarily
     the total number of segments in the cluster, it may be less
     depending on the structure of the dataset.)
   
     This borrows many aspects from the previous attempt (using
     an overlapping spatial binary tree to segment the dataset and
     using an R tree index to speed up range queries), but differs in
     its approach in other ways.
   
     === Rewording of previous summary of the improvements
   
     The brute force DBSCAN runs on N^2 time. To improve this,
     we split the data into different overlapping regions, running
     DBSCAN on each in parallel, then merging the results together
     on the coordinator. More specifically:
   
     1. The data is segmented into connected spatial regions, with
        a binary tree optimized specifically for DBSCAN.
   
     2. Each leaf of the optimized spatial tree runs the dbscan
        algorithm on the points in its spatial region, including
        some points from other regions near its boundaries, using
        an R tree index for efficient range queries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [madlib] reductionista opened a new pull request #571: Fast parallel-optimized DBSCAN

Reply via email to