reductionista opened a new pull request #571:
URL: https://github.com/apache/madlib/pull/571
This is a combination of over 92 commits that have been squashed, and while
it works and the performance is great, it's still a "work in progress" in that,
we need to add tests for it, clean up some things, and make sure the brute
force option still functions properly, along with any other special cases (such
as different names for id and point columns, etc.). Also todo: clean up and
consolidate commit messages, which for now have just been aggregated and pruned
a bit.
Opening now so that a PR review can begin for this.
This is a re-write / 2nd attempt at improving over the original
brute force implementation. This seems to be working now, and
should have roughly O(N/S log N) runtimes, where N is the size
of the input dataset (number of points/rows) and S is the number
of segments used (note: the algorithm now decides on its own how
many segments to use, as sometimes splitting things up further
will only increase the runtime. Therefore S is not necessarily
the total number of segments in the cluster, it may be less
depending on the structure of the dataset.)
This borrows many aspects from the previous attempt (using
an overlapping spatial binary tree to segment the dataset and
using an R tree index to speed up range queries), but differs in
its approach in other ways.
=== Rewording of previous summary of the improvements
The brute force DBSCAN runs on N^2 time. To improve this,
we split the data into different overlapping regions, running
DBSCAN on each in parallel, then merging the results together
on the coordinator. More specifically:
1. The data is segmented into connected spatial regions, with
a binary tree optimized specifically for DBSCAN.
2. Each leaf of the optimized spatial tree runs the dbscan
algorithm on the points in its spatial region, including
some points from other regions near its boundaries, using
an R tree index for efficient range queries.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]