[jira] [Commented] (MATH-1367) DBSCAN Implementation does not count the seed point itself as part of its neighbors count

Amol Singh (JIRA) Sat, 21 May 2016 09:17:59 -0700

    [ 
https://issues.apache.org/jira/browse/MATH-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15295112#comment-15295112
 ]


Amol Singh commented on MATH-1367:
----------------------------------

Hi Giles, 

Thanks for the quick response. 

Here's a simple breaking test, you should be able to add this to 
DBSCANClustererTest.java and run it. 

 @Test public void testBreakingCase() {
        final DoublePoint[] points = { new DoublePoint(new double[] { 
21.345289965479466, 32.149537670215295 }),
                new DoublePoint(new double[] { 42.02226837841131, 
19.69611946377732 }),
                new DoublePoint(new double[] { 41.930019221367985, 
21.954397109962457 }),
                new DoublePoint(new double[] { 42.87345060400816, 
18.796775009724836 }) };

        final DBSCANClusterer<DoublePoint> clusterer = new 
DBSCANClusterer<DoublePoint>(3, 3);
        List<Cluster<DoublePoint>> clusters = 
clusterer.cluster(Arrays.asList(points));
        Assert.assertEquals(1, clusters.size());
        final List<DoublePoint> clusterOne = Arrays.asList(points[1], 
points[2], points[3]);
        Assert.assertTrue(clusters.get(0).getPoints().containsAll(clusterOne));
    }

Now, if you set minPts=2 or change the code to remove the reference check of 
the point to itself in getNeighbors() this will pass. 

Let me know if you agree this is a bug. I'm happy to submit a patch. This 
change does change the behavior for existing users, the algorithm will produce 
more clusters / shape of clusters may change. Nonetheless I feel this would be 
the correct interpretation of the DBSCAN paper, and it seems consistent with 
other third party libraries I've evaluated.

> DBSCAN Implementation does not count the seed point itself as part of its 
> neighbors count
> -----------------------------------------------------------------------------------------
>
>                 Key: MATH-1367
>                 URL: https://issues.apache.org/jira/browse/MATH-1367
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 3.6.1
>            Reporter: Amol Singh
>             Fix For: 4.0
>
>
> The DSCAN paper describes the eps-neighborhood of a point as 
> https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf (Page 2)
> Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point 
> p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p,q)< Eps} 
> in other words for all q points that are a member of database D whose 
> distance from p is less that Eps should be classified as a neighbor. This 
> should include the point itself. 
> The implementation however has a reference check to the point itself and does 
> not add it to its neighbors list.
> private List<T> getNeighbors(final T point, final Collection<T> points) {
>         final List<T> neighbors = new ArrayList<T>();
>         for (final T neighbor : points) {
>             if (point != neighbor && distance(neighbor, point) <= eps) {
>                 neighbors.add(neighbor);
>             }
>         }
>         return neighbors;
>     } 
> "point != neighbor "  check should be removed here. Keeping this check 
> effectively is raising the minPts count by 1. Other third party QuadTree 
> backed DBSCAN implementations consider the center point in its neighbor count 
> E.g. bmw-carit library. 
> If this is infact by design, the check should use value equality instead of 
> reference equality. T extends Clusterable<T> , the client should be able to 
> define this behavior. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MATH-1367) DBSCAN Implementation does not count the seed point itself as part of its neighbors count

Reply via email to