[jira] [Commented] (MATH-1367) DBSCAN Implementation does not count the seed point itself as part of its neighbors count
[ https://issues.apache.org/jira/browse/MATH-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17471246#comment-17471246 ] Gilles Sadowski commented on MATH-1367: --- Will you provide a patch? > DBSCAN Implementation does not count the seed point itself as part of its > neighbors count > - > > Key: MATH-1367 > URL: https://issues.apache.org/jira/browse/MATH-1367 > Project: Commons Math > Issue Type: Bug >Affects Versions: 3.6.1 >Reporter: Amol Singh >Priority: Major > Fix For: 4.0 > > > The DSCAN paper describes the eps-neighborhood of a point as > https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf (Page 2) > Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point > p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p,q)< Eps} > in other words for all q points that are a member of database D whose > distance from p is less that Eps should be classified as a neighbor. This > should include the point itself. > The implementation however has a reference check to the point itself and does > not add it to its neighbors list. > private List getNeighbors(final T point, final Collection points) { > final List neighbors = new ArrayList(); > for (final T neighbor : points) { > if (point != neighbor && distance(neighbor, point) <= eps) { > neighbors.add(neighbor); > } > } > return neighbors; > } > "point != neighbor " check should be removed here. Keeping this check > effectively is raising the minPts count by 1. Other third party QuadTree > backed DBSCAN implementations consider the center point in its neighbor count > E.g. bmw-carit library. > If this is infact by design, the check should use value equality instead of > reference equality. T extends Clusterable , the client should be able to > define this behavior. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (MATH-1367) DBSCAN Implementation does not count the seed point itself as part of its neighbors count
[ https://issues.apache.org/jira/browse/MATH-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581926#comment-16581926 ] wang qiang commented on MATH-1367: -- I agree with you > DBSCAN Implementation does not count the seed point itself as part of its > neighbors count > - > > Key: MATH-1367 > URL: https://issues.apache.org/jira/browse/MATH-1367 > Project: Commons Math > Issue Type: Bug >Affects Versions: 3.6.1 >Reporter: Amol Singh >Priority: Major > Fix For: 4.0 > > > The DSCAN paper describes the eps-neighborhood of a point as > https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf (Page 2) > Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point > p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p,q)< Eps} > in other words for all q points that are a member of database D whose > distance from p is less that Eps should be classified as a neighbor. This > should include the point itself. > The implementation however has a reference check to the point itself and does > not add it to its neighbors list. > private List getNeighbors(final T point, final Collection points) { > final List neighbors = new ArrayList(); > for (final T neighbor : points) { > if (point != neighbor && distance(neighbor, point) <= eps) { > neighbors.add(neighbor); > } > } > return neighbors; > } > "point != neighbor " check should be removed here. Keeping this check > effectively is raising the minPts count by 1. Other third party QuadTree > backed DBSCAN implementations consider the center point in its neighbor count > E.g. bmw-carit library. > If this is infact by design, the check should use value equality instead of > reference equality. T extends Clusterable , the client should be able to > define this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MATH-1367) DBSCAN Implementation does not count the seed point itself as part of its neighbors count
[ https://issues.apache.org/jira/browse/MATH-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323717#comment-16323717 ] jiaopaner commented on MATH-1367: - Hello,Amol I quite agree with your opinion .I also found this problem when I use the DBSCAN algorithm.The version I used is apache-commons -math-3.6.1. So the problem still exists,and I also found other problems in addition to the above problem.The time complexity of the algorithm is always n^2 when I test.Besides that,there are many no necessary operations in the algorithm implementation.BTW,why only DoublePoint, no StringPoint, so I can't use Levenshtein distance to search the adjacency points.I'll create a detailed bug report . > DBSCAN Implementation does not count the seed point itself as part of its > neighbors count > - > > Key: MATH-1367 > URL: https://issues.apache.org/jira/browse/MATH-1367 > Project: Commons Math > Issue Type: Bug >Affects Versions: 3.6.1 >Reporter: Amol Singh > Fix For: 4.0 > > > The DSCAN paper describes the eps-neighborhood of a point as > https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf (Page 2) > Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point > p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p,q)< Eps} > in other words for all q points that are a member of database D whose > distance from p is less that Eps should be classified as a neighbor. This > should include the point itself. > The implementation however has a reference check to the point itself and does > not add it to its neighbors list. > private List getNeighbors(final T point, final Collection points) { > final List neighbors = new ArrayList(); > for (final T neighbor : points) { > if (point != neighbor && distance(neighbor, point) <= eps) { > neighbors.add(neighbor); > } > } > return neighbors; > } > "point != neighbor " check should be removed here. Keeping this check > effectively is raising the minPts count by 1. Other third party QuadTree > backed DBSCAN implementations consider the center point in its neighbor count > E.g. bmw-carit library. > If this is infact by design, the check should use value equality instead of > reference equality. T extends Clusterable , the client should be able to > define this behavior. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MATH-1367) DBSCAN Implementation does not count the seed point itself as part of its neighbors count
[ https://issues.apache.org/jira/browse/MATH-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298255#comment-15298255 ] Amol Singh commented on MATH-1367: -- Okay, I'll do that. https://en.wikipedia.org/wiki/DBSCAN#cite_note-dbscan-1 Not the most reliable source but if you look at the pseudocode, thats how others have interpreted this algorithm as well. {quote} regionQuery(P, eps) return all points within P's eps-neighborhood (including P) {quote} I'll submit a patch. > DBSCAN Implementation does not count the seed point itself as part of its > neighbors count > - > > Key: MATH-1367 > URL: https://issues.apache.org/jira/browse/MATH-1367 > Project: Commons Math > Issue Type: Bug >Affects Versions: 3.6.1 >Reporter: Amol Singh > Fix For: 4.0 > > > The DSCAN paper describes the eps-neighborhood of a point as > https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf (Page 2) > Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point > p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p,q)< Eps} > in other words for all q points that are a member of database D whose > distance from p is less that Eps should be classified as a neighbor. This > should include the point itself. > The implementation however has a reference check to the point itself and does > not add it to its neighbors list. > private List getNeighbors(final T point, final Collection points) { > final List neighbors = new ArrayList(); > for (final T neighbor : points) { > if (point != neighbor && distance(neighbor, point) <= eps) { > neighbors.add(neighbor); > } > } > return neighbors; > } > "point != neighbor " check should be removed here. Keeping this check > effectively is raising the minPts count by 1. Other third party QuadTree > backed DBSCAN implementations consider the center point in its neighbor count > E.g. bmw-carit library. > If this is infact by design, the check should use value equality instead of > reference equality. T extends Clusterable , the client should be able to > define this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1367) DBSCAN Implementation does not count the seed point itself as part of its neighbors count
[ https://issues.apache.org/jira/browse/MATH-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298132#comment-15298132 ] Gilles commented on MATH-1367: -- bq. Let me know if you agree this is a bug. I don't know. :( If you are positive that the algorithm was not correctly implemented, please do submit a patch with code comments that explain why the change was necessary. It would also be nice to set up a unit test showing a practical case that this implementation agrees with the result computed by another implementation. Thanks. > DBSCAN Implementation does not count the seed point itself as part of its > neighbors count > - > > Key: MATH-1367 > URL: https://issues.apache.org/jira/browse/MATH-1367 > Project: Commons Math > Issue Type: Bug >Affects Versions: 3.6.1 >Reporter: Amol Singh > Fix For: 4.0 > > > The DSCAN paper describes the eps-neighborhood of a point as > https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf (Page 2) > Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point > p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p,q)< Eps} > in other words for all q points that are a member of database D whose > distance from p is less that Eps should be classified as a neighbor. This > should include the point itself. > The implementation however has a reference check to the point itself and does > not add it to its neighbors list. > private List getNeighbors(final T point, final Collection points) { > final List neighbors = new ArrayList(); > for (final T neighbor : points) { > if (point != neighbor && distance(neighbor, point) <= eps) { > neighbors.add(neighbor); > } > } > return neighbors; > } > "point != neighbor " check should be removed here. Keeping this check > effectively is raising the minPts count by 1. Other third party QuadTree > backed DBSCAN implementations consider the center point in its neighbor count > E.g. bmw-carit library. > If this is infact by design, the check should use value equality instead of > reference equality. T extends Clusterable , the client should be able to > define this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1367) DBSCAN Implementation does not count the seed point itself as part of its neighbors count
[ https://issues.apache.org/jira/browse/MATH-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295112#comment-15295112 ] Amol Singh commented on MATH-1367: -- Hi Giles, Thanks for the quick response. Here's a simple breaking test, you should be able to add this to DBSCANClustererTest.java and run it. @Test public void testBreakingCase() { final DoublePoint[] points = { new DoublePoint(new double[] { 21.345289965479466, 32.149537670215295 }), new DoublePoint(new double[] { 42.02226837841131, 19.69611946377732 }), new DoublePoint(new double[] { 41.930019221367985, 21.954397109962457 }), new DoublePoint(new double[] { 42.87345060400816, 18.796775009724836 }) }; final DBSCANClusterer clusterer = new DBSCANClusterer(3, 3); Listclusters = clusterer.cluster(Arrays.asList(points)); Assert.assertEquals(1, clusters.size()); final List clusterOne = Arrays.asList(points[1], points[2], points[3]); Assert.assertTrue(clusters.get(0).getPoints().containsAll(clusterOne)); } Now, if you set minPts=2 or change the code to remove the reference check of the point to itself in getNeighbors() this will pass. Let me know if you agree this is a bug. I'm happy to submit a patch. This change does change the behavior for existing users, the algorithm will produce more clusters / shape of clusters may change. Nonetheless I feel this would be the correct interpretation of the DBSCAN paper, and it seems consistent with other third party libraries I've evaluated. > DBSCAN Implementation does not count the seed point itself as part of its > neighbors count > - > > Key: MATH-1367 > URL: https://issues.apache.org/jira/browse/MATH-1367 > Project: Commons Math > Issue Type: Bug >Affects Versions: 3.6.1 >Reporter: Amol Singh > Fix For: 4.0 > > > The DSCAN paper describes the eps-neighborhood of a point as > https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf (Page 2) > Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point > p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p,q)< Eps} > in other words for all q points that are a member of database D whose > distance from p is less that Eps should be classified as a neighbor. This > should include the point itself. > The implementation however has a reference check to the point itself and does > not add it to its neighbors list. > private List getNeighbors(final T point, final Collection points) { > final List neighbors = new ArrayList(); > for (final T neighbor : points) { > if (point != neighbor && distance(neighbor, point) <= eps) { > neighbors.add(neighbor); > } > } > return neighbors; > } > "point != neighbor " check should be removed here. Keeping this check > effectively is raising the minPts count by 1. Other third party QuadTree > backed DBSCAN implementations consider the center point in its neighbor count > E.g. bmw-carit library. > If this is infact by design, the check should use value equality instead of > reference equality. T extends Clusterable , the client should be able to > define this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MATH-1367) DBSCAN Implementation does not count the seed point itself as part of its neighbors count
[ https://issues.apache.org/jira/browse/MATH-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294842#comment-15294842 ] Gilles commented on MATH-1367: -- Hi Amol. Thanks for your report. >From reading the problem description it is not obvious to figure out whether >your proposed fix won't have any unwanted side-effects (e.g. it could be that >the "getNeighbors" was not meant to exactly match the definition in the >reference you cite; maybe you are totally right, I'm just guessing since I >never looked at that code...). Could you provide a unit test showing that it is indeed a bug (i.e. a case where the best solution cannot be recovered unless the fix is applied)? > DBSCAN Implementation does not count the seed point itself as part of its > neighbors count > - > > Key: MATH-1367 > URL: https://issues.apache.org/jira/browse/MATH-1367 > Project: Commons Math > Issue Type: Bug >Affects Versions: 3.6.1 >Reporter: Amol Singh > Fix For: 4.0 > > > The DSCAN paper describes the eps-neighborhood of a point as > https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf (Page 2) > Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point > p, denoted by NEps(p), is defined by NEps(p) = {q ∈ D | dist(p,q)< Eps} > in other words for all q points that are a member of database D whose > distance from p is less that Eps should be classified as a neighbor. This > should include the point itself. > The implementation however has a reference check to the point itself and does > not add it to its neighbors list. > private List getNeighbors(final T point, final Collection points) { > final List neighbors = new ArrayList(); > for (final T neighbor : points) { > if (point != neighbor && distance(neighbor, point) <= eps) { > neighbors.add(neighbor); > } > } > return neighbors; > } > "point != neighbor " check should be removed here. Keeping this check > effectively is raising the minPts count by 1. Other third party QuadTree > backed DBSCAN implementations consider the center point in its neighbor count > E.g. bmw-carit library. > If this is infact by design, the check should use value equality instead of > reference equality. T extends Clusterable , the client should be able to > define this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)