fmcquillan99 edited a comment on pull request #496:
URL: https://github.com/apache/madlib/pull/496#issuecomment-624228952
(5)
To be consistent with knn, can we please make `brute_force` the base
algorithm name? It seems `brute` is accepted but `brute_force` is not:
```
SELECT madlib.dbscan(
    'dbscan_train_data',  -- source table
    'dbscan_result',      -- output table
    'pid',                -- point id column
    'pointsxx',           -- data point
    1.75,                 -- epsilon
    4,                    -- min samples
    'dist_norm2',         -- metric
    'brute_force');       -- algorithm
ERROR: plpy.Error: dbscan Error: algorithm has to be one of the following: brute, kd-tree (plpython.c:5038)
CONTEXT: Traceback (most recent call last):
PL/Python function "dbscan", line 21, in <module>
return dbscan.dbscan(**globals())
PL/Python function "dbscan", line 38, in dbscan
PL/Python function "dbscan", line 184, in _validate_dbscan
PL/Python function "dbscan", line 123, in _assert
PL/Python function "dbscan"
```
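One way the validation could accept both spellings is sketched below. This is only an illustration: `normalize_algorithm`, `CANONICAL_ALGORITHMS`, and `ALGORITHM_ALIASES` are hypothetical names, not the actual code in `_validate_dbscan`, and keeping `brute` as an alias is an assumption about what backward compatibility might look like.

```python
# Hypothetical sketch; names below are illustrative, not MADlib's actual code.
CANONICAL_ALGORITHMS = ("brute_force", "kd-tree")  # "kd-tree" taken from the error message
ALGORITHM_ALIASES = {"brute": "brute_force"}       # assumed alias for the old spelling


def normalize_algorithm(name):
    """Map a user-supplied algorithm name to its canonical form or raise."""
    key = name.strip().lower()
    key = ALGORITHM_ALIASES.get(key, key)
    if key not in CANONICAL_ALGORITHMS:
        raise ValueError(
            "dbscan Error: algorithm has to be one of the following: "
            + ", ".join(CANONICAL_ALGORITHMS)
        )
    return key
```

With this shape, `normalize_algorithm('brute_force')` and `normalize_algorithm('brute')` both return `'brute_force'`, and the error message advertises the canonical name.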
(6)
It seems the column name `points` does not work; maybe it conflicts with
an internal name?
I think it could be a common column name, so we should fix this.
```
SELECT madlib.dbscan(
    'dbscan_train_data',  -- source table
    'dbscan_result',      -- output table
    'pid',                -- point id column
    'points',             -- data point
    1.75,                 -- epsilon
    4,                    -- min samples
    'dist_norm2',         -- metric
    'brute');             -- algorithm
ERROR: plpy.SPIError: column reference "points" is ambiguous
LINE 3: SET points = points
^
QUERY:
UPDATE dbscan_result AS t1
SET points = points
FROM dbscan_train_data AS t2
WHERE t1.pid = t2.pid
CONTEXT: Traceback (most recent call last):
PL/Python function "dbscan", line 21, in <module>
return dbscan.dbscan(**globals())
PL/Python function "dbscan", line 109, in dbscan
PL/Python function "dbscan"
```
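The ambiguity comes from the unqualified `points` on the right-hand side of `SET`, which matches a column in both `t1` and `t2`. A minimal sketch of a fix is to qualify the source column with the `t2` alias when building the query. `build_copy_points_query` is a hypothetical helper, not the function in the PR, and it assumes the identifiers have already been validated or quoted elsewhere.

```python
# Hypothetical sketch; the helper name and signature are illustrative only.
def build_copy_points_query(output_table, source_table, id_col, data_col):
    """Build the UPDATE that copies the data column into the output table.

    The SET target stays unqualified (PostgreSQL does not allow a table
    name on the target column), while the source column is qualified with
    the t2 alias so it cannot clash with the output table's own column.
    """
    return (
        "UPDATE {out} AS t1\n"
        "SET points = t2.{data}\n"
        "FROM {src} AS t2\n"
        "WHERE t1.{id} = t2.{id}"
    ).format(out=output_table, src=source_table, data=data_col, id=id_col)


query = build_copy_points_query(
    "dbscan_result", "dbscan_train_data", "pid", "points")
```

Here `query` contains `SET points = t2.points`, which stays unambiguous whatever the user's data column is called.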
(7)
Please check the logic against the following example from
https://dbs.ifi.uni-heidelberg.de/files/Team/eschubert/lectures/KDDClusterAnalysis17-screen.pdf#page=215
I think only 3 clusters should be created, but the output below has 6:
```
DROP TABLE IF EXISTS dbscan_train_data;
CREATE TABLE dbscan_train_data (pid int, pointsxx double precision[]);
INSERT INTO dbscan_train_data VALUES
(1, '{1, 1}'),
(2, '{2, 1}'),
(3, '{1, 2}'),
(4, '{2, 2}'),
(5, '{3, 5}'),
(6, '{3, 9}'),
(7, '{3, 10}'),
(8, '{4, 10}'),
(9, '{4, 11}'),
(10, '{5, 10}'),
(11, '{7, 10}'),
(12, '{10, 9}'),
(13, '{10, 6}'),
(14, '{9, 5}'),
(15, '{10, 5}'),
(16, '{11, 5}'),
(17, '{9, 4}'),
(18, '{10, 4}'),
(19, '{11, 4}'),
(20, '{10, 3}');
DROP TABLE IF EXISTS dbscan_result;
SELECT madlib.dbscan(
    'dbscan_train_data',  -- source table
    'dbscan_result',      -- output table
    'pid',                -- point id column
    'pointsxx',           -- data point
    1.75,                 -- epsilon
    4,                    -- min samples
    'dist_norm2',         -- metric
    'brute');             -- algorithm
SELECT madlib.dbscan(
    'dbscan_train_data',  -- source table
    'dbscan_result',      -- output table
    'pid',                -- point id column
    'points',             -- data point
    1.75,                 -- epsilon
    4,                    -- min samples
    'dist_norm2',         -- metric
    'brute_force');       -- algorithm
SELECT * FROM dbscan_result ORDER BY pid;
 pid | cluster_id | is_core_point | points
-----+------------+---------------+--------
   1 |          0 | t             | {1,1}
   2 |          0 | t             | {2,1}
   3 |          0 | t             | {1,2}
   4 |          0 | t             | {2,2}
   6 |          1 | f             | {3,9}
   7 |          1 | t             | {3,10}
   8 |          1 | t             | {4,10}
   9 |          1 | t             | {4,11}
  10 |          1 | f             | {5,10}
  13 |          2 | t             | {10,6}
  14 |          2 | t             | {9,5}
  15 |          2 | t             | {10,5}
  16 |          2 | t             | {11,5}
  17 |          3 | t             | {9,4}
  18 |          3 | t             | {10,4}
  19 |          4 | t             | {11,4}
  20 |          5 | t             | {10,3}
(17 rows)
```
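For cross-checking, here is a small brute-force DBSCAN in plain Python run on the same 20 points. It is a reference sketch, not the PR's implementation. It assumes a point's neighborhood includes the point itself and uses Euclidean distance `<= eps`, which is consistent with points 1 to 4 being reported as core above. `dbscan_brute` and `POINTS` are names made up for this example.

```python
from collections import deque
from math import dist  # Python 3.8+

# Same data as dbscan_train_data above: pid -> coordinates.
POINTS = {
    1: (1, 1), 2: (2, 1), 3: (1, 2), 4: (2, 2), 5: (3, 5),
    6: (3, 9), 7: (3, 10), 8: (4, 10), 9: (4, 11), 10: (5, 10),
    11: (7, 10), 12: (10, 9), 13: (10, 6), 14: (9, 5), 15: (10, 5),
    16: (11, 5), 17: (9, 4), 18: (10, 4), 19: (11, 4), 20: (10, 3),
}


def dbscan_brute(points, eps, min_samples):
    """Return {pid: cluster_id}; noise points are left out of the result."""
    ids = sorted(points)
    # All-pairs neighborhoods; each point counts as its own neighbor.
    neighbors = {
        p: [q for q in ids if dist(points[p], points[q]) <= eps] for p in ids
    }
    core = {p for p in ids if len(neighbors[p]) >= min_samples}

    labels = {}
    next_id = 0
    for p in ids:
        if p not in core or p in labels:
            continue
        # Grow a new cluster from this unassigned core point.
        labels[p] = next_id
        queue = deque([p])
        while queue:
            current = queue.popleft()
            for q in neighbors[current]:
                if q not in labels:
                    labels[q] = next_id
                    if q in core:  # only core points keep expanding
                        queue.append(q)
        next_id += 1
    return labels


labels = dbscan_brute(POINTS, eps=1.75, min_samples=4)
```

Under these assumptions the sketch yields 3 clusters: pids 1 to 4, pids 6 to 10, and pids 13 to 20, with 5, 11, and 12 as noise. Every point from 13 to 20 is a core point chained to its neighbors within epsilon, so they should all share one cluster id rather than the four ids (2 to 5) shown above.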