[GitHub] [madlib] fmcquillan99 edited a comment on pull request #496: DBSCAN: Add new module DBSCAN

GitBox Tue, 05 May 2020 11:47:46 -0700


fmcquillan99 edited a comment on pull request #496:
URL: https://github.com/apache/madlib/pull/496#issuecomment-624228952



   (5)
   To be consistent with knn, can we please make `brute_force` the base
algorithm name? Currently `brute` is accepted but `brute_force` is not:
   ```
   SELECT madlib.dbscan(
       'dbscan_train_data',    -- source table
       'dbscan_result',        -- output table
       'pid',                  -- point id column
       'pointsxx',             -- data point
       1.75,                   -- epsilon
       4,                      -- min samples
       'dist_norm2',           -- metric
       'brute_force');         -- algorithm
   
   ERROR:  plpy.Error: dbscan Error: algorithm has to be one of the following: brute, kd-tree (plpython.c:5038)
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "dbscan", line 21, in <module>
       return dbscan.dbscan(**globals())
     PL/Python function "dbscan", line 38, in dbscan
     PL/Python function "dbscan", line 184, in _validate_dbscan
     PL/Python function "dbscan", line 123, in _assert
   PL/Python function "dbscan"
   ```
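
One way to accept both spellings is an alias table consulted during validation. This is only a sketch: `ALGORITHM_ALIASES` and `normalize_algorithm` are hypothetical names, not code from the PR, and treating `kd_tree` as the canonical tree spelling is an assumption.

```python
# Hypothetical alias table; canonical names are an assumption, not the PR's code.
ALGORITHM_ALIASES = {
    "brute_force": "brute_force",
    "brute": "brute_force",   # keep the spelling the PR currently accepts
    "kd_tree": "kd_tree",
    "kd-tree": "kd_tree",     # spelling shown in the current error message
}


def normalize_algorithm(name):
    """Map a user-supplied algorithm name to its canonical form."""
    key = name.strip().lower()
    if key not in ALGORITHM_ALIASES:
        allowed = ", ".join(sorted(set(ALGORITHM_ALIASES.values())))
        raise ValueError(
            "dbscan Error: algorithm has to be one of the following: " + allowed
        )
    return ALGORITHM_ALIASES[key]
```

With something like this, `'brute'` and `'brute_force'` would both resolve to the same code path, so existing calls would keep working.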
   
   
   (6) 
   It seems the column name `points` does not work; maybe it conflicts with
an internal name?
   I think it could be a common column name, so we should fix this.
   ```
   SELECT madlib.dbscan(
       'dbscan_train_data',    -- source table
       'dbscan_result',        -- output table
       'pid',                  -- point id column
       'points',               -- data point
       1.75,                   -- epsilon
       4,                      -- min samples
       'dist_norm2',           -- metric
       'brute');               -- algorithm
   
   ERROR:  plpy.SPIError: column reference "points" is ambiguous
   LINE 3:             SET points = points
                                    ^
   QUERY:  
               UPDATE dbscan_result AS t1
               SET points = points
               FROM dbscan_train_data AS t2
               WHERE t1.pid = t2.pid
           
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "dbscan", line 21, in <module>
       return dbscan.dbscan(**globals())
     PL/Python function "dbscan", line 109, in dbscan
   PL/Python function "dbscan"
   ```
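
The ambiguity comes from the right-hand side of `SET points = points`: both `t1` and `t2` have a `points` column when the source column is also called `points`. Qualifying the right-hand side with `t2` should resolve it (PostgreSQL does not allow a table qualifier on the `SET` target itself). A minimal sketch of the query construction, where `build_update_query` is a hypothetical helper and the fixed output column name `points` is inferred from the error above:

```python
def build_update_query(output_table, source_table, id_column, point_column):
    """Build the UPDATE that copies the point data into the output table.

    Hypothetical helper: the right-hand side is qualified with the source
    alias t2 so it stays unambiguous even when point_column is 'points'.
    Identifier quoting is omitted for brevity.
    """
    return (
        "UPDATE {out} AS t1\n"
        "SET points = t2.{col}\n"
        "FROM {src} AS t2\n"
        "WHERE t1.{id} = t2.{id}"
    ).format(out=output_table, src=source_table, id=id_column, col=point_column)


query = build_update_query("dbscan_result", "dbscan_train_data", "pid", "points")
print(query)
```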
   
   (7)
   Please check the logic with the following example from
   
https://dbs.ifi.uni-heidelberg.de/files/Team/eschubert/lectures/KDDClusterAnalysis17-screen.pdf#page=215
   
   I think only 3 clusters should be created, but the output below shows 6
(cluster_id 0 through 5).
   
   ```
   DROP TABLE IF EXISTS dbscan_train_data;
   CREATE TABLE dbscan_train_data (pid int, pointsxx double precision[]);
   INSERT INTO dbscan_train_data VALUES
   (1,  '{1, 1}'),
   (2,  '{2, 1}'),
   (3,  '{1, 2}'),
   (4,  '{2, 2}'),
   (5,  '{3, 5}'),
   (6,  '{3, 9}'),
   (7,  '{3, 10}'),
   (8,  '{4, 10}'),
   (9,  '{4, 11}'),
   (10,  '{5, 10}'),
   (11,  '{7, 10}'),
   (12,  '{10, 9}'),
   (13,  '{10, 6}'),
   (14,  '{9, 5}'),
   (15,  '{10, 5}'),
   (16,  '{11, 5}'),
   (17,  '{9, 4}'),
   (18,  '{10, 4}'),
   (19,  '{11, 4}'),
   (20,  '{10, 3}');
   
   DROP TABLE IF EXISTS dbscan_result;
   SELECT madlib.dbscan(
       'dbscan_train_data',    -- source table
       'dbscan_result',        -- output table
       'pid',                  -- point id column
       'pointsxx',             -- data point
       1.75,                   -- epsilon
       4,                      -- min samples
       'dist_norm2',           -- metric
       'brute');               -- algorithm
   
   SELECT madlib.dbscan(
       'dbscan_train_data',    -- source table
       'dbscan_result',        -- output table
       'pid',                  -- point id column
       'points',               -- data point
       1.75,                   -- epsilon
       4,                      -- min samples
       'dist_norm2',           -- metric
       'brute_force');         -- algorithm
   
   SELECT * FROM dbscan_result ORDER BY pid;
   
    pid | cluster_id | is_core_point | points 
   -----+------------+---------------+--------
      1 |          0 | t             | {1,1}
      2 |          0 | t             | {2,1}
      3 |          0 | t             | {1,2}
      4 |          0 | t             | {2,2}
      6 |          1 | f             | {3,9}
      7 |          1 | t             | {3,10}
      8 |          1 | t             | {4,10}
      9 |          1 | t             | {4,11}
     10 |          1 | f             | {5,10}
     13 |          2 | t             | {10,6}
     14 |          2 | t             | {9,5}
     15 |          2 | t             | {10,5}
     16 |          2 | t             | {11,5}
     17 |          3 | t             | {9,4}
     18 |          3 | t             | {10,4}
     19 |          4 | t             | {11,4}
     20 |          5 | t             | {10,3}
   (17 rows)
   ```
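
As a cross-check on the expected cluster count, here is a small brute-force DBSCAN in plain Python run on the same 20 points. It is an independent sketch, not the module's code. It assumes Euclidean distance (matching `dist_norm2`), counts a point as its own neighbor, and treats `min_samples` as the minimum neighborhood size for a core point. The name `dbscan_reference` is hypothetical.

```python
from collections import deque
from math import dist  # Python 3.8+

POINTS = {
    1: (1, 1), 2: (2, 1), 3: (1, 2), 4: (2, 2), 5: (3, 5),
    6: (3, 9), 7: (3, 10), 8: (4, 10), 9: (4, 11), 10: (5, 10),
    11: (7, 10), 12: (10, 9), 13: (10, 6), 14: (9, 5), 15: (10, 5),
    16: (11, 5), 17: (9, 4), 18: (10, 4), 19: (11, 4), 20: (10, 3),
}


def dbscan_reference(points, eps, min_samples):
    """Return ({pid: cluster_id}, set of core pids); noise pids are omitted."""
    # Brute-force neighborhoods; each point is in its own neighborhood.
    neighbors = {
        p: [q for q in points if dist(points[p], points[q]) <= eps]
        for p in points
    }
    core = {p for p, n in neighbors.items() if len(n) >= min_samples}

    labels = {}
    cluster_id = 0
    for p in sorted(points):
        if p not in core or p in labels:
            continue
        # Grow a new cluster from an unlabeled core point.
        labels[p] = cluster_id
        queue = deque([p])
        while queue:
            current = queue.popleft()
            for q in neighbors[current]:
                if q not in labels:
                    labels[q] = cluster_id
                    if q in core:  # only core points extend the cluster
                        queue.append(q)
        cluster_id += 1
    return labels, core


labels, core = dbscan_reference(POINTS, eps=1.75, min_samples=4)
noise = sorted(set(POINTS) - set(labels))
print(len(set(labels.values())), noise)  # prints: 3 [5, 11, 12]
```

Under those assumptions the sketch puts pids 1-4, 6-10 and 13-20 into three clusters and leaves 5, 11 and 12 as noise. That agrees with the 17 rows above but not with the six cluster ids, which suggests the last cluster is being split.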
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
