Frank McQuillan created MADLIB-1380:
---------------------------------------

             Summary: Select number of centroids in k-means
                 Key: MADLIB-1380
                 URL: https://issues.apache.org/jira/browse/MADLIB-1380
             Project: Apache MADlib
          Issue Type: New Feature
          Components: Module: k-Means Clustering
            Reporter: Frank McQuillan
             Fix For: v1.17


{code}
kmeans_random( rel_source,
               expr_point,
               k,                      -- a single value (as today) or an
                                       -- array of k values
               fn_dist,                -- optional
               agg_centroid,           -- optional
               max_num_iterations,     -- optional
               min_frac_reassigned,    -- optional
               k_selection_algorithm   -- optional (only applies if 'k' is an
                                       -- array with multiple k values)
             )

kmeanspp( rel_source,
          expr_point,
          k,                      -- a single value (as today) or an
                                  -- array of k values
          fn_dist,                -- optional
          agg_centroid,           -- optional
          max_num_iterations,     -- optional
          min_frac_reassigned,    -- optional
          seeding_sample_ratio,   -- optional
          k_selection_algorithm   -- optional (only applies if 'k' is an
                                  -- array with multiple k values)
        )

k
INTEGER or INTEGER[]. The number of centroids to calculate. Can be a single
value or an array of k values to explore. If an array of k values is given,
the parameter 'k_selection_algorithm' determines the evaluation method.

k_selection_algorithm (optional)
TEXT, default: 'elbow'. Method used to evaluate the number of centroids k.
Only applies if the parameter 'k' is an array with multiple k values.
Two approaches are currently supported: 'elbow' and 'silhouette'.
The text may be abbreviated; e.g., 'silh' selects the silhouette method.
{code}
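
As an illustration of the 'k_selection_algorithm' parameter described above, here is a minimal Python sketch of how a shortened name such as 'silh' might be resolved, and how an 'elbow' choice could be made from the clustering cost at each candidate k. This is not MADlib code: the names resolve_method and elbow_k are hypothetical, prefix matching is an assumption about how shortened names are interpreted, and the largest-change-in-slope rule is only one common elbow heuristic, which the eventual implementation may not share.

```python
METHODS = ("elbow", "silhouette")


def resolve_method(text="elbow"):
    """Map a possibly shortened name to a full method name.

    Assumption: shortened names are matched as prefixes, so 'silh'
    resolves to 'silhouette'.
    """
    key = text.strip().lower()
    matches = [m for m in METHODS if key and m.startswith(key)]
    if len(matches) != 1:
        raise ValueError(f"unknown k_selection_algorithm: {text!r}")
    return matches[0]


def elbow_k(ks, costs):
    """Return the k where the cost curve bends most sharply.

    ks must be sorted ascending and costs[i] is the k-means objective
    (e.g. total squared distance to centroids) obtained with ks[i].
    The bend at an interior k is the drop in slope from the segment
    before it to the segment after it; the endpoints are never chosen.
    """
    if len(ks) != len(costs) or len(ks) < 3:
        raise ValueError("need at least three (k, cost) pairs")
    best_i, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        left = (costs[i - 1] - costs[i]) / (ks[i] - ks[i - 1])
        right = (costs[i] - costs[i + 1]) / (ks[i + 1] - ks[i])
        bend = left - right
        if bend > best_bend:
            best_i, best_bend = i, bend
    return ks[best_i]
```

With made-up costs of 100, 40, 30, 27 and 25 for k = 2, 4, 6, 8 and 10, elbow_k picks k = 4, since the cost stops dropping steeply after that point.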

For example:
{code}
SELECT * FROM madlib.kmeanspp (
    'km_sample',                    -- rel_source
    'points',                       -- expr_point
    ARRAY[2, 4, 6, 8, 10],          -- k
    'madlib.squared_dist_norm2',    -- fn_dist
    'madlib.avg',                   -- agg_centroid
    20,                             -- max_num_iterations
    0.001,                          -- min_frac_reassigned
    1.0,                            -- seeding_sample_ratio
    'elbow'                         -- k_selection_algorithm
);
{code}
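
The issue also names 'silhouette' as a selection method. For reference, a minimal Python sketch of the mean silhouette score for one clustering follows; the candidate k with the highest score would be preferred. The function name mean_silhouette is hypothetical, Euclidean distance is assumed here, and this brute-force version compares every pair of points, so it is only meant to show the definition, not how MADlib would compute it at scale.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient of a clustering (higher is better).

    points is a list of coordinate tuples and labels[i] is the cluster
    id of points[i]. For each point, a is its mean distance to the other
    members of its own cluster and b is the smallest mean distance to
    the members of any other cluster; its score is (b - a) / max(a, b).
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        if denom > 0:  # skip degenerate case where all distances are 0
            total += (b - a) / denom
    return total / len(points)
```

For four points (0,0), (0,1), (10,0), (10,1), the labeling that groups the two left points and the two right points scores close to 1, while a labeling that pairs each left point with a right point scores below 0.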



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
