Frank McQuillan created MADLIB-1380:
---------------------------------------
Summary: Select number of centroids in k-means
Key: MADLIB-1380
URL: https://issues.apache.org/jira/browse/MADLIB-1380
Project: Apache MADlib
Issue Type: New Feature
Components: Module: k-Means Clustering
Reporter: Frank McQuillan
Fix For: v1.17
{code}
kmeans_random( rel_source,
               expr_point,
               k,                      -- a single value (as now) or an array of k values
               fn_dist,                -- optional
               agg_centroid,           -- optional
               max_num_iterations,     -- optional
               min_frac_reassigned,    -- optional
               k_selection_algorithm   -- optional (only applies if 'k' is an array with multiple k values)
             )

kmeanspp( rel_source,
          expr_point,
          k,                      -- a single value (as now) or an array of k values
          fn_dist,                -- optional
          agg_centroid,           -- optional
          max_num_iterations,     -- optional
          min_frac_reassigned,    -- optional
          seeding_sample_ratio,   -- optional
          k_selection_algorithm   -- optional (only applies if 'k' is an array with multiple k values)
        )

k
  INTEGER or INTEGER[]. The number of centroids to calculate. Can be a single value
  or an array of k values to explore. If an array of k values is given, the parameter
  'k_selection_algorithm' determines the evaluation method.

k_selection_algorithm (optional)
  TEXT, default: 'elbow'. Method used to evaluate the number of centroids k.
  Only applies if the parameter 'k' is an array with multiple k values.
  Two approaches are currently supported: 'elbow' and 'silhouette'.
  The text can be any abbreviation of these strings; e.g., 'silh' selects the
  silhouette method.
{code}
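
To make the 'elbow' option concrete, here is a minimal Python sketch of one common elbow heuristic: after scaling both axes to [0, 1], pick the k whose (k, objective) point lies farthest from the straight line joining the first and last points. The function name elbow_k and the choice of this particular heuristic are assumptions for illustration only; the issue does not specify how the elbow would be located.

```python
def elbow_k(k_values, objectives):
    """Hypothetical helper: return the k at the 'elbow' of an objective curve.

    k_values   -- candidate numbers of centroids
    objectives -- k-means objective (e.g. sum of squared distances) for each k
    """
    if len(k_values) != len(objectives) or len(k_values) < 3:
        raise ValueError("need at least three (k, objective) pairs")

    # Sort by k so the first and last points are the ends of the curve.
    pairs = sorted(zip(k_values, objectives))
    ks = [k for k, _ in pairs]
    objs = [o for _, o in pairs]

    k_span = ks[-1] - ks[0]
    o_min = min(objs)
    o_span = max(objs) - o_min
    if k_span == 0 or o_span == 0:
        return ks[0]  # flat or degenerate curve: no elbow to find

    # Scale both axes to [0, 1] so neither dominates the distance.
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(o - o_min) / o_span for o in objs]

    # Perpendicular distance of each point from the chord joining the ends.
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = (dx * dx + dy * dy) ** 0.5
    best_k, best_d = ks[0], -1.0
    for k, x, y in zip(ks, xs, ys):
        d = abs(dy * (x - xs[0]) - dx * (y - ys[0])) / norm
        if d > best_d:
            best_k, best_d = k, d
    return best_k
```

A caller would run k-means once per candidate k, collect the objective values, and pass both lists to elbow_k. Scaling the axes first keeps the result independent of the units of the objective.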
For example:
{code}
SELECT * FROM madlib.kmeanspp (
    'km_sample',                   -- rel_source
    'points',                      -- expr_point
    'ARRAY[2, 4, 6, 8, 10]',       -- k
    'madlib.squared_dist_norm2',   -- fn_dist
    'madlib.avg',                  -- agg_centroid
    20,                            -- max_num_iterations
    0.001,                         -- min_frac_reassigned
    1.0,                           -- seeding_sample_ratio (needed so 'elbow' lands in the last position)
    'elbow'                        -- k_selection_algorithm
);
{code}
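
For the 'silhouette' option named in the parameter description, the following Python sketch computes the mean silhouette coefficient for one clustering. Points and labels are plain Python lists, and the function name mean_silhouette is hypothetical. Choosing the k with the highest mean score is an assumption about how the selection might work, not something the issue states.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean_silhouette(points, labels):
    """Hypothetical helper: mean silhouette coefficient of a clustering.

    points -- list of equal-length coordinate tuples
    labels -- cluster label for each point, in the same order
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the point's own cluster
        # (dist(p, p) is 0, so summing over the whole cluster is safe).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Scores close to 1 indicate tight, well-separated clusters, and negative scores indicate points that sit closer to a neighbouring cluster than to their own. This version compares every pair of points, so it is only practical for small inputs.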
--
This message was sent by Atlassian JIRA (v7.6.14#76016)