Re: K-Means Clustering

Mehmet Hakan Satman Tue, 15 Mar 2011 05:14:13 -0700

Hi John,

1. In the case of no variable definition, cmd_quick_cluster returns a 
CMD_FAILURE now.
2. By the now, default number of clusters and the mxiter are both set to 2.
3. Your test strategy is very well, but multi-dimensionality is also a problem. 
I tried this with your code:


______________________________________________________________
input program.
        loop #i = 1 to 50000.
                compute x = rv.uniform (0, 1).
                compute y = rv.uniform (0, 1).
                compute z = rv.uniform (0, 1).
                end case.
        end loop.
        loop #i = 1 to 50000.
                compute x = rv.uniform (0, 1)+1.
                compute y = rv.uniform (0, 1)+1.
                compute z = rv.uniform (0, 1)+1.
                end case.
        end loop.
          loop #i = 1 to 50000.
                compute x = rv.uniform (0, 1)+2.
                compute y = rv.uniform (0, 1)+2.
                compute z = rv.uniform (0, 1)+2.
                end case.
        end loop.      
        loop #i = 1 to 50000.
                compute x = rv.uniform (0, 1)+3.
                compute y = rv.uniform (0, 1)+3.
                compute z = rv.uniform (0, 1)+3.
                end case.
        end loop.
        end file.
end input program.


QUICK CLUSTER ALL
        /CRITERIA = CLUSTER(4) MXITER (100).
______________________________________________________________

The expected cluster centers are 
(0.5, 0.5, 0.5)
(1.5, 1.5, 1.5)
(2.5, 2.5, 2.5)
(3.5, 3.5, 3.5)

and the actual output is

Result:
Number of cases: 200000
Number of variables: 3
Number of groups: 4
Number of trials: 1
Number of iterations at last trial: 5
Centers:
Center of Group 1: 1.498 1.501 1.502 
Center of Group 2: 3.500 3.500 3.501 
Center of Group 3: 2.498 2.499 2.499 
Center of Group 4: 0.501 0.502 0.502 

which is nearly same as the expected one. I think its ok. It is also be tried 
that generating some big data with R and using kmeans command to calculate 
centers and comparing one-or-two samples with pspp. But as in your simulation 
study, cluster centers map to correct quantiles of uniform distribution.


4. I re-formatted the code using indent with --gnu-style parameter.

5. This is the part what i didn't understand. Which part of code must be 
labeled 
as "static"? If the answer is "All methods must be defined as static", what i 
must do for "struct Kmeans* kmeans_create(...." ?

6. I changed malloc 's to xmalloc 's. I think this is why we didn't control the 
null pointer in memory allocations.

7. I used the gsl_rng_* things instead of standard library.

8) this link (http://www.norusis.com/pdf/SPC_v13.pdf) and googling "kmeans 
spss" 
may help for the output of quick cluster. I haven't got a copy of SPSS too.  


Ben has sent me the form about fsf. I answered the questions and re-sent the 
given address.

I hope corrections are suitable for the project.

Best.


 Mehmet Hakan Satman
http://www.mhsatman.com





________________________________
From: John Darrington <[email protected]>
To: Mehmet Hakan Satman <[email protected]>
Cc: John Darrington <[email protected]>; [email protected]
Sent: Tue, March 15, 2011 11:23:34 AM
Subject: Re: K-Means Clustering

On Mon, Mar 14, 2011 at 12:22:36PM -0700, Mehmet Hakan Satman wrote:
     Hi John,
    
     1) I renamed the file as "quick-cluster.c"
    
     2. I added an entry to? "src/language/stats/automake.mk" for quick-cluster
    
     3. I removed the entry "UNIMPL_CMD ("QUICK CLUSTER", "Fast clustering")" 
from command.def file.

Thanks.  I tried some experiments with it.  It looks promising.  But there are 
some improvements
which can be made.
    
     4. Now cmd_quick_cluster can parse a command line like:
    
     QUICK CLUSTER x y z 
     ? ? ? /CRITERIA = CLUSTER(5) MXITER (100).
    
I inadvertently ran it with the wrong syntax (I typed just "QUICK CLUSTER." 
without any variables),
and it caused PSPP to crash.  You should check the return value of 
parse_variables_const 

and return an error if it fails.  See the code for the other procedures to see 
how to do this.

It also crashed if I omitted the /CRITERIA subcommand because your algorithm 
expects 

the number of groups is greater than 0.  The spss documentation says that
the CLUSTER and MXITER parameters both default to 2.  So you should initialise 
them accordingly.

     As
      I mentioned, i test my results with random data with uniform 
     distributed random values. It can not be considered as a comprehensive 
     work and should be tested with simulations. 

It's not my field of expertise, but I ran it with the following syntax:

input program.
        loop #i = 1 to 100000.
                compute x = rv.uniform (0, 1).
                end case.
        end loop.
        end file.
end input program.


QUICK CLUSTER ALL
        /CRITERIA = CLUSTER(3) MXITER (100).

and got :

Centers:
Center of Group 1: 0.499 
Center of Group 2: 0.833 
Center of Group 3: 0.165 

    
which is close to what I would expect (the centres are 1/6, 3/6 and 5/6).  Can 
you 

suggest some more comprehensive tests?


I have some general suggestions about the quick-cluster.c file:

1. The formatting style doesn't really fit the GNU way of doing things.  I 
recommend
   that you run the command "indent --gnu-style quick-cluster.c" to make it 
more 
consistent
   with the rest of the code.   You might want to read the information at 
  http://www.gnu.org/prep/standards/standards.html which explains how GNU 
software is 

   normally written and why we do it that way.

2. When compiling, I get a number of warnings.  Most of these are due to 
missing 
"static"
   qualifiers from the functions.

3. In PSPP we don't use the stdlib "malloc".  Instead we use "xmalloc" from 
gnulib.

4. Similarly, we don't use the srand and rand functions.  Use the gsl_rng_* 
functions.
   These are supposed to be better random number generators.
   See the file src/language/xforms/sample.c and/or the gsl manual for an 
example.


I'm looking forward to seeing the QUICK CLUSTER command integrated into PSPP.  
I 
tried
to find some examples of how spss presents its output for this command but I 
couldn't
find any.  Do you have any such examples or do you have access to a copy of 
pspp 
so that
we can see how users might expect to see the results?

Regards,


John




-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.

/* PSPP - a program for statistical analysis.
   Copyright (C) 2009, 2010 Free Software Foundation, Inc.

   This program is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation, either version 3 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program.  If not, see <http://www.gnu.org/licenses/>. */

#include <config.h>

#include <math.h>

#include <libpspp/misc.h>

#include <libpspp/str.h>
#include <libpspp/message.h>


#include <data/procedure.h>
#include <data/missing-values.h>
#include <data/casereader.h>
#include <data/casegrouper.h>
#include <data/dictionary.h>
#include <data/format.h>

#include <language/lexer/variable-parser.h>
#include <language/command.h>
#include <language/lexer/lexer.h>

#include <math/moments.h>
#include <output/tab.h>
#include <output/text-item.h>

#include <stdio.h>

#include <gsl/gsl_matrix.h>
#include <gsl/gsl_statistics.h>

#include <gsl/gsl_rng.h>

#include "gettext.h"
#define _(msgid) gettext (msgid)
#define N_(msgid) msgid

struct quick_cluster
{
  const char **varNames;
  int numGroups;
};

/*
Struct KMeans:
Holds all of the information for the functions.
*/
struct Kmeans
{
  gsl_matrix *data;             //User Data (Given by the user)
  gsl_matrix *centers;          //Centers for groups
  gsl_vector_int *index;        //Integer values from zero to ngroups. Shows 
group of an observation.
  gsl_vector *v1, *v2, *v3;     //Temporary vector for program. Do not use.
  gsl_rng *rng;                 //Random Number Generator.
  int ngroups;                  //Number of group. (Given by the user)
  int n;                        //Number of observations. (Given by the user)
  int m;                        //Number of observations. (Given by the user)
  int maxiter;                  //Maximum number of iterations (Given by the 
user)
  int lastiter;                 //Show at which iteration it found the solution.
  int trials;                   //If not convergence, how many times has 
clustering done.
  double *weights;              //Double values for handling weights for 
program use.
};

struct Kmeans *
kmeans_create (double *data, int n, int m, int ngroups, int maxiter)
{
  int i, j;
  struct Kmeans *k = (struct Kmeans *) xmalloc (sizeof (struct Kmeans));
  k->data = gsl_matrix_alloc (n, m);
  k->centers = gsl_matrix_alloc (ngroups, m);
  k->ngroups = ngroups;
  k->index = gsl_vector_int_alloc (n);
  k->n = n;
  k->m = m;
  k->maxiter = maxiter;
  k->lastiter = 0;
  k->trials = 0;
  for (i = 0; i < n; i++)
    {
      for (j = 0; j < m; j++)
        {
          gsl_matrix_set (k->data, i, j, data[i * m + j]);
        }
    }
  k->weights = (double *) malloc (sizeof (double) * k->index->size);
  k->v1 = gsl_vector_alloc (k->centers->size2);
  k->v2 = gsl_vector_alloc (k->centers->size2);
  k->v3 = gsl_vector_alloc (k->n);
  k->rng = gsl_rng_alloc (gsl_rng_taus);
  return (k);
}

void
kmeans_randomize_centers (struct Kmeans *kmeans)
{
  int i, j;
  int randind;
  for (i = 0; i < kmeans->centers->size1; i++)
    {
      randind = (int) (gsl_rng_uniform (kmeans->rng) * kmeans->data->size1);
      for (j = 0; j < kmeans->centers->size2; j++)
        {
          gsl_matrix_set (kmeans->centers, i, j,
                          gsl_matrix_get (kmeans->data, randind, j));
        }
    }
}

void
kmeans_randomize_index (struct Kmeans *kmeans)
{
  int i;
  for (i = 0; i < kmeans->index->size; i++)
    {
      kmeans->index->data[i] = -1;
    }
}

void
kmeans_print (struct Kmeans *kmeans)
{
  int i, j;
  printf ("Number of groups: %d\n", kmeans->ngroups);
  printf ("Centers:\n");
  for (i = 0; i < kmeans->centers->size1; i++)
    {
      for (j = 0; j < kmeans->centers->size2; j++)
        {
          printf ("%f ", gsl_matrix_get (kmeans->centers, i, j));
        }
      printf ("\n");
    }

  printf ("Index:\n");
  for (i = 0; i < kmeans->n; i++)
    {
      printf ("%d ", gsl_vector_int_get (kmeans->index, i));
    }
  printf ("\nLast iter: %d\n", kmeans->lastiter);
}


void
print_matrix (gsl_matrix * m)
{
  int i, j;
  for (i = 0; i < m->size1; i++)
    {
      for (j = 0; j < m->size2; j++)
        {
          printf ("%f ", m->data[i * m->size2 + j]);
        }
      printf ("\n");
    }
}


double
kmeans_euclidean_distance (gsl_vector * v1, gsl_vector * v2)
{
  double result = 0.0;
  double val;
  int i;
  for (i = 0; i < v1->size; i++)
    {
      val = v1->data[i] - v2->data[i];
      result += val * val;
    }
  return (result);
}



int
kmeans_num_elements_group (struct Kmeans *kmeans, int group)
{
  int total = 0;
  int i;
  for (i = 0; i < kmeans->n; i++)
    {
      if (gsl_vector_int_get (kmeans->index, i) == group)
        total++;
    }
  return (total);
}


void
kmeans_recalculate_centers (struct Kmeans *kmeans)
{
  int i, j, h;
  int elm;
  double mean;
  gsl_vector *v1 = kmeans->v3;

  for (i = 0; i < kmeans->ngroups; i++)
    {
      elm = kmeans_num_elements_group (kmeans, i);
      for (j = 0; j < kmeans->index->size; j++)
        {
          if (gsl_vector_int_get (kmeans->index, j) == i)
            {
              kmeans->weights[j] = 1.0;
            }
          else
            {
              kmeans->weights[j] = 0.0;
            }
        }

      for (h = 0; h < kmeans->m; h++)
        {
          gsl_matrix_get_col (v1, kmeans->data, h);
          mean = gsl_stats_wmean (kmeans->weights, 1, v1->data, 1, v1->size);
          gsl_matrix_set (kmeans->centers, i, h, mean);
        }
    }
}


void
kmeans_calculate_indexes (struct Kmeans *kmeans)
{
  double dist;
  double mindist;
  int bestindex = 0;
  int i, j;
  gsl_vector *v1 = kmeans->v1;
  gsl_vector *v2 = kmeans->v2;

  for (i = 0; i < kmeans->n; i++)
    {
      mindist = INFINITY;
      gsl_matrix_get_row (v1, kmeans->data, i);
      for (j = 0; j < kmeans->ngroups; j++)
        {
          gsl_matrix_get_row (v2, kmeans->centers, j);
          dist = kmeans_euclidean_distance (v1, v2);
          if (dist < mindist)
            {
              mindist = dist;
              bestindex = j;
            }
        }
      gsl_vector_int_set (kmeans->index, i, bestindex);
    }
}



int
kmeans_check_converge (gsl_vector_int * current,
                       gsl_vector_int * old, struct Kmeans *kmeans)
{
  int i;
  int total = 0;
  for (i = 0; i < current->size; i++)
    {
      if (current->data[i] == old->data[i])
        total++;
      old->data[i] = current->data[i];
    }
  return (current->size - total);
}



gsl_matrix *
kmeans_getGroup (struct Kmeans * kmeans, int index)
{
  int i;
  int j = 0;
  int elm = kmeans_num_elements_group (kmeans, index);
  gsl_matrix *agroup = gsl_matrix_alloc (elm, kmeans->data->size2);
  gsl_vector *v1 = gsl_vector_alloc (kmeans->data->size2);
  for (i = 0; i < kmeans->data->size1; i++)
    {
      if (kmeans->index->data[i] == index)
        {
          gsl_matrix_get_row (v1, kmeans->data, i);
          gsl_matrix_set_row (agroup, j, v1);
          j++;
        }
    }
  return (agroup);
}



static void
kmeans_cluster (struct Kmeans *kmeans)
{
  int i;
  double *ind;
  double sum;
  bool redo;
  gsl_vector_int *oldindex = gsl_vector_int_alloc (kmeans->index->size);

cluster:
  redo = false;
  kmeans_randomize_centers (kmeans);
  for (kmeans->lastiter = 0; kmeans->lastiter < kmeans->maxiter;
       kmeans->lastiter++)
    {
      kmeans_calculate_indexes (kmeans);
      kmeans_recalculate_centers (kmeans);
      if (kmeans_check_converge (kmeans->index, oldindex, kmeans) == 0)
        break;
    }

  for (i = 0; i < kmeans->ngroups; i++)
    {
      if (kmeans_num_elements_group (kmeans, i) == 0)
        {
          kmeans->trials++;
          redo = true;
          break;
        }
    }
  if (redo)
    goto cluster;

}


void
quick_cluster_show_results (struct Kmeans *kmeans)
{
  int i, j;
  printf ("Number of cases: %d\n", kmeans->n);
  printf ("Number of variables: %d\n", kmeans->m);
  printf ("Number of groups: %d\n", kmeans->ngroups);
  printf ("Number of trials: %d\n", (kmeans->trials + 1));
  printf ("Number of iterations at last trial: %d\n", (kmeans->lastiter + 1));
  printf ("Centers:\n");
  for (i = 0; i < kmeans->centers->size1; i++)
    {
      printf ("Center of Group %d: ", (i + 1));
      for (j = 0; j < kmeans->centers->size2; j++)
        {
          printf ("%0.3f ",
                  kmeans->centers->data[i * kmeans->centers->size2 + j]);
        }
      printf ("\n");
    }
  printf ("Data Index:\n");
  for (int i = 0; i < kmeans->index->size; i++)
    {
      printf ("%3d ", kmeans->index->data[i] + 1);
    }
  printf ("\n");
}


int
cmd_quick_cluster (struct lexer *lexer, struct dataset *ds)
{
  struct Kmeans *kmeans;
  double *data;
  struct ccase *c;
  bool ok;
  struct dictionary *dict = dataset_dict (ds);
  int n;
  struct variable **variables;
  struct casereader *input, *inputnew;
  int groups = 2;
  int numobs = 0;
  int maxiter = 2;
  struct quick_cluster *qc =
    (struct quick_cluster *) malloc (sizeof (struct quick_cluster));
  int i;


  if (!parse_variables_const (lexer, dict, &variables, &n,
                              PV_NO_DUPLICATE | PV_NUMERIC))
    {
      printf ("Variables cannot parse.\n");
      return (CMD_FAILURE);
    }

  if (lex_match (lexer, T_SLASH))
    {
      if (lex_match_id (lexer, "CRITERIA"))
        {
          lex_match (lexer, T_EQUALS);
          while (lex_token (lexer) != T_ENDCMD
                 && lex_token (lexer) != T_SLASH)
            {
              if (lex_match_id (lexer, "CLUSTERS"))
                {
                  if (lex_force_match (lexer, T_LPAREN))
                    {
                      lex_force_int (lexer);
                      groups = lex_integer (lexer);
                      lex_get (lexer);
                      lex_force_match (lexer, T_RPAREN);
                    }
                }
              else if (lex_match_id (lexer, "MXITER"))
                {
                  if (lex_force_match (lexer, T_LPAREN))
                    {
                      lex_force_int (lexer);
                      maxiter = lex_integer (lexer);
                      lex_get (lexer);
                      lex_force_match (lexer, T_RPAREN);
                    }
                }
              else
                {
                  //further command set
                  printf ("Error parsing command.\n");
                  return (CMD_FAILURE);
                }
            }
        }
    }


  inputnew = proc_open (ds);
  numobs = casereader_count_cases (inputnew);

  if (groups > numobs)
    {
      printf
        ("Number of groups cannot be larger than the number of cases.\n");
      ok = casereader_destroy (inputnew);
      proc_commit (ds);
      return (CMD_FAILURE);
    }

  data = (double *) xmalloc (numobs * n * sizeof (double));
  i = 0;
  //j = 0;
  for (; (c = casereader_read (inputnew)) != NULL; case_unref (c))
    {
      int v;
      double x;
      for (v = 0; v < n; ++v)
        {
          x = case_data (c, variables[v])->f;
          data[i * n + v] = x;
        }
      i++;
    }

  ok = casereader_destroy (inputnew);
  ok = proc_commit (ds) && ok;
  kmeans = kmeans_create (data, numobs, n, groups, maxiter);
  kmeans_cluster (kmeans);
  quick_cluster_show_results (kmeans);

  return (ok);
}

_______________________________________________
pspp-dev mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/pspp-dev

Re: K-Means Clustering

Reply via email to