Hi John,
It is good to receive such an email from you. Being involved in a project like PSPP is important to me. I think I can add this functionality to PSPP if the members of the project help me. Alternatively, I could try implementing other clustering algorithms in PSPP, such as single linkage and complete linkage, with several distance functions.

Note: if I am not mistaken, "implementing such a command in PSPP" means making the k-means clustering code callable through the syntax

  QUICK CLUSTER x y
    /MISSING=LISTWISE
    /CRITERIA=CLUSTER(2) MXITER(10) CONVERGE(0)
    /METHOD=KMEANS(NOUPDATE)
    /PRINT INITIAL ANOVA CLUSTER DISTAN.

Is that right? Also: I have no experience with GTK+ coding.

Best regards

________________________________
From: John Darrington <[email protected]>
To: Mehmet Hakan Satman <[email protected]>
Cc: [email protected]
Sent: Thu, March 10, 2011 1:07:59 PM
Subject: Re: K-Means Clustering

Hello Mehmet,

I think the code is well written and is something that we could use in PSPP. Like you say, perhaps it doesn't lend itself to a generalized clustering algorithm, but I think it would nicely satisfy the needs of the QUICK CLUSTER command.

Would you like to try implementing such a command in PSPP? Ask on the mailing list if you need help getting started.

Regards,

John

On Wed, Mar 09, 2011 at 05:51:44AM -0800, Mehmet Hakan Satman wrote:

Hi John,

I am sending the .h, the .c, and the test.c files as attachments. My implementation clusters a random data set with 10000 cases, 10 parameters, and 10 clusters in 0.45 seconds on average. The algorithm starts with randomly selected cluster centers, and in each step these centers are updated. Distances between the centers and the observations must be recalculated at every step, so I am not sure whether a "create-one-and-use-anywhere" distance structure can be used in this algorithm.
In hierarchical clustering algorithms such as single-linkage clustering, it is possible to build a distance matrix once and use it throughout the clustering.

Best regards,

Hakan.

________________________________
From: John Darrington <[email protected]>
To: Mehmet Hakan Satman <[email protected]>
Cc: [email protected]
Sent: Wed, March 9, 2011 3:21:59 PM
Subject: Re: K-Means Clustering

Hi Mehmet!

As I mentioned, the CLUSTER command is something which I think it would be great to support. One issue with clustering is its memory complexity: it requires O(n^2) storage, where n is the number of cases being clustered. Have you tested your algorithm with large numbers of cases?

Maybe Ben has some ideas about how an efficient distance matrix could be implemented in PSPP (maybe sparse-array.c can help?). In any case, I'd be interested to see your code and the results of your comparisons. Can you post them somewhere?

J'

On Tue, Mar 08, 2011 at 05:56:37AM -0800, Mehmet Hakan Satman wrote:

Hi everybody,

I am interested in PSPP, and I read about the need to develop some functionality. I implemented a k-means clustering library using the GNU Scientific Library and sent an informative e-mail to John. He suggested that I join this group and share my ideas with the staff. I compared the results with SPSS output. The analysis-of-variance table is not complete yet, but we may add this feature. I would be glad to contribute something to PSPP and work with you. What do you think about this?

--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.

/* kmeans - K-Means Clustering Library for the GNU PSPP
   (People Should Prefer PSPP) project.
   Copyright (C) 2011 Dr. Mehmet Hakan Satman <[email protected]>

   This program is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation, either version 3 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program.  If not, see <http://www.gnu.org/licenses/>. */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_statistics.h>

/* Struct Kmeans: Holds all of the information for the functions. */
struct Kmeans {
    gsl_matrix *data;         //User data (given by the user)
    gsl_matrix *centers;      //Centers of the groups
    gsl_vector_int *index;    //Integer values from zero to ngroups - 1; shows the group of each observation
    gsl_vector *v1, *v2, *v3; //Temporary vectors for internal use. Do not use.
    int ngroups;              //Number of groups (given by the user)
    int n;                    //Number of observations (given by the user)
    int m;                    //Number of variables (given by the user)
    int maxiter;              //Maximum number of iterations (given by the user)
    int lastiter;             //Iteration at which the solution was found
    double *weights;          //Weights, for internal use
};

/* Creates a Kmeans structure for the given data 'data', number of
   observations 'n', number of columns 'm', number of groups 'ngroups'
   (usually called 'k'), and maximum number of iterations 'maxiter'. */
struct Kmeans *kmeans_create(double *data, int n, int m, int ngroups, int maxiter);

/* Randomly chooses centers for the 'ngroups' groups. These centers are
   initial and will be changed iteratively.
*/
void kmeans_randomize_centers(struct Kmeans *kmeans);

/* Prints the given Kmeans structure. */
void kmeans_print(struct Kmeans *kmeans);

/* Calculates the squared Euclidean distance between vectors v1 and v2. */
double kmeans_euclidean_distance(gsl_vector *v1, gsl_vector *v2);

/* Calculates and returns the number of elements contained in the given group. */
int kmeans_num_elements_group(struct Kmeans *kmeans, int group);

/* Recalculates the group centers. These centers are updated iteratively
   and are usually different from the initial centers. Recalculation
   continues while the last two assignments are not equal. */
void kmeans_recalculate_centers(struct Kmeans *kmeans);

/* Constructs the index vector, which holds the current group of each
   single observation. */
void kmeans_calculate_indexes(struct Kmeans *kmeans);

/* Checks whether the last two index vectors are equal. If they are,
   the algorithm cannot find a better classification and stops. */
int kmeans_check_converge(gsl_vector_int *current, gsl_vector_int *old);

/* This is the main method of the algorithm.
*/
void kmeans_cluster(struct Kmeans *kmeans);

#include "kmeans.h"

struct Kmeans *kmeans_create(double *data, int n, int m, int ngroups, int maxiter)
{
    int i, j;
    struct Kmeans *k = (struct Kmeans *) malloc(sizeof(struct Kmeans));
    k->data = gsl_matrix_alloc(n, m);
    k->centers = gsl_matrix_alloc(ngroups, m);
    k->ngroups = ngroups;
    k->index = gsl_vector_int_alloc(n);
    k->n = n;
    k->m = m;
    k->maxiter = maxiter;
    k->lastiter = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            gsl_matrix_set(k->data, i, j, data[i * m + j]);
        }
    }
    k->weights = (double *) malloc(sizeof(double) * k->index->size);
    k->v1 = gsl_vector_alloc(k->centers->size2);
    k->v2 = gsl_vector_alloc(k->centers->size2);
    k->v3 = gsl_vector_alloc(k->n);
    return (k);
}

void kmeans_randomize_centers(struct Kmeans *kmeans)
{
    int i, j;
    double min = 0, max = 0;
    gsl_matrix_minmax(kmeans->data, &min, &max);
    for (i = 0; i < kmeans->centers->size1; i++) {
        for (j = 0; j < kmeans->centers->size2; j++) {
            gsl_matrix_set(kmeans->centers, i, j,
                           min + (((double) rand()) / RAND_MAX) * (max - min));
        }
    }
}

void kmeans_print(struct Kmeans *kmeans)
{
    int i, j;
    printf("Number of groups: %d\n", kmeans->ngroups);
    printf("Centers:\n");
    for (i = 0; i < kmeans->centers->size1; i++) {
        for (j = 0; j < kmeans->centers->size2; j++) {
            printf("%f ", gsl_matrix_get(kmeans->centers, i, j));
        }
        printf("\n");
    }
    printf("Index:\n");
    for (i = 0; i < kmeans->n; i++) {
        printf("%d ", gsl_vector_int_get(kmeans->index, i));
    }
    printf("\nLast iter: %d\n", kmeans->lastiter);
}

void print_matrix(gsl_matrix *m)
{
    int i, j;
    for (i = 0; i < m->size1; i++) {
        for (j = 0; j < m->size2; j++) {
            printf("%f ", m->data[i * m->size2 + j]);
        }
        printf("\n");
    }
}

double kmeans_euclidean_distance(gsl_vector *v1, gsl_vector *v2)
{
    double result = 0.0;
    double val;
    int i;
    for (i = 0; i < v1->size; i++) {
        val = v1->data[i] - v2->data[i];
        result += val * val;
    }
    return (result);
}

int kmeans_num_elements_group(struct Kmeans *kmeans, int group)
{
    int total = 0;
    int i;
    for (i = 0; i < kmeans->n; i++) {
        if (gsl_vector_int_get(kmeans->index, i) == group)
            total++;
    }
    return (total);
}

void kmeans_recalculate_centers(struct Kmeans *kmeans)
{
    int i, j, h;
    double mean;
    gsl_vector *v1 = kmeans->v3;
    for (i = 0; i < kmeans->ngroups; i++) {
        for (j = 0; j < kmeans->index->size; j++) {
            if (gsl_vector_int_get(kmeans->index, j) == i) {
                kmeans->weights[j] = 1.0;
            } else {
                kmeans->weights[j] = 0.0;
            }
        }
        for (h = 0; h < kmeans->m; h++) {
            gsl_matrix_get_col(v1, kmeans->data, h);
            mean = gsl_stats_wmean(kmeans->weights, 1, v1->data, 1, v1->size);
            gsl_matrix_set(kmeans->centers, i, h, mean);
        }
    }
}

void kmeans_calculate_indexes(struct Kmeans *kmeans)
{
    double dist;
    double mindist;
    int bestindex = 0;
    int i, j;
    gsl_vector *v1 = kmeans->v1;
    gsl_vector *v2 = kmeans->v2;
    for (i = 0; i < kmeans->n; i++) {
        mindist = INFINITY;
        gsl_matrix_get_row(v1, kmeans->data, i);
        for (j = 0; j < kmeans->ngroups; j++) {
            gsl_matrix_get_row(v2, kmeans->centers, j);
            dist = kmeans_euclidean_distance(v1, v2);
            if (dist < mindist) {
                mindist = dist;
                bestindex = j;
            }
        }
        gsl_vector_int_set(kmeans->index, i, bestindex);
    }
}

int kmeans_check_converge(gsl_vector_int *current, gsl_vector_int *old)
{
    int i;
    int total = 0;
    for (i = 0; i < current->size; i++) {
        if (current->data[i] == old->data[i])
            total++;
        old->data[i] = current->data[i];
    }
    return (current->size - total);
}

gsl_matrix *kmeans_getGroup(struct Kmeans *kmeans, int index)
{
    int i;
    int j = 0;
    int elm = kmeans_num_elements_group(kmeans, index);
    gsl_matrix *agroup = gsl_matrix_alloc(elm, kmeans->data->size2);
    gsl_vector *v1 = gsl_vector_alloc(kmeans->data->size2);
    for (i = 0; i < kmeans->data->size1; i++) {
        if (kmeans->index->data[i] == index) {
            gsl_matrix_get_row(v1, kmeans->data, i);
            gsl_matrix_set_row(agroup, j, v1);
            j++;
        }
    }
    gsl_vector_free(v1);   //was leaked before
    return (agroup);
}

void kmeans_cluster(struct Kmeans *kmeans)
{
    int i;
    gsl_vector_int *oldindex = gsl_vector_int_alloc(kmeans->index->size);
    gsl_vector_int_set_all(oldindex, -1);   //so the first convergence check cannot succeed by accident
    kmeans_randomize_centers(kmeans);
    for (i = 0; i < kmeans->maxiter; i++) {
        kmeans->lastiter = i;
        kmeans_calculate_indexes(kmeans);
        kmeans_recalculate_centers(kmeans);
        if (kmeans_check_converge(kmeans->index, oldindex) == 0)
            break;
    }
    gsl_vector_int_free(oldindex);
}

#include <stdlib.h>
#include "kmeans.h"

int main()
{
    srand(12345);
    int i, j;
    double val;
    int n = 10000;     //Number of cases
    int m = 10;        //Number of variables
    int k = 10;        //Number of groups
    int maxiter = 100;
    double *data = (double *) malloc(n * m * sizeof(double));
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            val = ((double) rand()) / (double) RAND_MAX;
            data[i * m + j] = val * (i + j);
        }
    }
    struct Kmeans *km = kmeans_create(data, n, m, k, maxiter);
    kmeans_cluster(km);
    //kmeans_print(km);
    free(data);
    return 0;
}

--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
_______________________________________________
pspp-dev mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/pspp-dev
