On Sat, May 30, 2015 at 09:13:16AM +0200, John Darrington wrote:
> On Fri, May 29, 2015 at 09:33:30AM -0500, Alan Mead wrote:
> > John suggested that I post to pspp-dev. I'm adding code to the k-means
> > (i.e., quick-cluster.c) procedure to show cluster membership.
> >
> > CLUSTER works perfectly on a trivial two-dimensional problem but it
> > fails miserably on some real data. For example, in one analysis
> > requesting 3 clusters on 98 cases, it found that everyone was in cluster
> > 3 and zero people were in clusters 1 & 2. I think part of it is that
> > the starting values seem to be a pattern of 1's and zero's, even though
> > the comments describe selecting random individuals as starting values.
> >
> > My question is about accessing the data. I copied other code to use a
> > "casereader" to iterate over the rows of data. Below are the relevant
> > parts of the code I've added that seems to display cluster membership.
> > If I want to randomly select cases as starting values, is there a way to
> > retrieve random records directly?
>
> Ben is the casereader expert! Maybe he can comment? But I think you might
> be able to use the function casereader_select (defined in casereader-select.c)
>
> casereader_select (subreader, random_number - 1, random_number + 1, 1);
>
> You would have to ensure that random_number was within the range of subreader.

That seems reasonable to me. If clustering actually wants a shuffled
version of the complete data set (I don't know whether it does), then
more efficient algorithms are probably available.
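
For what it's worth, if all that is needed is k random cases as starting
values (rather than a full shuffle), one option is reservoir sampling,
which picks k distinct case numbers in a single forward pass. The sketch
below is plain standalone C, not PSPP code: uniform_below and
sample_case_indices are made-up names for illustration, it uses rand()
rather than whatever random source PSPP prefers, and it records case
indices only. How those indices are then fetched from a casereader is
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns a uniformly distributed integer in
   [0, bound) from rand(), using rejection to avoid modulo bias.
   Assumes 0 < bound <= RAND_MAX + 1. */
static size_t
uniform_below (size_t bound)
{
  size_t range = (size_t) RAND_MAX + 1;
  size_t limit = range - range % bound;
  size_t r;

  assert (bound > 0 && bound <= range);
  do
    r = (size_t) rand ();
  while (r >= limit);
  return r % bound;
}

/* Hypothetical helper: stores K distinct case indices drawn from
   0...N-1 into CHOSEN, visiting the indices once in order
   (reservoir sampling, "Algorithm R").  Requires K <= N. */
static void
sample_case_indices (size_t n, size_t k, size_t *chosen)
{
  size_t i;

  assert (k <= n);
  for (i = 0; i < n; i++)
    {
      if (i < k)
        chosen[i] = i;          /* Fill the reservoir first. */
      else
        {
          /* Keep case I with probability K / (I + 1), replacing a
             randomly chosen reservoir slot. */
          size_t j = uniform_below (i + 1);
          if (j < k)
            chosen[j] = i;
        }
    }
}
```

For the example in this thread that would be something like
sample_case_indices (98, 3, chosen) after seeding with srand(). In a real
procedure the loop body would presumably hold on to the case itself
instead of its index, so the data would only need to be read once.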
