James Ravenscroft wrote: > Dear All, > > I am currently trying to write a simple Agglomerative Clustering > algorithm which sorts through my MP3 collection and uses associated > Last.FM tags to cluster files into 'genres'. Unfortunately, I'm having > some trouble with my algorithm and some tracks are ending up in multiple > clusters. I have a feeling this is because of (my lack of understanding > of) list behaviours within Python. This is all purely a learning > exercise, so any input would be greatly appreciated > > The actual clustering method can be seen described here. Each Cluster is > stored within the 'clusters' array and has two elements: the data > containing song metadata such as artist, title and most importantly > tags, and a weight: that is, the overall Euclidean weight of the cluster > (based on the song data within). > > In theory, the algorithm runs in a loop until all clusters have been > merged into a hierarchy. It takes the weight of each cluster and merges > each cluster with their closest counterpart. My code has a lot of > comments so it should be fairly easy to understand.
> tmp = [] > > for c in self.clusters: > > closestCluster = None > closestDelta = float('inf') > > for c2 in self.clusters: > > #skip if this is the same node > if(c == c2): > continue > > delta = abs(c2['weight'] - c['weight']) > > if(delta < closestDelta): > closestCluster = c2 > closestDelta = delta > > > print "Merging clusters %(c1)d and %(c2)d" % {'c1' : > self.clusters.index(c), > 'c2' : > self.clusters.index(closestCluster)} > #now merge the two clusters > self.clusters.remove(closestCluster) > self.clusters.remove(c) > > c['data'].extend(closestCluster['data']) > > tmp.append(c) I can't run your code because you didn't make it standalone, but I believe that the problem (at least one of them) is that you iterate over self.clusters and modify it from within the loop. That's a big no-no in python. A simple example to demonstrate the effects: >>> import random >>> old = range(10) >>> new = [] >>> for item in old: ... closest = random.choice(old) ... new.append((item, closest)) ... old.remove(item) ... old.remove(closest) ... >>> old [3, 4] >>> new [(0, 8), (2, 1), (5, 7), (9, 6)] The remedy is normally to iterate over a copy for item in list(old): ... but in your case that is probably not enough. Try something along these lines: # untested while len(self.clusters) > 1: c = self.clusters.pop() # find closest index for i, c2 in enumerate(self.clusters): ... if ...: closest_index = i closest = self.clusters.pop(closest_index) tmp.append(c + closest) if self.clusters: tmp.append(self.clusters[0]) Peter -- http://mail.python.org/mailman/listinfo/python-list