On Wed, 10 Aug 2005, Paolino wrote:

> I have a self organizing net which aim is clustering words. Let's think
> the clustering is about their 2-grams set. Words then are instances of
> this class.
>
> class clusterable(str):
>   def __abs__(self):# the set of q-grams (to be calculated only once)
>     return set([(self+self[0])[n:n+2] for n in range(len(self))])
>   def __sub__(self,other): # the q-grams distance between 2 words
>     set1=abs(self)
>     set2=abs(other)
>     return len(set1|set2)-len(set1&set2)

Firstly:

- What do you mean by "to be calculated only once"? The code in __abs__
  will run every time anyone calls abs() on the object. Do you mean that
  clients should avoid calling abs more than once? If so, how about
  memoising the function, or computing the 2-gram set up front, so clients
  don't need to worry about it?

- Could i suggest frozenset instead of set, since the 2-gram set of a
  string can't change?

- How about making the last line "return len(set1 ^ set2)"?

> I'm looking for the medium of a set of words, as the word which
> minimizes the sum of the distances from those words.

I think i understand. Does the word have to be drawn from the set of words
you're looking at? You can do that straightforwardly like this:

def distance(w, ws):
    return sum([w - x for x in ws])

def medium(ws):
    return min([(distance(w, ws), w) for w in ws])[1]

However, this is not terribly efficient - it's O(N**2) if you're counting
calls to __sub__. If you want a more efficient algorithm, well, that's
tricky. Luckily, i am one of the most brilliant hackers alive, so here is
an O(N) solution:

def distance_(w, counts, h, n):
    """Returns the total distance from the word to the words in the set;
    the set is specified by its digram counts, horizon and size."""
    # counts.get() rather than counts[...], so that a word from outside
    # the set (with digrams the set has never seen) doesn't raise KeyError
    return h + sum([(n - (2 * counts.get(digram, 0))) for digram in abs(w)])

def horizon(counts):
    return sum(counts.itervalues())

def countdigrams(ws):
    """Returns a map from digram to the number of words in which that
    digram appears."""
    counts = {}
    for w in ws:
        for digram in abs(w):
            counts[digram] = counts.get(digram, 0) + 1
    return counts

def distance(w, ws):
    "Returns the total distance from the word to the words in the set."
    counts = countdigrams(ws)
    return distance_(w, counts, horizon(counts), len(ws))

def medium(ws):
    """Returns the word in the set with the least total distance to the
    other words."""
    counts = countdigrams(ws)
    h = horizon(counts)
    n = len(ws)
    return min([(distance_(w, counts, h, n), w) for w in ws])[1]

Note that this code calls abs a lot, so you'll want to memoise it. Also,
all of those list comprehensions could be replaced by generator
expressions, which would probably be faster - they certainly wouldn't
allocate as much memory; i'm on 2.3 at the moment, so i don't have
genexps.

I am ashamed to admit that i don't really understand how this code works.
I had a flash of insight into how the problem could be solved, wrote the
skeleton, then set to the details; by the time i'd finished with the
details, i'd forgotten the fundamental idea! I think it's something like
using the counts to represent the ensemble properties of the population of
words, which means measuring the total distance for each word is O(1).

> Aka:sum([medium-word for word in words])

I have no idea what you're trying to do here!

tom

-- 
I sometimes think that the IETF is one of the crown jewels in all of
western civilization. -- Tim O'Reilly
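
For what it's worth, the counting trick in the message above seems to rest
on the identity len(A ^ B) == len(A) + len(B) - 2 * len(A & B). Summed
over the n words in the set, the len(B) terms add up to the horizon h, and
the len(A & B) terms add up to the counts of the digrams in A, so the
total is h + sum(n - 2 * counts[d] for d in A). The following is only a
sketch that checks this against the pairwise version. It assumes Python 3
rather than the 2.3 mentioned in the post, and the names Clusterable,
brute_force_total, count_digrams and counted_total are made up for
illustration. It also folds in the memoised frozenset and the "^"
suggestion from the list of points above, and uses self[:1] instead of
self[0] so that an empty string doesn't raise.

```python
# Python 3 sketch; all names here are illustrative, not from the thread.

class Clusterable(str):
    """A word whose abs() is its immutable set of 2-grams, computed once."""

    def __abs__(self):
        # Memoise: build the 2-gram set on first use, then reuse it.
        try:
            return self._digrams
        except AttributeError:
            wrapped = self + self[:1]  # wrap around, as in the quoted class
            self._digrams = frozenset(
                wrapped[n:n + 2] for n in range(len(self)))
            return self._digrams

    def __sub__(self, other):
        # Digrams that appear in exactly one of the two words.
        return len(abs(self) ^ abs(other))


def brute_force_total(w, ws):
    """Total distance from w to every word in ws, one __sub__ per pair."""
    return sum(w - x for x in ws)


def count_digrams(ws):
    """Map each digram to the number of words in ws that contain it."""
    counts = {}
    for w in ws:
        for d in abs(w):
            counts[d] = counts.get(d, 0) + 1
    return counts


def counted_total(w, counts, h, n):
    """Same total as brute_force_total, using only the digram counts.

    h is the sum of all counts (the "horizon"), n the number of words.
    """
    return h + sum(n - 2 * counts.get(d, 0) for d in abs(w))


def medium(ws):
    """Word in ws with the least total distance; ties go to the smaller word."""
    counts = count_digrams(ws)
    h = sum(counts.values())
    n = len(ws)
    return min(ws, key=lambda w: (counted_total(w, counts, h, n), w))


# Usage: the two ways of computing the total agree for every word.
words = [Clusterable(s) for s in ("hello", "help", "hell", "yellow")]
counts = count_digrams(words)
h, n = sum(counts.values()), len(words)
for w in words:
    assert brute_force_total(w, words) == counted_total(w, counts, h, n)
```

Strictly speaking each counted_total call costs time proportional to the
length of the word rather than a constant, but it no longer depends on how
many words are in the set, which is where the saving comes from.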