On 26 November 2013 06:18, John O'Hagan <resea...@johnohagan.com> wrote:
>
> Thanks, that is exactly the type of thing I was after. Although it is
> actually slower as is than a hack like
>
> def imulticombs(it, n):
>     last = ()
>     for i in itertools.combinations(it, n):
>         if i > last:
>             yield i
>             last = i
>
> it can be modified to be a lot faster for my use-case like this:
>
> def _multicombs(prepend, words, r, chkstr):
>     """chkstr is the string of remaining available characters"""
>     if r == 0:
>         yield prepend, chkstr
>         return
>     (word, count, rem), *remaining = words
>     newstr = chkstr
>     for k in range(max(r - rem, 0), min(count, r) + 1):
>         for letter in word * k:
>             if letter in newstr:
>                 newstr = newstr.replace(letter, '', 1)
>             else:
>                 return
>         yield from _multicombs(prepend + (word,) * k, remaining, r - k,
>                                newstr)
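Your inner loop is effectively treating the pool of unused characters
as a multiset and abandoning a branch as soon as some letter is
over-demanded. Just to show the same availability test written with
collections.Counter (a sketch with my own names, not a drop-in
replacement for your _multicombs):

from collections import Counter

def letters_left(word, k, available):
    """Return the Counter of letters left after taking `word` k times
    out of `available` (a Counter of unused letters), or None if the
    branch is fruitless because some letter is over-demanded."""
    needed = Counter(word * k)
    if any(needed[c] > available[c] for c in needed):
        return None
    return available - needed

Counter subtraction keeps only positive counts, so the return value is
exactly the pool of letters that would be passed to the recursive call.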
Further micro-optimisations are possible. One is to inline the lower
recursion levels. For example, if the termination condition is checked
immediately before recursing you can eliminate a redundant generator
function call.

> By progressively checking against the available characters, it avoids
> fruitless recursion branches. I'm not 100% sure I'm doing this right yet
> but so far it's promising. This is the quickest non-redundant approach
> I've used so far (although strangely, the original product-only version
> which produces a lot of redundant results is still by far the quickest.)

Bear in mind that the interpreter does a lot of "redundant" things, so
what you might think of as non-redundant code can actually be highly
redundant once interpreter overhead is accounted for. Recently I've
been trying out PyPy in my own work and it is *really fast*. In fact it
is so much faster that I've given up on the idea of speed-optimising
code for CPython: just use PyPy instead - although it is still worth
optimising highly intensive code for PyPy.

> I'll try to explain better what I'm actually doing. It's quite long,
> but you seem to be wondering, or maybe someone else is!
[snip]

Okay, I understand now. I was confused because I think of anagrams as
being words of the same length. I didn't understand how your multiword
version works but I think I do now. In my own words: you want to
generate all sequences of English words having the same letters as some
input string, ignoring the spaces in both the input and output strings
(so that the number or length of the individual words need not be the
same).

[snip]
>
> For example, take the phrase "break beat". I make a wordlength
> dictionary of sub-alphabet words, using the system dictionary file,
> like this:
>
> {1: ['a'], 2: ['at', 'be', 're'], 3: ['are', 'ark', 'art', 'ate',
> 'baa', 'bar', 'bat', 'bee', 'bet', 'bra', 'ear', 'eat', 'ebb', 'eke',
> 'era', 'ere', 'eta', 'rat', 'tab', 'tar', 'tea', 'tee'], 4: ['abet',
> 'area', 'babe', 'bake', 'barb', 'bare', 'bark', 'bate', 'beak',
> 'bear', 'beat', 'beer', 'beet', 'beta', 'brat', 'rake', 'rate',
> 'reek', 'take', 'tare', 'teak', 'tear', 'tree', 'trek'], 5: ['abate',
> 'baker', 'beret', 'brake', 'break', 'eater', 'karat', 'kebab',
> 'taker'], 6: ['aerate', 'beaker', 'beater', 'berate', 'betake',
> 'karate', 'rebate', 'retake']}

Okay, how about the following (pseudoish-code)?

def word_sequences(prepend, target, subwords):
    # subwords is a list rather than a dict
    for n in range(1, len(subwords)):
        for word in subwords[n]:
            if equal_in_a_multiset_sense(word, target):
                yield prepend + ' ' + word
            elif is_a_submultiset(word, target):
                recurse_target = multiset_subtract(target, word)
                subsubwords = subwords[:len(recurse_target) + 1]
                if any(subsubwords):
                    yield from word_sequences(prepend + ' ' + word,
                                              recurse_target, subsubwords)

Your general approach seems reasonable to me now that I understand the
problem. The word_sequences generator above is the solution I would
probably write at first if faced with the problem; a runnable version
of it is sketched in the postscript below. I suspect that your current
approach using partition sizes is better though.

Oscar
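P.S. Here is the runnable version of word_sequences promised above. It
uses collections.Counter as the multiset, standing in for the helper
names in the pseudocode; the Counter representation and the calling
convention are my own assumptions, and I haven't tested it against a
real word list, so treat it as a sketch rather than a finished
implementation.

from collections import Counter

def word_sequences(prepend, target, subwords):
    # subwords[n] is the list of candidate words of length n
    # (index 0 is an unused placeholder), as in the pseudocode above.
    target_count = Counter(target)
    for n in range(1, len(subwords)):
        for word in subwords[n]:
            word_count = Counter(word)
            if word_count == target_count:
                # word uses up exactly the remaining letters
                yield prepend + ' ' + word
            elif not word_count - target_count:
                # word is a sub-multiset of the remaining letters
                remaining = target_count - word_count
                recurse_target = ''.join(sorted(remaining.elements()))
                subsubwords = subwords[:len(recurse_target) + 1]
                if any(subsubwords):
                    yield from word_sequences(prepend + ' ' + word,
                                              recurse_target, subsubwords)

Called as word_sequences('', 'breakbeat', subwords), with subwords
built from the word-length dictionary above (an empty list at index 0,
then the lists for lengths 1-6), it should eventually yield strings
such as ' beat break', each with a leading space for the caller to
strip.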