Duncan, my comments below yours at end. ---YOURS--- The Counter approach only requires iterating over the letters once to construct the letters bag, then each word once to create the relevant word bag. After that it's (at worst) a couple of lookups and a comparison for each unique character in letters (for each word).
Your approach requires iteration over the words to create the lists of characters. Then they are (potentially) iterated over multiple times looking for the characters in letters. There's also the cost of removing items from arbitrary positions in the lists. Also, it seems that the character frequencies must be equal, and your approach only ensures that the words contain at least the required numbers of characters. In computational terms, if the first approach is something like O(n+m) for n letters and words of length m, your algorithm is more like O(nm). Not to say that it will be slower for all possible letters and dictionaries, but probably for any realistic cases and a lot slower for large enough dictionaries. Duncan --MINE--- I appreciate your analysis. I have not looked at the "counter" implementation and suspect it does some similar loops within, albeit it may be implemented in a compiled language like C++. I did not write out my algorithm in Python but have done it for myself. It runs fast enough with most of the time spent in the slow I/O part. We can agree all algorithms have to read in all the words in a data file. There may be ways to store the data such as in a binary tree and even ways to thus prune the search as once a node is reached where all required letters have been found, all further words qualify below that point. If you match say "instant" then instants and instantiation would be deeper in the tree and also qualify assuming extra letters are allowed. We may differ on the requirement as I think that the number of repeats for something like a,t,t require to be at least as big as in "attend" but that "attention" with yet another "t" would also be OK. If I am wrong, fine, but I note the person requesting this has admitted a certain lack of credentials while also claiming he made up a scenario just for fun. So this is not actually a particularly worthy challenge let alone with a purpose. My impression is that the average word length would be something small like 5-7. The number of words in a dictionary might be 100,000 or more. So if you want efficiency, where do you get the bang for the buck? I would argue that a simple test run on all the words might often narrow the field to a much smaller number of answers like just a few thousand or even much less. Say you test for the presence of "aeiou" in words, in whatever order. That might be done from reading a file and filtering out a relatively few potential answers. You can save those for second round to determine if they are fully qualified by any additional rules that may involve more expensive operations. How fast (or slow) are regular expressions for this purpose? Obviously it depends on complexity and something like "^[^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*$" would be easy to construct once but likely horribly inefficient in searching and a bit of overkill here. I suspect there is already some simple C function that could be used from within python that looks like findall(choices, word) that might return how many of the letters in choices were found in word and you simply comparer that to length(word) perhaps more efficiently. It looks easier to check if a character exists in one of the ways already discussed within python using a loop as discussed. Something as simple as this: needed = "aeiou" trying = "education" found = all([trying.find(each) >= 0 for each in needed ]) print(found) trying = "educated" found = all([trying.find(each) >= 0 for each in needed ]) print(found) The above prints My point is you can use the above to winnow down possible answers and only subject that smaller number to one of many tests including using a counter, making a dictionary you can query (albeit probably slower for fairly short words) and so on. Back to your other point, you suggest iterating over characters ( I happened to do it in a list) multiple times can result in duplication as the same characters are searched repeatedly. Sure. I simply note most words are short. How much overhead is there searching for say nine characters five times? Compare that to say creating a dictionary or other data structure once, and then making a hash out of each letter being searched for? I can envision many possible ways to speed this up but many involve more use of memory or all other kinds of overhead such as initializing an object and calling various member functions and then deleting it or re-initializing it. My suggestion of removing items from a short list is not ideal and depending on how a bag was otherwise implemented, I could see it being better but again, how much is saved if you have a nine letter word that may contain nine unique letters or even have repeated letters as in banana? You end up saving a copy of the letter (or a hashed version) plus some integer value (perhaps just a byte) for the count. I would need to see the implementation and do some speed tests, ... In real programming, there are many choices. This is a bit of a hypothetical but a part of me suggests the best answer for a quick prototype is something simple and not necessarily tuned for speed, just to show it can be done. Then, if the code is used often and is slow, sure, enhance it. Back in my UNIX days, I might have started with a rather inefficient pipeline for my example like: grep 'a' filename | grep 'e' | grep 'i' | grep 'o' | grep 'u' | ... That would produce a list of candidates, again not efficiently. I might have switched to using better regular expressions, perhaps using sed instead, or AWK or even PERL. But if it had to be fast production code, I might have used C then C++ or whatever latest tool I had. These days, some things run so fast that we get lazy. When doing a pipelined approach in R or Python that analyzes data, I often effectively have to apply a set of conditions to lots of data, one after another. Do you trim the number of columns in a data frame first or do you apply a condition that trims the number of rows first? Clearly if you need 5 columns out of 666, then chances are you narrow the columns so there is less useless copying in later parts of the pipeline. . But what if you have multiple conditions that each remove some rows and it depends on what the data looks like? For example, remove all rows in which one variable is NA and also all rows where another two variables are not equal to each other and so on. Isn't part of the consideration how expensive each comparison is? Imagine if you want to remove rows where a word in one column is not in the scrabble dictionary and your stupid function has to read it in each and every time! I would want that to be done last after cheaper methods removed say all words longer than 10 characters or that had uppercase or numeric parts or spaces or were NA. So if the functionality in the topic above was really being built, and the functionality was being called repeatedly as some game progressed, you definitely might want to speed it up, if nothing else, by reading the dictionary file into memory once, perhaps into a data structure like a modified binary tree which could be searched with tail recursion or short-cuts. And in some cases, you might even want odd things like to cache the results of earlier queries in a dictionary and skip a full repeat search entirely. You might argue this situation always gets called with different arguments. Perhaps but imagine writing a program where the computer plays against a human. The computer might look at every open region of a scrabble board and take the tiles already in their "hand" and add zero or more tiles indicating characters already on the game board that it wanted to extend into or beyond. Say it sees the letter "r" and wonders if it could place all letters immediately to the left of that "r" and then to see if the "r" could be in multiple places within the word formed and finally if it could make a word starting with "r". That could be multiple identical calls, perhaps only differing in the order of the characters it is searching for. And it might repeat similar calls in another part of the board that has a dangling "r", perhaps the same one it did not use before. Obviously, the goal here would be to find all potentially possible words it could actually fit into that slot so further analysis would happen and it would investigate which method provided more points. It may even look ahead to see if that move allowed the opponent a better-countermove and so on into deeper analyses. Much of the expense of deciding what to actually do would be outside the functionality described. Admittedly, I see no reason for it to actually ask for these words as described that could be longer, but only proper permutations of the letters. My point is I can envision places where caching earlier results would be potentially a great speedup. Heck, I suggest the problem could be turned around as the number of combinations might be computed and simply tested to see if they exist in a Python dictionary made from the external dictionary. Heck, why even write the main routine in an interpreted language when you can link in a function written to be even more efficient? But enhancements like that come with a cost. Would you pay a programmer an extra month to speed something like that up 5%? Maybe, but often not. And if you add in the typical costs in a large organization of various kinds of testing needed for a more complex or subtle approach, ... Avi -- https://mail.python.org/mailman/listinfo/python-list