On 10 December 2015 at 12:38, Spyros Charonis <s.charo...@gmail.com> wrote:
> Dear All,
>
> I am learning about analysis of algorithms (Python 2.7.6). I am reading a
> book (Problem Solving with Algorithms and Data Structures) where Python is
> the language used for implementations. The author introduces algorithm
> analysis in a clear and understandable way, and uses an anagram detection
> program as a template to compare different runtime implementations
> (quadratic, log-linear, linear). In the linear, and most efficient,
> implementation, the code is as follows (comments added by me):
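[The quoted code did not come through in this message. For context, a linear
count-based anagram check along the lines the book describes might look like
the sketch below -- my own reconstruction assuming lower-case ASCII input,
not the poster's exact code:]

```python
def anagram_linear(s1, s2):
    """Linear-time anagram check using letter-count tables.

    Assumes both strings contain only lower-case ASCII letters.
    """
    if len(s1) != len(s2):
        return False
    # One counter slot per letter 'a'..'z' for each string.
    c1 = [0] * 26
    c2 = [0] * 26
    for ch in s1:
        c1[ord(ch) - ord('a')] += 1
    for ch in s2:
        c2[ord(ch) - ord('a')] += 1
    # Anagrams if and only if every letter occurs equally often.
    return c1 == c2
```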
This is a good example problem for demonstrating the idea of computational
complexity, but having actually wanted to write a program that used anagram
detection recently, I concluded that quadratic performance can be acceptable,
or maybe even preferable, for this. Here is my own anagram function:

def anagram(s1, s2):
    seen = [False] * len(s2)
    for c1 in s1:
        for n2, c2 in enumerate(s2):
            if c1 == c2 and not seen[n2]:
                seen[n2] = True
                break
        else:
            return False
    else:
        return all(seen)

This has O(N**2) complexity, with N the string length. However, the real
average complexity depends on the distribution of strings. The loops above
only run to completion if the two strings are anagrams. Most of the time, if
you need to check whether two strings are anagrams, the probability that
they are is small. When two strings are not anagrams, the chances are
typically high that the first character of s1 is not in s2 at all. What that
means is that the most common runtime behaviour is to take the first
character c1 from s1 and then loop once through s2, verifying that none of
the characters in s2 is equal to c1. So in the most common case we have a
single iteration through just one of the strings, whereas the "linear"
complexity algorithm you showed will always loop through both strings fully.
There are other differences, such as not assuming lower-case ASCII
characters etc., but these are not so important to this point.

Of course, in practice in Python the above code is inefficient because it is
executed by the interpreter. The following is O(N*log(N)) but will probably
out-compete any of the examples shown:

def anagram(s1, s2):
    return sorted(s1) == sorted(s2)

The sorted function in Python is very efficient, so it will be hard to beat
its performance unless the strings really are long.
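[For comparison, the linear-time counting idea can also be written without
any lower-case ASCII assumption using collections.Counter -- my own sketch,
not something proposed in the thread:]

```python
from collections import Counter

def anagram_counter(s1, s2):
    # Counter builds a multiset of characters in O(N); comparing
    # the two multisets checks that every character occurs the
    # same number of times in both strings.
    return Counter(s1) == Counter(s2)
```

This works for any characters at all, including accented or non-Latin text,
though for short strings the sorted() version is likely still faster in
practice because of its lower constant overhead.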
This is why anagrams are also a poor example of algorithmic performance: in
most contexts where we might practically want to check for anagrams, the
strings involved are likely to be very short, meaning that big-O type
analysis is useless.

> My questions are: ...
>
> 2) How could I go about adapting this algorithm for multiple strings (say
> I had 20 strings and wanted to check if they are anagrams of one another)?
>
> def are_anagrams(*args):
>     """ Accepts a tuple of strings and checks if
>     they are anagrams of each other """
>
>     # Check that neither of strings are null
>     for i in args:
>         if not i:
>             raise TypeError, "Invalid input"
>             return None
>
>     # Initialize a list of counters for each string
>     c = ( [] for i in range(len(args) ) ???

The complexity problem here is that you want to avoid comparing all N
strings against each other. If you do, then you're comparing N*(N-1)/2
(quadratic) pairs of strings for anagram-ness. Peter suggested using tuples
of counts so that a dict can check them. I would use sorted string keys:

def anagrams(*words):
    bykey = {}
    for word in words:
        key = ''.join(sorted(word))
        if key not in bykey:
            bykey[key] = [word]
        else:
            bykey[key].append(word)
    return bykey.values()

--
Oscar
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor