On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote: > On 04.09.2012 04:17, Steven D'Aprano wrote: > >> On average, string equality needs to check half the characters in the >> string. > > How do you arrive at that conclusion?
Take two non-empty strings of the same length, N. If the strings are equal, you have to make N comparisons to be sure they are equal (you check all N pairs of characters). If they are unequal, you have to check each pair of characters up to the first pair that are different: def string_equality(astr, bstr): # Assumes lengths are equal. for achr, bchr in zip(astr, bstr): if achr != bchr: return False return True For the unequal case, how many comparisons do you do? Ahead of time, we know very little about the strings. We don't even know how many possible characters there are (are they English alphanumeric, ASCII, Greek, Thai, or from the full range of 1114111 Unicode code points?) or what their probability distribution is. A reasonable, conservative assumption is to calculate the largest possible value of the average for random strings. That largest value occurs when the alphabet is as small as possible, namely two characters. In practice, strings come from a larger alphabet, up to 1114111 different characters for full Unicode strings, so the average for them will be less than the average we calculate now. So for unequal strings, the number of comparisons is equally likely to be 1, 2, 3, ..., N. The average then is: sum([1, 2, 3, ..., N])/N which by a bit of simple algebra works out to be (N+1)/2, or half the characters as I said. (Note that this average assumes the strings are completely random. In practice, strings tend to come from strongly non-uniform distributions of characters. For instance, in English texts, 'e' is much more likely than 'q' and most characters are vanishingly rare, and in practice, strings are likely to be highly non-random.) -- Steven -- http://mail.python.org/mailman/listinfo/python-list