On Wed, 05 Sep 2012 22:47:14 +0000, Oscar Benjamin wrote:

> In news.gmane.comp.python.general, you wrote:
>> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
[...]
>>>> You are making unjustified assumptions about the distribution of
>>>> letters in the words. This might be a list of long chemical
>>>> compounds where the words typically differ only in their suffix.
>>>> It might be a list of people with titles:
>>>
>>> Actually, I'm not. I'm stating exactly what assumptions I'm making
>>> to get my calculation. I'm comparing *random* character strings or
>>> bitstrings.
>>
>> Excuse me, you are not. You are comparing English words which are
>> highly non-random.
>
> Evidently we have different understandings of what 'random' means. I
> don't think it's unreasonable to say that strings drawn uniformly
> from the set of all strings in the English language (having a given
> number of characters) is random. The distribution is not uniform
> over the set of all possible character strings but it is still
> random. I think Johannes deliberately chose these strings to emulate
> a particular kind of 'real' distribution of strings that might occur
> in practise.
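To make the two senses of "random" concrete: drawing uniformly from
all character strings of a given length is a very different
distribution from drawing uniformly from an English word list. A
rough sketch in Python (the tiny word list below is just a
placeholder for a real dictionary such as /usr/share/dict/words):

import random
import string

def uniform_random_string(n):
    # Uniform over ALL length-n strings of lowercase letters: each
    # character is chosen independently, so long shared prefixes
    # between two samples are vanishingly rare.
    return ''.join(random.choice(string.ascii_lowercase)
                   for _ in range(n))

# Placeholder vocabulary; a real test would load a proper word list.
WORDS = ["hello", "world", "python", "random", "string", "prefix"]

def uniform_random_word():
    # Uniform over the vocabulary, but decidedly non-uniform over
    # the set of all possible character strings of those lengths.
    return random.choice(WORDS)

Both samplers are "random", but they induce very different
probabilities of two strings sharing a long common prefix, which is
exactly what the running time of an equality test depends on.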
Of course for some "real" applications, the strings being compared
will be English words. And for other applications, they will be
strings of numeric digits uniformly distributed between 20000 and
30000. Or taken from a Gaussian distribution centered around 7000. Or
strings made up of deeply nested sets of parentheses (((((( ...
)))))).

Whichever special distribution of strings you pick, I don't think
calling them "random" is terribly useful in context, even if they are
generated randomly from some non-uniform distribution.

But you know, it really doesn't make a difference. Equality testing
will *still* be O(N), since the asymptotic behaviour for sufficiently
large string comparisons is bounded by O(N). Multiplicative constants
("half the string" versus "0.001 of the string") do not matter.

I may have been overly conservative earlier when I said that on
average string equality has to compare half the characters. I thought
I had remembered that from a computer science textbook, but I can't
find that reference now, so possibly I was thinking of something
else. (String searching, perhaps?) In any case, the *worst* case for
string equality testing is certainly O(N) (every character must be
looked at), and the *best* case is obviously O(1) (the first
character fails to match). But I'm not so sure about the average
case. Further thought is required.
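The average case is easy to probe empirically, though. Here is a
minimal, unscientific sketch (the function names and toy
distributions are mine) that counts character comparisons made by a
naive equality test under two different distributions:

import random
import string

def count_comparisons(a, b):
    # Naive character-by-character equality test, instrumented to
    # count comparisons. Assumes len(a) == len(b); a real
    # implementation would short-circuit on unequal lengths first.
    comparisons = 0
    for x, y in zip(a, b):
        comparisons += 1
        if x != y:
            break
    return comparisons

def random_string(n):
    return ''.join(random.choice(string.ascii_lowercase)
                   for _ in range(n))

N, TRIALS = 1000, 1000

# Uniformly random strings: a mismatch almost always occurs within
# the first character or two.
uniform = sum(count_comparisons(random_string(N), random_string(N))
              for _ in range(TRIALS)) / TRIALS

# Strings differing only in the final character: the test must scan
# the full length every time.
prefix = "a" * (N - 1)
shared = count_comparisons(prefix + "b", prefix + "c")

print("uniform random:", uniform)  # roughly 26/25, i.e. about 1.04
print("shared prefix: ", shared)   # exactly N, i.e. 1000

With a 26-letter alphabet the expected count for the uniform case is
sum(1/26**i for i in range(N)), about 26/25 ≈ 1.04, so for that
distribution the average is effectively O(1); the shared-prefix case
scans all N characters every time. Which average you get depends
entirely on the distribution, which is rather the point.

-- 
Steven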