On 05.09.2012 18:24, Steven D'Aprano wrote: > On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: > [...] >>> You are making unjustified assumptions about the distribution of >>> letters in the words. This might be a list of long chemical compounds >>> where the words typically differ only in their suffix. It might be a >>> list of people with titles: >> >> Actually, I'm not. I'm stating exactly what assumptions I'm making to >> get my calculation. I'm comparing *random* character strings or >> bitstrings. > > Excuse me, you are not. You are comparing English words which are highly > non-random.
Not in my original post. If you read it again, you will clearly see that I was talking about purely random strings. And since you like to nitpick, I'll clarify further: I'm talking about bitstrings in which every bit of every character has the same probability of occurence, 50%. You then replied by mentioning probability distributions of characters in different languages/encodings, to which I tried giving a example as it might (and does) occur in the real world, like sorting a dictionary. But the original point is still valid: Sorting of randomized bitstrings is definitely not O(N), you got that wrong. >> You, on the other hand, are making vague assumptions which you do not >> care for formalize and yet you claim that "the number of comparisons is >> equally likely to be 1, 2, 3, ..., N. The average then is". Without any >> explanation for this. At all. > > I will accept that my explanation was not good enough, but I strongly > disagree that I gave no explanation at all. What possible explanation could there be that comparison of purely random bitstrings yields an equal amount of comparisons for its length? >>> Herr Professor Frederick Schmidt >>> Herr Professor Frederick Wagner >>> ... >> >> Is your assumtion that we're comparing words that have the common prefix >> "Herr Professor Frederick "? > > No, I am pointing out that *your* assumption that most string comparisons > will halt close to the beginning of the string is an invalid assumption. > Your assumption only holds for some non-random strings. Definitely not. It *especially* holds true for random strings. For random strings with an alphabet of size n, the probability that the first character needs to be compared is n^0, i.e. 1. The probability that the second character needs to be compared is n^(-1). For the xth character, it is n^(-x). This is due to the lazy abort in comparison and it results in the average number of comparisons being extremely short no matter the bitlength n, or O(log n). >> it's counting the number of character comparisons. > > Correct. But only out of an extremely limited subset of all possible > string comparisons, namely those very few that happen when sorting lists > of English words using a very specific algorithm, namely timsort. Yes, but this was to look at a real-world example (in which way more comparisons are needed than in the random case). You first were talking about randomized bitstrings (and so was I), then you brought up character sets and languages (which, for the original problem, are entirely irrelevant). > Whatever figure you have calculated by taking this non-random selection, > it is irrelevant to the question of the average performance of string > equality given random strings. Correct. I gave the estimate for random strings in my very first post. > For the sake of simple calculations, let's pretend that there are only > 1000 five-letter strings possible. We want to know how many character- > comparisons are done on average when testing two random five-letter > strings for equality. To calculate this average, you must compare every > string to every other string, *including* itself. > > This gives 1000*1000 = one million equality tests to be performed. For > each equality test, we count the number of character comparisons > performed, sum all those counts, and divide by 1e6. That is the average > number of char comparisons for random strings of five letters. Easy enough. Since you didn't look at my original equation, here are some numbers and the program attached: 64 combinations for length 6 Char comparisons: 8064 Comparison cnt : 4096 Avg comparison : 1.97 128 combinations for length 7 Char comparisons: 32512 Comparison cnt : 16384 Avg comparison : 1.98 256 combinations for length 8 Char comparisons: 130560 Comparison cnt : 65536 Avg comparison : 1.99 512 combinations for length 9 Char comparisons: 523264 Comparison cnt : 262144 Avg comparison : 2.00 1024 combinations for length 10 Char comparisons: 2095104 Comparison cnt : 1048576 Avg comparison : 2.00 2048 combinations for length 11 Char comparisons: 8384512 Comparison cnt : 4194304 Avg comparison : 2.00 4096 combinations for length 12 Char comparisons: 33546240 Comparison cnt : 16777216 Avg comparison : 2.00 Hopefully now you'll see that your assumption O(n) is dead wrong. > But that's not what you do. First you eliminate 900 out of the 1000 > possible strings by only sampling English words -- strings like "xxoij" > cannot possibly be selected. So you are left with the highly non-random > subset of 10000 equality tests. This is *exactly* what my original calculation did. Enumerate all possibilites and divide by the total number. Regards, Johannes #!/usr/bin/python3 import itertools import random alphabet = [ c for c in "01" ] def cmpstr(s1, s2): c = 0 for i in range(len(s1)): c += 1 if s1[i] != s2[i]: break return c for strlength in range(1, 16): combinations = list(itertools.product(*([ alphabet ] * strlength))) assert(len(combinations) == len(alphabet) ** strlength) print("%d combinations for length %d" % (len(combinations), strlength)) compcnt = 0 comparisons = 0 for c1 in combinations: for c2 in combinations: comparisons += cmpstr(c1, c2) compcnt += 1 print("Char comparisons: %d" % (comparisons)) print("Comparison cnt : %d" % (compcnt)) print("Avg comparison : %.2f" % (comparisons / compcnt)) -- >> Wo hattest Du das Beben nochmal GENAU vorhergesagt? > Zumindest nicht öffentlich! Ah, der neueste und bis heute genialste Streich unsere großen Kosmologen: Die Geheim-Vorhersage. - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1...@speranza.aioe.org> -- http://mail.python.org/mailman/listinfo/python-list