Re: Abuse of subject, was Re: Abuse of Big Oh notation
On Tuesday 21 August 2012 09:52:09 UTC+2, Peter Otten wrote:
> wxjmfa...@gmail.com wrote:
>> By chance and luckily, first attempt.
>>
>> c:\python32\python -m timeit "('€'*100+'€'*100).replace('€', 'œ')"
>> 100 loops, best of 3: 1.48 usec per loop
>> c:\python33\python -m timeit "('€'*100+'€'*100).replace('€', 'œ')"
>> 10 loops, best of 3: 7.62 usec per loop
>
> OK, that is roughly factor 5. Let's see what I get:
>
> $ python3.2 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
> 10 loops, best of 3: 1.8 usec per loop
> $ python3.3 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
> 1 loops, best of 3: 9.11 usec per loop
>
> That is factor 5, too. So I can replicate your measurement on an AMD64
> Linux system with self-built 3.3 versus system 3.2.
>
>> Note
>> The used characters are not members of the latin-1 coding
>> scheme (btw an *unusable* coding).
>> They are however characters in cp1252 and mac-roman.
>
> You seem to imply that the slowdown is connected to the inability of
> latin-1 to encode "œ" and "€" (to take the examples relevant to the
> above microbench). So let's repeat with latin-1 characters:
>
> $ python3.2 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
> 10 loops, best of 3: 1.76 usec per loop
> $ python3.3 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
> 1 loops, best of 3: 10.3 usec per loop
>
> Hm, the slowdown is even a tad bigger. So we can safely dismiss your
> theory that an unfortunate choice of the 8-bit encoding is causing it.
> Do you agree?

- I do not care too much about the numbers. It's an attempt to show the
principles.

- The reason I consider latin-1 a bad coding is simply that it is
unusable for some scripts/languages. That mainly has to do with the
coding of source/text files, which is not really the point here.

- Now, the technical aspect. This "coding" (latin-1) may be considered
somehow as the pseudo-coding covering the Unicode code point range
128..255. Unfortunately, this "coding" is not very optimal (or can be
seen as such) when you work with the full range of Unicode, but it is
fine when one works only in pure latin-1, with only 256 characters.

This range 128..255 is always the critical part (all codings
considered), and probably represents the most used characters.

I hope it was not too confusing. I have no proof for my theory, but with
my experience in that field, I highly suspect this as the bottleneck.

Same OS as before.

Py 3.2.3
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[1.5384088242603358, 1.532421642233382, 1.5327445924545433]
>>> timeit.repeat("('ä'*100+'ä'*100).replace('ä', 'ß')")
[1.561762063667686, 1.5443503206462594, 1.5458670051605168]

3.3.0b2
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[7.701523104134512, 7.720358191179441, 7.614549852683501]
>>> timeit.repeat("('ä'*100+'ä'*100).replace('ä', 'ß')")
[4.887939423990709, 4.868787294350611, 4.865697999795991]

Quite mysterious! In any case it is a regression.

jmf
--
http://mail.python.org/mailman/listinfo/python-list
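[Ed.: the storage-width theory above can be probed directly. The sketch
below assumes CPython 3.3+ with PEP 393 flexible string storage;
`char_width` is an ad hoc helper, not a standard function.]

```python
import sys

# On a PEP 393 build, each str is stored with a fixed per-character
# width of 1, 2 or 4 bytes, chosen from the widest code point present.
# Doubling a string adds len(s) characters, so the size delta divided
# by len(s) reveals the storage width actually in use.

def char_width(s):
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)

print(char_width("ä" * 100))  # 1 -- fits the latin-1 (128..255) range
print(char_width("€" * 100))  # 2 -- BMP character outside latin-1
```

This is why the "ä"/"ß" and "€"/"œ" benchmarks exercise different code
paths: the former strings are stored one byte per character, the latter
two bytes per character.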
Re: Abuse of Big Oh notation
On Mon, 20 Aug 2012 11:12:22 -0600, Ian Kelly wrote:
> On Mon, Aug 20, 2012 at 10:09 AM, Chris Angelico wrote:
>> On Tue, Aug 21, 2012 at 2:01 AM, Paul Rubin wrote:
>>> Analogy: how big a box is required to hold a pair of shoes? In a
>>> purely theoretical sense we might say O(S) where S is the shoe size,
>>> treating shoe size as an arbitrary independent variable. But in the
>>> real world, shoe size is controlled by the size of human feet, which
>>> is bounded by a constant for biological reasons. You don't have to
>>> consider shoes the size of Jupiter. So it is O(1).
>>
>> By that argument, everything is amortized O(1), because there's a
>> limit on every variable. You can't possibly be working with a data
>> set greater than the total sum of storage space in the entire world.
>> That still doesn't mean that bubble sort and heap sort are
>> equivalently efficient.
>
> The difference between the two is that the former is bounded by a
> constant that is fundamental to the algorithm at hand, whereas the
> latter is bounded only by available resources. By any practical
> consideration the latter must be considered a variable, but the former
> need not be.
>
> Paul discusses above the asymptotic growth of a variable as O(S) where
> S is shoe size, but really this measure makes no sense in the first
> place. The question that this notation seeks to answer is, "What is
> the big-Oh behaviour of this variable as shoe size increases
> indefinitely (i.e. to infinity)?" and the answer in this case is "Shoe
> size does not increase indefinitely"; the question is invalid.
>
> A more logical question might be, "How much material do I need to
> construct N shoes of size S?" The answer to this question would
> presumably be some constant factor of N * S**2, which is O(N * S**2).
> Although N can be assumed to vary freely (up to nonsensical quantities
> like the mass of the entire universe), S is clearly bounded by the
> constraints of actual shoes, so we can safely treat S as a constant
> and call it O(N).

Why the hell are we talking about shoe sizes? Let's not lose sight of
the actual question at hand: how expensive is it to fetch a character at
some arbitrary index in a string?

With fixed-width characters, average time is O(1), best-case time is
O(1) and worst-case time is O(1).

With non-fixed-width characters, average time and worst-case time are
both O(N), and best-case ("give me the character at index 0") is O(1).
But that's a degenerate case that gives us absolutely no insight into
the cost of the operation. It's like the observation that "sorting a
list of one item takes constant time, regardless of the algorithm". Um,
yes, it does. So what?

Invented constraints like "people only care about indexes really close
to zero, or to the end of the string" effectively confuse the best
possible case for the average case. In real life, people do care about
random access to strings, and the average case is more important than
the best case. That's why strings are typically implemented as arrays of
fixed-width characters rather than linked lists, and why Haskell (which
doesn't do so) explicitly warns you that your string-handling code is
likely to be slow.

Of course, you can swap space and complexity for time, e.g. ropes, or
use cached pointers to character locations in the hope that indexes will
be clustered rather than scattered randomly. These more complex
implementations typically use as much memory as, and often perform no
better than, a dead-simple implementation that uses a plain array of
fixed-width characters. Most harmfully, it means that a simple indexing
operation may be fast on average yet occasionally degenerate to slow.

The only thing worse than "This program is slow" is "This program is
*occasionally* slow, and I can't predict when".

Not surprisingly, almost all programming languages in common usage use
arrays of fixed-width characters as their native string type. Python 3.3
will now do the same thing, with the memory optimization that the fixed
width is chosen per string, from 1, 2, or 4 bytes according to the
minimum width needed, rather than between 2 and 4 bytes when you build
the Python compiler. This is a big step forward with a pretty simple
algorithm, and it will strongly encourage the use of Unicode by taking
much of the sting out of the perennial complaint that "Unicode wastes
space". Not so wasteful any more.

--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
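[Ed.: the fixed-width versus variable-width cost difference described
above can be sketched with two toy index functions. These are
hypothetical helpers, not CPython's actual implementation.]

```python
# Fixed-width storage: one multiplication and a slice, O(1).
def index_fixed(buf, i, width):
    return buf[i * width:(i + 1) * width]

# Variable-width (UTF-8) storage: must scan from the start, O(N).
# UTF-8 lead bytes are those NOT of the form 0b10xxxxxx.
def index_utf8(data, i):
    count = -1
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # start of a character
            count += 1
            if count == i:
                end = pos + 1            # find where this character ends
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[pos:end].decode("utf-8")
    raise IndexError(i)

s = "aä€b"
assert index_utf8(s.encode("utf-8"), 2) == "€"
assert index_fixed(s.encode("utf-32-le"), 2, 4).decode("utf-32-le") == "€"
```

Both return the right character; the difference is that `index_utf8`
touches every byte before the target, which is the O(N) average and
worst case discussed above.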
Abuse of subject, was Re: Abuse of Big Oh notation
wxjmfa...@gmail.com wrote:
> By chance and luckily, first attempt.
>
> c:\python32\python -m timeit "('€'*100+'€'*100).replace('€', 'œ')"
> 100 loops, best of 3: 1.48 usec per loop
> c:\python33\python -m timeit "('€'*100+'€'*100).replace('€', 'œ')"
> 10 loops, best of 3: 7.62 usec per loop

OK, that is roughly factor 5. Let's see what I get:

$ python3.2 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
10 loops, best of 3: 1.8 usec per loop
$ python3.3 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
1 loops, best of 3: 9.11 usec per loop

That is factor 5, too. So I can replicate your measurement on an AMD64
Linux system with self-built 3.3 versus system 3.2.

> Note
> The used characters are not members of the latin-1 coding
> scheme (btw an *unusable* coding).
> They are however characters in cp1252 and mac-roman.

You seem to imply that the slowdown is connected to the inability of
latin-1 to encode "œ" and "€" (to take the examples relevant to the
above microbench). So let's repeat with latin-1 characters:

$ python3.2 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
10 loops, best of 3: 1.76 usec per loop
$ python3.3 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
1 loops, best of 3: 10.3 usec per loop

Hm, the slowdown is even a tad bigger. So we can safely dismiss your
theory that an unfortunate choice of the 8-bit encoding is causing it.
Do you agree?
--
http://mail.python.org/mailman/listinfo/python-list
Re: Abuse of Big Oh notation
In article , wxjmfa...@gmail.com wrote:
> Note
> The used characters are not members of the latin-1 coding
> scheme (btw an *unusable* coding).
> They are however characters in cp1252 and mac-roman.

mac-roman is an obsolete encoding that was used in MacOS 9 and MacOS
Classic systems of previous decades. Except in a very small and
shrinking number of legacy applications, it isn't used in modern systems
and hasn't been widely used for a long time. MacOS X systems generally
use standard Unicode encodings, usually UTF-8, for external
representations.

Forget about mac-roman; it is not relevant.

--
Ned Deily, n...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list
Re: Abuse of Big Oh notation
On Tuesday, 21 August 2012 at 03:29:12 UTC+8, Paul Rubin wrote:
> Ian Kelly writes:
>> The difference between the two is that the former is bounded by a
>> constant that is fundamental to the algorithm at hand,... S is
>> clearly bounded by the constraints of actual shoes, so we can safely
>> treat S as a constant and call it O(N).
>
> Thanks, that is a good explanation of what I was trying to get at. One
> quibble: in the parsing example, the constant is inherent in the
> distribution of input data one expects to normally encounter, rather
> than in the algorithm itself. It's sort of like saying dictionary
> access (based on hashing) is O(1), based on the variety of input data
> that is normally used. There is such a thing as pathological
> (e.g. malicious) input data that causes a lot of hash collisions,
> making access O(n) in the size of the dictionary.
>
> I suppose abnormal input should also be taken into account in examples
> involving parsing, if one parses potentially hostile data.

OK, so the hash function has to be seeded with some randomization, drawn
from data-independent factors, when the hash table is created. Also, for
items hashed to the same key, once a collision chain reaches a length of
16 or so, I think an insertion sort or a heap sort is good for keeping
the colliding items in a sorted list (for chains of 16 to 256 items or
even larger). Such chain lengths indicate that the hash function should
be changed from time to time, on the occasions when the table is resized
because it is under many item insertions and deletions, operated much
more frequently than item searches in the same period.
--
http://mail.python.org/mailman/listinfo/python-list
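[Ed.: the pathological-collision case quoted above is easy to
demonstrate: a key type whose hash is constant forces every insertion to
walk the whole collision chain. A toy sketch; the class and function
names are illustrative only, not from any library.]

```python
import time

class BadKey:
    """Key whose hash is constant: every instance collides."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42            # all keys land in the same hash bucket
    def __eq__(self, other):
        return self.n == other.n

class GoodKey(BadKey):
    """Same key, but with a well-distributed hash."""
    def __hash__(self):
        return hash(self.n)

def fill(key_type, n=2000):
    """Time the construction of a dict with n keys of the given type."""
    t0 = time.perf_counter()
    d = {key_type(i): i for i in range(n)}
    return time.perf_counter() - t0, d

t_bad, d_bad = fill(BadKey)      # quadratic: each insert walks the chain
t_good, d_good = fill(GoodKey)   # linear: amortized O(1) per insert
print("colliding: %.4f s  well-distributed: %.4f s" % (t_bad, t_good))
```

Hash randomization (CPython's `-R`, default since 3.3) makes such
collisions hard for an attacker to construct from string input, which is
exactly the seeding idea described above.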
Re: Abuse of Big Oh notation
Ian Kelly writes:
> The difference between the two is that the former is bounded by a
> constant that is fundamental to the algorithm at hand,... S is
> clearly bounded by the constraints of actual shoes, so we can safely
> treat S as a constant and call it O(N).

Thanks, that is a good explanation of what I was trying to get at. One
quibble: in the parsing example, the constant is inherent in the
distribution of input data one expects to normally encounter, rather
than in the algorithm itself. It's sort of like saying dictionary access
(based on hashing) is O(1), based on the variety of input data that is
normally used. There is such a thing as pathological (e.g. malicious)
input data that causes a lot of hash collisions, making access O(n) in
the size of the dictionary.

I suppose abnormal input should also be taken into account in examples
involving parsing, if one parses potentially hostile data.
--
http://mail.python.org/mailman/listinfo/python-list
Re: Abuse of Big Oh notation
By chance and luckily, first attempt.

IDLE, Windows 7.0 Pro 32, Pentium Dual Core 2.6, RAM 2 GB

Py 3.2.3
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[1.6939567134893707, 1.672874290786993, 1.6761219212298073]

Py 3.3.0b2
>>> timeit.repeat("('€'*100+'€'*100).replace('€', 'œ')")
[7.924470733910917, 7.8554985620787345, 7.878623849091914]

Console
c:\python32\python -m timeit "('€'*100+'€'*100).replace('€', 'œ')"
100 loops, best of 3: 1.48 usec per loop
c:\python33\python -m timeit "('€'*100+'€'*100).replace('€', 'œ')"
10 loops, best of 3: 7.62 usec per loop

Note
The used characters are not members of the latin-1 coding scheme (btw an
*unusable* coding). They are however characters in cp1252 and mac-roman.

jmf
--
http://mail.python.org/mailman/listinfo/python-list
Re: Abuse of Big Oh notation
On Mon, Aug 20, 2012 at 10:09 AM, Chris Angelico wrote:
> On Tue, Aug 21, 2012 at 2:01 AM, Paul Rubin wrote:
>> Analogy: how big a box is required to hold a pair of shoes? In a
>> purely theoretical sense we might say O(S) where S is the shoe size,
>> treating shoe size as an arbitrary independent variable. But in the
>> real world, shoe size is controlled by the size of human feet, which
>> is bounded by a constant for biological reasons. You don't have to
>> consider shoes the size of Jupiter. So it is O(1).
>
> By that argument, everything is amortized O(1), because there's a
> limit on every variable. You can't possibly be working with a data set
> greater than the total sum of storage space in the entire world. That
> still doesn't mean that bubble sort and heap sort are equivalently
> efficient.

The difference between the two is that the former is bounded by a
constant that is fundamental to the algorithm at hand, whereas the
latter is bounded only by available resources. By any practical
consideration the latter must be considered a variable, but the former
need not be.

Paul discusses above the asymptotic growth of a variable as O(S) where S
is shoe size, but really this measure makes no sense in the first place.
The question that this notation seeks to answer is, "What is the big-Oh
behaviour of this variable as shoe size increases indefinitely (i.e. to
infinity)?" and the answer in this case is "Shoe size does not increase
indefinitely"; the question is invalid.

A more logical question might be, "How much material do I need to
construct N shoes of size S?" The answer to this question would
presumably be some constant factor of N * S**2, which is O(N * S**2).
Although N can be assumed to vary freely (up to nonsensical quantities
like the mass of the entire universe), S is clearly bounded by the
constraints of actual shoes, so we can safely treat S as a constant and
call it O(N).
--
http://mail.python.org/mailman/listinfo/python-list
Re: Abuse of Big Oh notation
On 20 August 2012 17:01, Paul Rubin wrote:
> Oscar Benjamin writes:
>> No it doesn't. It is still O(k). The point of big O notation is to
>> understand the asymptotic behaviour of one variable as it becomes
>> large because of changes in other variables.
>
> Actually, two separate problems got mixed together late at night. In
> neither case is k an independent variable that ranges over all
> possible values. In both cases it is selected or observed by
> measurement (i.e. it is a dependent variable determined by something
> that is itself not independent).
>
> 1) Access in a rope: here, k is basically determined by the pointer
> size of the computer; in CPython (the implementation we're discussing)
> the pointer size is 4 or 8 bytes (constants) in all instances AFAIK. k
> should be big enough that the pointer and allocation overhead is small
> compared to bloating the strings with UCS-2 or UCS-4, and small enough
> to not add much scan time. It seems realistic to say k<=128 for this
> (several times smaller is probably fine). 128 is of course a constant
> and not a variable. We are not concerned about hypothetical computers
> with billion-bit pointers.

Okay, I see what you mean. If k is a hard-coded constant then it's not
unreasonable to say that O(k) is constant time in relation to the input
data (however big k is).

Oscar.
--
http://mail.python.org/mailman/listinfo/python-list
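[Ed.: the rope-like access under discussion can be sketched as a chunked
string with a hard-coded chunk size k. This is a toy model, not
CPython's string type or any real rope library.]

```python
K = 128  # the small hard-coded constant discussed in the thread

class ChunkedString:
    """Text stored as a list of chunks of at most k characters."""
    def __init__(self, s, k=K):
        self.k = k
        self.chunks = [s[j:j + k] for j in range(0, len(s), k)]

    def __getitem__(self, i):
        # Locating index i is one division and two lookups, regardless
        # of total string length -- constant time for a fixed k.
        return self.chunks[i // self.k][i % self.k]

s = "spam and eggs " * 40
cs = ChunkedString(s)
assert all(cs[i] == s[i] for i in range(len(s)))
```

With k fixed at compile time, the per-access cost is indeed a constant;
the thread's disagreement is over whether k may be treated that way in
big-O terms.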
Re: Abuse of Big Oh notation
On Tue, Aug 21, 2012 at 2:01 AM, Paul Rubin wrote:
> Analogy: how big a box is required to hold a pair of shoes? In a
> purely theoretical sense we might say O(S) where S is the shoe size,
> treating shoe size as an arbitrary independent variable. But in the
> real world, shoe size is controlled by the size of human feet, which
> is bounded by a constant for biological reasons. You don't have to
> consider shoes the size of Jupiter. So it is O(1).

By that argument, everything is amortized O(1), because there's a limit
on every variable. You can't possibly be working with a data set greater
than the total sum of storage space in the entire world. That still
doesn't mean that bubble sort and heap sort are equivalently efficient.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list
Re: Abuse of Big Oh notation
Oscar Benjamin writes:
> No it doesn't. It is still O(k). The point of big O notation is to
> understand the asymptotic behaviour of one variable as it becomes
> large because of changes in other variables.

Actually, two separate problems got mixed together late at night. In
neither case is k an independent variable that ranges over all possible
values. In both cases it is selected or observed by measurement (i.e. it
is a dependent variable determined by something that is itself not
independent).

1) Access in a rope: here, k is basically determined by the pointer size
of the computer; in CPython (the implementation we're discussing) the
pointer size is 4 or 8 bytes (constants) in all instances AFAIK. k
should be big enough that the pointer and allocation overhead is small
compared to bloating the strings with UCS-2 or UCS-4, and small enough
to not add much scan time. It seems realistic to say k<=128 for this
(several times smaller is probably fine). 128 is of course a constant
and not a variable. We are not concerned about hypothetical computers
with billion-bit pointers.

2) Parsing tokens: here, k is drawn from a fixed probability
distribution such that its expectation value is again a small constant
(this is an assertion about what the data looks like in typical parsing
applications in the real world, not about what is mathematically
possible). The average-case complexity (I said "amortized" earlier but
should have said "average-case") is proportional to that constant, which
is to say, O(1). Of course there is still more wiggle room about what is
"typical".

Analogy: how big a box is required to hold a pair of shoes? In a purely
theoretical sense we might say O(S) where S is the shoe size, treating
shoe size as an arbitrary independent variable. But in the real world,
shoe size is controlled by the size of human feet, which is bounded by a
constant for biological reasons. You don't have to consider shoes the
size of Jupiter. So it is O(1).
-- http://mail.python.org/mailman/listinfo/python-list
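[Ed.: point 2 above can be illustrated numerically with toy data, not
taken from any real parser: if token lengths follow a fixed
distribution, the mean token length, and hence the expected per-token
scan cost, is a constant that does not grow with the amount of text.]

```python
# Parse 1000 copies of a sentence; the mean token length is the same
# constant however many copies we process.
text = "the quick brown fox jumps over the lazy dog " * 1000
tokens = text.split()
mean_k = sum(map(len, tokens)) / len(tokens)
print(round(mean_k, 2))  # ~3.89, independent of how much text is parsed
```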
Re: Abuse of Big Oh notation
On Sun, 19 Aug 2012 16:42:03 -0700, Paul Rubin wrote:
> Steven D'Aprano writes:
>> Of course *if* k is constant, O(k) is constant too, but k is not
>> constant. In context we are talking about string indexing and
>> slicing. There is no value of k, say, k = 2, for which you can say
>> "People will sometimes ask for string[2] but never ask for
>> string[3]". That is absurd.
>
> The context was parsing, e.g. recognizing a token like "a" or "foo" in
> a human-written chunk of text. Occasionally it might be
> "sesquipedalian" or some even worse outlier, but one can reasonably
> put a fixed and relatively small upper bound on the expected value of
> k. That makes the amortized complexity O(1), I think.

No it doesn't. It is still O(k). The point of big O notation is to
understand the asymptotic behaviour of one variable as it becomes large
because of changes in other variables.

If k is small then you can often guess that O(k) is small. To say that
an operation is O(k), however, is a statement about what happens when k
is big (and it is not refuted by saying that k is typically not big).

Oscar
--
http://mail.python.org/mailman/listinfo/python-list
Re: Abuse of Big Oh notation
Steven D'Aprano writes:
> Of course *if* k is constant, O(k) is constant too, but k is not
> constant. In context we are talking about string indexing and slicing.
> There is no value of k, say, k = 2, for which you can say "People will
> sometimes ask for string[2] but never ask for string[3]". That is
> absurd.

The context was parsing, e.g. recognizing a token like "a" or "foo" in a
human-written chunk of text. Occasionally it might be "sesquipedalian"
or some even worse outlier, but one can reasonably put a fixed and
relatively small upper bound on the expected value of k. That makes the
amortized complexity O(1), I think.
--
http://mail.python.org/mailman/listinfo/python-list
Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()]
On Sun, 19 Aug 2012 10:48:06 -0700, Paul Rubin wrote:
> Terry Reedy writes:
>> I would call it O(k), where k is a selectable constant. Slowing
>> access by a factor of 100 is hardly acceptable to me.
>
> If k is constant then O(k) is the same as O(1). That is how O notation
> works.

You might as well say that if N is constant, O(N**2) is constant too,
and just like magic you have now made Bubble Sort a constant-time sort
function! That's not how it works.

Of course *if* k is constant, O(k) is constant too, but k is not
constant. In context we are talking about string indexing and slicing.
There is no value of k, say, k = 2, for which you can say "People will
sometimes ask for string[2] but never ask for string[3]". That is
absurd.

Since k can vary from 0 to N-1, we can say that the average string index
lookup is k = (N-1)//2, which clearly depends on N.

--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
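[Ed.: the closing arithmetic as a toy calculation.]

```python
# If every index 0..N-1 is equally likely, the average index requested
# grows linearly with N, so a lookup whose cost is proportional to the
# index is O(N) on average -- not O(1).

def average_index(n):
    return sum(range(n)) / n   # equals (n - 1) / 2

print(average_index(100))    # 49.5
print(average_index(1000))   # 499.5
```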
On Sun, 19 Aug 2012 10:48:06 -0700, Paul Rubin wrote: > Terry Reedy writes: >> I would call it O(k), where k is a selectable constant. Slowing access >> by a factor of 100 is hardly acceptable to me. > > If k is constant then O(k) is the same as O(1). That is how O notation > works. You might as well say that if N is constant, O(N**2) is constant too and just like magic you have now made Bubble Sort a constant-time sort function! That's not how it works. Of course *if* k is constant, O(k) is constant too, but k is not constant. In context we are talking about string indexing and slicing. There is no value of k, say, k = 2, for which you can say "People will sometimes ask for string[2] but never ask for string[3]". That is absurd. Since k can vary from 0 to N-1, we can say that the average string index lookup is k = (N-1)//2 which clearly depends on N. -- Steven -- http://mail.python.org/mailman/listinfo/python-list