Re: unicode() vs. s.decode()
On Sat, 08 Aug 2009 19:00:11 +0200, Thorsten Kampe wrote:

>> I was running it one million times to mitigate influences on the timing
>> by other background processes which is a common technique when
>> benchmarking.
>
> Err, no. That is what "repeat" is for and it defaults to 3 ("This means
> that other processes running on the same computer may interfere with
> the timing. The best thing to do when accurate timing is necessary is
> to repeat the timing a few times and use the best time. [...] the
> default of 3 repetitions is probably enough in most cases.")

It's useful to look at the timeit module to see what the author(s)
think. Let's start with the repeat() method. In the Timer docstring:

"The repeat() method is a convenience to call timeit() multiple times
and return a list of results."

and the repeat() method's own docstring:

"This is a convenience function that calls the timeit() repeatedly,
returning a list of results. The first argument specifies how many
times to call timeit(), defaulting to 3; the second argument specifies
the number argument, defaulting to one million."

So it's quite obvious that the module author(s), and possibly even Tim
Peters himself, consider repeat() to be a mere convenience method.
There's nothing you can do with repeat() that can't be done with the
timeit() method itself.

Notice that both repeat() and timeit() methods take an argument to
specify how many times to execute the code snippet. Why not just
execute it once? The module doesn't say, but the answer is a basic
measurement technique: if your clock is accurate to (say) a
millisecond, and you measure a single event as taking a millisecond,
then your relative error is roughly 100%. But if you time 1000 events,
and measure the total time as 1 second, the relative error is now 0.1%.
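[Editorial sketch] The two knobs described above map directly onto the timeit API: `number` amortizes clock granularity over many executions, while `repeat` takes several independent timings so interference from other processes can be discarded by taking the minimum. A minimal illustration (Python 3 syntax, so `bytes.decode` stands in for the Python 2 examples in this thread):

```python
import timeit

# A fast statement: timing one execution would be swamped by clock
# granularity, so run it one million times per timing.
t = timeit.Timer("b'abc'.decode('utf-8')")

# One timing of one million executions:
total = t.timeit(number=1000000)

# Three independent timings; the minimum is the least-disturbed run:
timings = t.repeat(repeat=3, number=1000000)
best = min(timings)

print("per-call estimate: %.1f ns" % (best / 1000000 * 1e9))
```

The per-call figure is machine-dependent; the point is only that `number` and `repeat` control for different error sources, exactly as the post argues.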
The authors of the timeit module obviously considered this an important
factor: not only did they allow you to specify the number of times to
execute the code snippet (defaulting to one million, not to one) but
they had this to say:

[quote]
Command line usage:
    python timeit.py [-n N] [-r N] [-s S] [-t] [-c] [-h] [statement]

Options:
  -n/--number N: how many times to execute 'statement'
  [...]

If -n is not given, a suitable number of loops is calculated by trying
successive powers of 10 until the total time is at least 0.2 seconds.
[end quote]

In other words, when calling the timeit module from the command line,
by default it will choose a value for n that gives a sufficiently small
relative error.

It's not an accident that timeit gives you two "count" parameters: the
number of times to execute the code snippet per timing, and the number
of timings. They control (partly) for different sources of error.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
-On [20090808 20:07], Thorsten Kampe (thors...@thorstenkampe.de) wrote:

>In real life people won't even notice whether an application takes one
>or two minutes to complete.

I think you are quite wrong here. I have worked with optical engineers
who needed to calculate grating numbers for their lenses. If they can
have a calculation program that runs in 1 minute instead of 2 they can
effectively double their output during the day (since they run
calculations hundreds to thousands of times a day to get the most
optimal results with minor tweaks).

I think you are hand-waving a bit too easily here in claiming that mere
minute runtimes are not noticeable.

-- 
Jeroen Ruigrok van der Werven / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
When we have not what we like, we must like what we have...
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
Michael Fötsch wrote:

> If speed is your primary concern, this will give you even better
> performance than unicode():
>
>     decoder = codecs.lookup("utf-8").decode
>     for i in xrange(100):
>         decoder("äöüÄÖÜß")[0]

Hmm, that could be interesting. I will give it a try.

> However, there's also a functional difference between unicode() and
> str.decode():
>
> unicode() always raises an exception when you try to decode a unicode
> object. str.decode() will first try to encode a unicode object using
> the default encoding (usually "ascii"), which might or might not work.

Thanks for pointing that out. So in my case I'd consider that also a
plus for using unicode().

Ciao, Michael.
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
* Michael Ströder (Fri, 07 Aug 2009 03:25:03 +0200)

> Thorsten Kampe wrote:
> > * Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
> > >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000)
> >> 17.23644495010376
> > >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000)
> >> 72.087096929550171
> >>
> >> That is significant! So the winner is:
> >>
> >> unicode('äöüÄÖÜß','utf-8')
> >
> > Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> > million times, these benchmarks are meaningless.
>
> Well, I can tell you I would not have posted this here and checked it
> if it would be meaningless for me. You don't have to read and answer
> this thread if it's meaningless to you.

Again: if you think decoding "äöüÄÖÜß" one million times is a real
world use case for your module then go for unicode(). Otherwise the
time you spent benchmarking artificial cases like this is just wasted
time.

In real life people won't even notice whether an application takes one
or two minutes to complete. Use whatever you prefer (decode() or
unicode()). If you experience performance bottlenecks when you're done,
test whether changing decode() to unicode() makes a difference. /That/
is relevant.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
Thorsten Kampe wrote:

> lines". That *is* *exactly* nothing.
>
> Another guy claims he gets times between 2.9 and 6.2 seconds when
> running decode/unicode in various manifestations over "18 million

over a sample of 60 words (sorry for not being able to explain myself
clearly enough so that everyone understands) while my current project
is 18e6 words, that is, the overall running time will be 87 vs. 186
seconds, which is fairly noticeable.

> words" (or is it 600 million?) and says "the differences are pretty
> significant".

600 million is the size of the whole corpus; that translates to 48
minutes vs. 1h43min. That already is a huge difference (going to lunch
during noon or waiting another hour until it runs over - and you can
bet it is _very_ noticeable when I am hungry :-)). With 9 different
versions of the corpus (that is, what we are really using now) that
goes to 7.2 hours (or even less with python3.1!) vs. 15 hours. Being
able to re-run the whole corpus generation in one working day (and then
go on with the next issues) vs. working overtime or delivering the
corpus one day later is a huge difference. Like, being one day behind
the schedule.

> I think I don't have to comment on that.

Indeed, the numbers are self-explanatory.

> If you increase the number of loops to one million or one billion or
> whatever even the slightest completely negligible difference will
> occur. The same thing will happen if you just increase the corpus of
> words to a million, trillion or whatever. The performance implications
> of that are exactly none.

I am not sure I understood that. Must be my English :-)

-- 
---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me
spread!
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
Thorsten Kampe wrote:

> * garabik-news-2005...@kassiopeia.juls.savba.sk (Fri, 7 Aug 2009
> 17:41:38 + (UTC))
>> Thorsten Kampe wrote:
>> > If you increase the number of loops to one million or one billion
>> > or whatever even the slightest completely negligible difference
>> > will occur. The same thing will happen if you just increase the
>> > corpus of words to a million, trillion or whatever. The performance
>> > implications of that are exactly none.
>>
>> I am not sure I understood that. Must be my English :-)
>
> I guess you understand me very well and I understand you very well. If

I did not. Really. But then it has been explained to me, so I think I
do now :-)

> the performance gain you want to prove doesn't show with 600,000
> words, you test again with 18,000,000 words and if that is not
> impressive enough with 600,000,000 words. Great.

Huh? 18e6 words is what I am working with _now_. Most of the data is
already collected; there are going to be a few more books, but that's
all. And the optimization I was talking about means going home from
work one hour later or earlier. Quite noticeable for me.

600e6 words is the main corpus. The data is already there and waits to
be processed in some time, once we finish our current project. That is
real life, not a thought experiment.

> Or if a million repetitions of your "improved" code don't show the
> expected "performance advantage" you run it a billion times. Even
> greater. Keep on optimizing.

No, we do not have one billion words (yet - I assume you are talking
about an American billion; if you are talking about a European billion,
we would be masters of the world with a billion word corpus!). However,
that might change once we start collecting www data (which is a
separate project, to be started in a year or two). Then we'll do some
more optimization because the time differences will be more noticeable.
Easy as that.

-- 
---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me
spread!
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
* alex23 (Fri, 7 Aug 2009 06:53:22 -0700 (PDT))

> Thorsten Kampe wrote:
> > Bollocks. No one will even notice whether a code sequence runs 2.7
> > or 5.7 seconds. That's completely artificial benchmarking.
>
> But that's not what you first claimed:
>
> > I don't think any measurable speed increase will be
> > noticeable between those two.
>
> But please, keep changing your argument so you don't have to admit you
> were wrong.

Bollocks. Please note the word "noticeable". "noticeable" as in
recognisable as in reasonably experiencable or as in whatever.

One guy claims he has times between 2.7 and 5.7 seconds when
benchmarking more or less randomly generated "one million different
lines". That *is* *exactly* nothing.

Another guy claims he gets times between 2.9 and 6.2 seconds when
running decode/unicode in various manifestations over "18 million
words" (or is it 600 million?) and says "the differences are pretty
significant". I think I don't have to comment on that.

If you increase the number of loops to one million or one billion or
whatever even the slightest completely negligible difference will
occur. The same thing will happen if you just increase the corpus of
words to a million, trillion or whatever. The performance implications
of that are exactly none.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
* Michael Ströder (Sat, 08 Aug 2009 15:09:23 +0200)

> Thorsten Kampe wrote:
> > * Steven D'Aprano (08 Aug 2009 03:29:43 GMT)
> >> But why assume that the program takes 8 minutes to run? Perhaps it
> >> takes 8 seconds to run, and 6 seconds of that is the decoding. Then
> >> halving that reduces the total runtime from 8 seconds to 5, which
> >> is a noticeable speed increase to the user, and significant if you
> >> then run that program tens of thousands of times.
> >
> > Exactly. That's why it doesn't make sense to benchmark
> > decode()/unicode() isolated - meaning out of the context of your
> > actual program.
>
> Thorsten, the point is you're too arrogant to admit that making such a
> general statement like you did without knowing *anything* about the
> context is simply false.

I made a general statement to a very general question ("These both
expressions are equivalent but which is faster or should be used for
any reason?"). If you have specific needs or reasons then you obviously
failed to provide that specific "context" in your question.

> >> By all means, reminding people that pre-mature optimization is a
> >> waste of time, but it's possible to take that attitude too far to
> >> Planet Bizarro. At the point that you start insisting, and
> >> emphasising, that a three second time difference is "*exactly*"
> >> zero,
> >
> > Exactly. Because it was not generated in a real world use case but
> > by running a simple loop one million times. Why one million times?
> > Because by running it "only" one hundred thousand times the
> > difference would have seemed even less relevant.
>
> I was running it one million times to mitigate influences on the
> timing by other background processes which is a common technique when
> benchmarking.

Err, no. That is what "repeat" is for and it defaults to 3 ("This means
that other processes running on the same computer may interfere with
the timing. The best thing to do when accurate timing is necessary is
to repeat the timing a few times and use the best time. [...] the
default of 3 repetitions is probably enough in most cases.") Three
times - not one million times.

You choose one million times (for the loop) when the thing you're
testing is very fast (like decoding) and you don't want results in the
0.0n range. Which is what you asked for and what you got.

> > I already gave good advice:
> > 1. don't benchmark
> > 2. don't benchmark until you have an actual performance issue
> > 3. if you benchmark then the whole application and not single
> >    commands
>
> You don't know anything about what I'm doing and what my aim is. So
> your general rules don't apply.

See above. You asked a general question, you got a general answer.

> > It's really easy: Michael has working code. With that he can easily
> > write two versions - one that uses decode() and one that uses
> > unicode().
>
> Yes, I have working code which was originally written before .decode()
> was added in Python 2.2. Therefore I wondered whether it would be nice
> for readability to replace unicode() by s.decode() since the software
> does not support Python versions prior to 2.3 anymore anyway. But one
> aspect is also performance and hence my question and testing.

You haven't done any testing yet. Running decode/unicode one million
times in a loop is not testing. If you don't believe me then read at
least Martelli's Optimization chapter in Python in a Nutshell (the
chapter is available via Google Books).

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
Michael Ströder wrote:

> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000)
> 17.23644495010376
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000)
> 72.087096929550171
>
> That is significant! So the winner is:
>
> unicode('äöüÄÖÜß','utf-8')

Which proves that benchmark results can be misleading sometimes. :-)
unicode() becomes *slower* when you try "UTF-8" in uppercase, or an
entirely different codec, say "cp1252":

>>> timeit.Timer("unicode('äöüÄÖÜß','UTF-8')").timeit(100)
2.5777881145477295
>>> timeit.Timer("'äöüÄÖÜß'.decode('UTF-8')").timeit(100)
1.8430399894714355
>>> timeit.Timer("unicode('äöüÄÖÜß','cp1252')").timeit(100)
2.3622498512268066
>>> timeit.Timer("'äöüÄÖÜß'.decode('cp1252')").timeit(100)
1.7812771797180176

The reason seems to be that unicode() bypasses codecs.lookup() if the
encoding is one of "utf-8", "latin-1", "mbcs", or "ascii". OTOH,
str.decode() always calls codecs.lookup().

If speed is your primary concern, this will give you even better
performance than unicode():

    decoder = codecs.lookup("utf-8").decode
    for i in xrange(100):
        decoder("äöüÄÖÜß")[0]

However, there's also a functional difference between unicode() and
str.decode():

unicode() always raises an exception when you try to decode a unicode
object. str.decode() will first try to encode a unicode object using
the default encoding (usually "ascii"), which might or might not work.

Kind Regards,
M.F.
-- 
http://mail.python.org/mailman/listinfo/python-list
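[Editorial sketch] The cached-decoder trick above translates to Python 3 (where `str.decode` is gone and `bytes.decode` takes its place) as follows; the registry lookup happens once, up front, and the codec-level decode function returns a (text, bytes_consumed) pair rather than just the string:

```python
import codecs

# Look up the codec once; reuse the bound decode function afterwards.
# This avoids repeating the registry consultation on every call.
decoder = codecs.lookup("utf-8").decode

data = "äöüÄÖÜß".encode("utf-8")

# The codec-level decoder returns (decoded_text, number_of_bytes_consumed),
# hence the [0] in the original post's loop:
text, consumed = decoder(data)
print(text)                    # äöüÄÖÜß
print(consumed == len(data))   # True: the whole input was consumed
```

Whether this still beats the decode method on a modern interpreter is machine- and version-dependent; the functional difference the post mentions (the implicit ascii-encode of unicode objects) is Python 2 behavior only.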
Re: unicode() vs. s.decode()
Thorsten Kampe wrote:

> * Steven D'Aprano (08 Aug 2009 03:29:43 GMT)
>> But why assume that the program takes 8 minutes to run? Perhaps it
>> takes 8 seconds to run, and 6 seconds of that is the decoding. Then
>> halving that reduces the total runtime from 8 seconds to 5, which is
>> a noticeable speed increase to the user, and significant if you then
>> run that program tens of thousands of times.
>
> Exactly. That's why it doesn't make sense to benchmark
> decode()/unicode() isolated - meaning out of the context of your
> actual program.

Thorsten, the point is you're too arrogant to admit that making such a
general statement like you did without knowing *anything* about the
context is simply false. So this is not a technical matter. It's mainly
an issue with your attitude.

>> By all means, reminding people that pre-mature optimization is a
>> waste of time, but it's possible to take that attitude too far to
>> Planet Bizarro. At the point that you start insisting, and
>> emphasising, that a three second time difference is "*exactly*" zero,
>
> Exactly. Because it was not generated in a real world use case but by
> running a simple loop one million times. Why one million times?
> Because by running it "only" one hundred thousand times the difference
> would have seemed even less relevant.

I was running it one million times to mitigate influences on the timing
by other background processes, which is a common technique when
benchmarking. I was mainly interested in the percentage, which is
indeed significant. The absolute times also strongly depend on the
hardware the software is running on, so your comment about the absolute
times is complete nonsense. I'm eager that this software should also
run with acceptable response times on hardware much slower than my
development machine.

> I already gave good advice:
> 1. don't benchmark
> 2. don't benchmark until you have an actual performance issue
> 3. if you benchmark then the whole application and not single commands

You don't know anything about what I'm doing and what my aim is. So
your general rules don't apply.

> It's really easy: Michael has working code. With that he can easily
> write two versions - one that uses decode() and one that uses
> unicode().

Yes, I have working code which was originally written before .decode()
was added in Python 2.2. Therefore I wondered whether it would be nice
for readability to replace unicode() by s.decode() since the software
does not support Python versions prior to 2.3 anymore anyway. But one
aspect is also performance and hence my question and testing.

Ciao, Michael.
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
* garabik-news-2005...@kassiopeia.juls.savba.sk (Fri, 7 Aug 2009
17:41:38 + (UTC))

> Thorsten Kampe wrote:
> > If you increase the number of loops to one million or one billion or
> > whatever even the slightest completely negligible difference will
> > occur. The same thing will happen if you just increase the corpus of
> > words to a million, trillion or whatever. The performance
> > implications of that are exactly none.
>
> I am not sure I understood that. Must be my English :-)

I guess you understand me very well and I understand you very well. If
the performance gain you want to prove doesn't show with 600,000 words,
you test again with 18,000,000 words and if that is not impressive
enough with 600,000,000 words. Great.

Or if a million repetitions of your "improved" code don't show the
expected "performance advantage" you run it a billion times. Even
greater. Keep on optimizing.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
* alex23 (Fri, 7 Aug 2009 10:45:29 -0700 (PDT))

> garabik-news-2005...@kassiopeia.juls.savba.sk wrote:
> > I am not sure I understood that. Must be my English :-)
>
> I just parsed it as "blah blah blah I won't admit I'm wrong" and
> didn't miss anything substantive.

Alex, there are still a number of performance optimizations that
require a thorough optimizer like you. Like using short identifiers
instead of long ones. I guess you could easily prove that by comparing
"a = 0" to "a_long_identifier = 0" and running it one hundred trillion
times. The performance gain could easily add up to *days*. Keep us
updated.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
* Steven D'Aprano (08 Aug 2009 03:29:43 GMT)

> On Fri, 07 Aug 2009 17:13:07 +0200, Thorsten Kampe wrote:
> > One guy claims he has times between 2.7 and 5.7 seconds when
> > benchmarking more or less randomly generated "one million different
> > lines". That *is* *exactly* nothing.
>
> We agree that in the grand scheme of things, a difference of 2.7
> seconds versus 5.7 seconds is a trivial difference if your entire
> program takes (say) 8 minutes to run. You won't even notice it.

Exactly.

> But why assume that the program takes 8 minutes to run? Perhaps it
> takes 8 seconds to run, and 6 seconds of that is the decoding. Then
> halving that reduces the total runtime from 8 seconds to 5, which is a
> noticeable speed increase to the user, and significant if you then run
> that program tens of thousands of times.

Exactly. That's why it doesn't make sense to benchmark
decode()/unicode() isolated - meaning out of the context of your actual
program.

> By all means, reminding people that pre-mature optimization is a waste
> of time, but it's possible to take that attitude too far to Planet
> Bizarro. At the point that you start insisting, and emphasising, that
> a three second time difference is "*exactly*" zero,

Exactly. Because it was not generated in a real world use case but by
running a simple loop one million times. Why one million times? Because
by running it "only" one hundred thousand times the difference would
have seemed even less relevant.

> it seems to me that this is about you winning rather than you giving
> good advice.

I already gave good advice:

1. don't benchmark
2. don't benchmark until you have an actual performance issue
3. if you benchmark then the whole application and not single commands

It's really easy: Michael has working code. With that he can easily
write two versions - one that uses decode() and one that uses
unicode(). He can benchmark these with some real world input he often
uses by running them a hundred or a thousand times (even a million if
he likes). Then he can compare the results. I doubt that there will be
any noticeable difference.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
On Fri, 07 Aug 2009 17:13:07 +0200, Thorsten Kampe wrote:

> One guy claims he has times between 2.7 and 5.7 seconds when
> benchmarking more or less randomly generated "one million different
> lines". That *is* *exactly* nothing.

We agree that in the grand scheme of things, a difference of 2.7
seconds versus 5.7 seconds is a trivial difference if your entire
program takes (say) 8 minutes to run. You won't even notice it.

But why assume that the program takes 8 minutes to run? Perhaps it
takes 8 seconds to run, and 6 seconds of that is the decoding. Then
halving that reduces the total runtime from 8 seconds to 5, which is a
noticeable speed increase to the user, and significant if you then run
that program tens of thousands of times.

The Python dev team spend significant time and effort to get
improvements of the order of 10%, and you're pooh-poohing an
improvement of the order of 100%.

By all means, reminding people that pre-mature optimization is a waste
of time, but it's possible to take that attitude too far to Planet
Bizarro. At the point that you start insisting, and emphasising, that a
three second time difference is "*exactly*" zero, it seems to me that
this is about you winning rather than you giving good advice.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
On Fri, 07 Aug 2009 12:00:42 +0200, Thorsten Kampe wrote:

> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.

You think users won't notice a doubling of execution time? Well, that
explains some of the apps I'm forced to use...

A two-second running time for (say) a command-line tool is already
noticeable. A five-second one is *very* noticeable -- long enough to be
a drag, short enough that you aren't tempted to go off and do something
else while you're waiting for it to finish.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
garabik-news-2005...@kassiopeia.juls.savba.sk wrote:

> I am not sure I understood that. Must be my English :-)

I just parsed it as "blah blah blah I won't admit I'm wrong" and didn't
miss anything substantive.
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
Thorsten Kampe wrote:

> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.

But that's not what you first claimed:

> I don't think any measurable speed increase will be
> noticeable between those two.

But please, keep changing your argument so you don't have to admit you
were wrong.
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
Thorsten Kampe wrote:

> * Steven D'Aprano (06 Aug 2009 19:17:30 GMT)
>> What if you're writing a loop which takes one million different lines
>> of text and decodes them once each?
>>
>> >>> setup = 'L = ["abc"*(n%100) for n in xrange(100)]'
>> >>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
>> >>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
>> >>> t1.timeit(number=1)
>> 5.6751680374145508
>> >>> t2.timeit(number=1)
>> 2.682251165771
>>
>> Seems like a pretty meaningful difference to me.
>
> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.

For a real-life example, I often have a file with one word per line,
and I run python scripts to apply some (sometimes fairly trivial)
transformation over it.

REAL example: reading lines with word, lemma, tag separated by tabs
from stdin and writing word into stdout, unless it starts with '<'
(~6e5 lines, python2.5, user times, warm cache, I hope the comments are
self-explanatory):

no unicode                                          user 0m2.380s
decode('utf-8'), encode('utf-8')                    user 0m3.560s
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stdin = codecs.getreader('utf-8')(sys.stdin)    user 0m6.180s
unicode(line, 'utf8'), encode('utf-8')              user 0m3.820s
unicode(line, 'utf-8'), encode('utf-8')             user 0m2.880s
python3.1                                           user 0m1.560s

Since I have something like 18 million words in my current project (and
600 million overall) and I often tweak some parameters and re-run the
transformations, the differences are pretty significant.

Personally, I have been surprised by:
1) the bad performance of the codecs wrapper (I expected it to be on
   par with unicode(x, 'utf-8'), maybe slightly better due to fewer
   function calls)
2) the good performance of python3.1 (utf-8 locale)

-- 
---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me
spread!
-- 
http://mail.python.org/mailman/listinfo/python-list
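[Editorial sketch] For concreteness, the corpus filter described above might look like the following (the function name and the exact three-column format are assumptions reconstructed from the description: tab-separated word/lemma/tag lines, emit the word column unless it starts with '<'):

```python
import io

def filter_words(infile, outfile):
    # Keep only the first (word) column; skip SGML-style markup lines
    # such as <s>, which carry no tab-separated columns.
    for line in infile:
        word = line.split("\t", 1)[0].rstrip("\n")
        if word and not word.startswith("<"):
            outfile.write(word + "\n")

# Example run on a tiny in-memory fragment of such a corpus:
src = io.StringIO("dog\tdog\tNN\n<s>\ncafé\tcafé\tNN\n")
dst = io.StringIO()
filter_words(src, dst)
print(dst.getvalue())   # dog and café, one per line; <s> is dropped
```

In the post's Python 2 variants the decode/encode pair (or the codecs stream wrappers) would be applied inside this loop, which is exactly where the measured differences accumulate over ~6e5 lines.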
Re: unicode() vs. s.decode()
* Steven D'Aprano (06 Aug 2009 19:17:30 GMT)

> On Thu, 06 Aug 2009 20:05:52 +0200, Thorsten Kampe wrote:
> > > That is significant! So the winner is:
> > >
> > > unicode('äöüÄÖÜß','utf-8')
> >
> > Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> > million times, these benchmarks are meaningless.
>
> What if you're writing a loop which takes one million different lines
> of text and decodes them once each?
>
> >>> setup = 'L = ["abc"*(n%100) for n in xrange(100)]'
> >>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
> >>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
> >>> t1.timeit(number=1)
> 5.6751680374145508
> >>> t2.timeit(number=1)
> 2.682251165771
>
> Seems like a pretty meaningful difference to me.

Bollocks. No one will even notice whether a code sequence runs 2.7 or
5.7 seconds. That's completely artificial benchmarking.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
On Fri, 07 Aug 2009 08:04:51 +0100, Mark Lawrence wrote:

> I believe that the comment "these benchmarks are meaningless" refers
> to the length of the strings being used in the tests. Surely something
> involving thousands or millions of characters is more meaningful? Or
> to go the other way, you are unlikely to write
>
> for c in 'äöüÄÖÜß':
>     u = unicode(c, 'utf-8')
>     ...
>
> Yes?

There are all sorts of potential use-cases. A day or two ago, somebody
posted a question involving tens of thousands of lines of tens of
thousands of characters each (don't quote me, I'm going by memory). On
the other hand, it doesn't require much imagination to think of a
use-case where there are millions of lines each of a dozen or so
characters, and you want to process it line by line:

noun: cat
noun: dog
verb: café
...

As always, before optimizing, you should profile to be sure you are
actually optimizing and not wasting your time.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
Michael Ströder wrote:

> Thorsten Kampe wrote:
> > * Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
> > >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000)
> > 17.23644495010376
> > >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000)
> > 72.087096929550171
> >
> > That is significant! So the winner is:
> >
> > unicode('äöüÄÖÜß','utf-8')
> >
> > Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> > million times, these benchmarks are meaningless.
>
> Well, I can tell you I would not have posted this here and checked it
> if it would be meaningless for me. You don't have to read and answer
> this thread if it's meaningless to you.
>
> Ciao, Michael.

I believe that the comment "these benchmarks are meaningless" refers to
the length of the strings being used in the tests. Surely something
involving thousands or millions of characters is more meaningful? Or to
go the other way, you are unlikely to write

for c in 'äöüÄÖÜß':
    u = unicode(c, 'utf-8')
    ...

Yes?

-- 
Kindest regards.

Mark Lawrence.
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
Jason Tackaberry urandom.ca> writes:

> On Thu, 2009-08-06 at 01:31 +, John Machin wrote:
> > Suggested further avenues of investigation:
> >
> > (1) Try the timing again with "cp1252" and "utf8" and "utf_8"
> >
> > (2) grep "utf-8" /Objects/unicodeobject.c
>
> Very pedagogical of you. :) Indeed, it looks like the bigger player in
> the performance difference is the fact that the code path for
> unicode(s, enc) short-circuits the codec registry for common encodings
> (which includes 'utf-8' specifically), whereas s.decode('utf-8')
> necessarily consults the codec registry.

So the next question (the answer to which may benefit all users of
.encode() and .decode()) is: Why does consulting the codec registry
take so long, and can this be improved?
-- 
http://mail.python.org/mailman/listinfo/python-list
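[Editorial sketch] The registry cost can be poked at directly: codecs.lookup() caches its results, so repeating the same lookup is cheap, and the remaining per-call overhead of the decode method can be compared against a pre-bound codec-level decoder. This is a rough Python 3 illustration; the absolute numbers and even the ordering are machine- and version-dependent and need not match the Python 2 results quoted in this thread:

```python
import codecs
import timeit

# The codec registry caches lookups: asking for the same encoding
# string again returns the identical CodecInfo object.
info_a = codecs.lookup("utf-8")
info_b = codecs.lookup("utf-8")
print(info_a is info_b)   # the second lookup hits the registry cache

# Compare decoding through the method (which goes through the
# per-call encoding-dispatch machinery) against a decode function
# bound once via codecs.lookup():
setup = (
    "import codecs\n"
    "data = 'äöüÄÖÜß'.encode('utf-8')\n"
    "decoder = codecs.lookup('utf-8').decode\n"
)
t_method = timeit.timeit("data.decode('utf-8')", setup, number=100000)
t_cached = timeit.timeit("decoder(data)", setup, number=100000)
print(t_method, t_cached)
```

On modern CPython the decode method itself special-cases common encodings, so the cached decoder is not guaranteed to win; the sketch only makes the lookup-versus-decode cost split observable.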
Re: unicode() vs. s.decode()
Thorsten Kampe wrote:

> * Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000)
>> 17.23644495010376
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000)
>> 72.087096929550171
>>
>> That is significant! So the winner is:
>>
>> unicode('äöüÄÖÜß','utf-8')
>
> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> million times, these benchmarks are meaningless.

Well, I can tell you I would not have posted this here and checked it
if it would be meaningless for me. You don't have to read and answer
this thread if it's meaningless to you.

Ciao, Michael.
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: unicode() vs. s.decode()
On Thu, 06 Aug 2009 20:05:52 +0200, Thorsten Kampe wrote:

>>> That is significant! So the winner is:
>>>
>>> unicode('äöüÄÖÜß','utf-8')
>
> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> million times, these benchmarks are meaningless.

What if you're writing a loop which takes one million different lines of
text and decodes them once each?

>>> import timeit
>>> setup = 'L = ["abc"*(n%100) for n in xrange(100)]'
>>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
>>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
>>> t1.timeit(number=1)
5.6751680374145508
>>> t2.timeit(number=1)
2.682251165771

Seems like a pretty meaningful difference to me.

--
Steven
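[Editor's note: the loop-over-many-lines benchmark above can be re-run on Python 3, where bytes.decode() and the str constructor are the two spellings being compared. A rough sketch — absolute numbers depend entirely on the machine and interpreter version:]

```python
import timeit

# 100 byte strings of varying length, mirroring the setup above
# (xrange is spelled range in Python 3, and the literals are bytes).
setup = 'L = [b"abc" * (n % 100) for n in range(100)]'

t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
t2 = timeit.Timer('for line in L: str(line, "utf-8")', setup)

print(t1.timeit(number=10000))
print(t2.timeit(number=10000))
```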
Re: unicode() vs. s.decode()
* Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
> Thorsten Kampe wrote:
>> * Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
>> I don't think any measurable speed increase will be noticeable
>> between those two.
>
> Well, seems not to be true. Try yourself. I did (my console has UTF-8
> as charset):
>
> Python 2.6 (r26:66714, Feb  3 2009, 20:52:03)
> [GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import timeit
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf-8')").timeit(100)
> 7.2721178531646729
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(100)
> 7.1302499771118164
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(100)
> 8.3726329803466797
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(100)
> 1.8622009754180908
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(100)
> 8.651669979095459
> >>>
>
> Comparing again the two best combinations:
>
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000)
> 17.23644495010376
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000)
> 72.087096929550171
>
> That is significant! So the winner is:
>
> unicode('äöüÄÖÜß','utf-8')

Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
million times, these benchmarks are meaningless.

Thorsten
Re: unicode() vs. s.decode()
Thorsten Kampe wrote:
> * Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
>> Both these expressions are equivalent but which is faster or should
>> be used for any reason?
>>
>> u = unicode(s,'utf-8')
>>
>> u = s.decode('utf-8') # looks nicer
>
> "decode" was added in Python 2.2 for the sake of symmetry to encode().

Yes, and I like the style. But...

> It's essentially the same as unicode() and I wouldn't be surprised if
> it is exactly the same.

Did you try?

> I don't think any measurable speed increase will be noticeable between
> those two.

Well, seems not to be true. Try yourself. I did (my console has UTF-8 as
charset):

Python 2.6 (r26:66714, Feb  3 2009, 20:52:03)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf-8')").timeit(100)
7.2721178531646729
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(100)
7.1302499771118164
>>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(100)
8.3726329803466797
>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(100)
1.8622009754180908
>>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(100)
8.651669979095459
>>>

Comparing again the two best combinations:

>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000)
17.23644495010376
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000)
72.087096929550171

That is significant! So the winner is:

unicode('äöüÄÖÜß','utf-8')

Ciao, Michael.
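[Editor's note: for readers on Python 3, unicode() is gone and str(b, enc) is the closest analogue of the constructor form. A rough re-run of the comparison above — the 'utf-8'/'utf8' spellings decode identically, and the timings are merely illustrative:]

```python
import timeit

s = "äöüÄÖÜß".encode("utf-8")  # a bytes object, as the 2.x literal was

# All three spellings produce the same text.
assert s.decode("utf-8") == s.decode("utf8") == str(s, "utf-8")

# Time the two forms under discussion.
for stmt in ('s.decode("utf-8")', 'str(s, "utf-8")'):
    t = timeit.timeit(stmt, globals={"s": s}, number=100_000)
    print(stmt, t)
```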
Re: unicode() vs. s.decode()
* Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
> Both these expressions are equivalent but which is faster or should be
> used for any reason?
>
> u = unicode(s,'utf-8')
>
> u = s.decode('utf-8') # looks nicer

"decode" was added in Python 2.2 for the sake of symmetry to encode().
It's essentially the same as unicode() and I wouldn't be surprised if it
is exactly the same. I don't think any measurable speed increase will be
noticeable between those two.

Thorsten
Re: unicode() vs. s.decode()
On Thu, 2009-08-06 at 01:31 +, John Machin wrote:
> Faster by an enormous margin; attributing this to the cost of
> attribute lookup seems implausible.

Ok, fair point. I don't think the time difference fully registered when
I composed that message. Testing a global access (LOAD_GLOBAL) versus an
attribute access on a global object (LOAD_GLOBAL + LOAD_ATTR) shows that
the latter is about 40% slower than the former. So that certainly
doesn't account for the difference.

> Suggested further avenues of investigation:
>
> (1) Try the timing again with "cp1252" and "utf8" and "utf_8"
>
> (2) grep "utf-8" /Objects/unicodeobject.c

Very pedagogical of you. :) Indeed, it looks like the bigger player in
the performance difference is the fact that the code path for
unicode(s, enc) short-circuits the codec registry for common encodings
(which includes 'utf-8' specifically), whereas s.decode('utf-8')
necessarily consults the codec registry.

Cheers,
Jason.
Re: unicode() vs. s.decode()
Jason Tackaberry urandom.ca> writes:
> On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:
>> Both these expressions are equivalent but which is faster or should
>> be used for any reason?
>> u = unicode(s,'utf-8')
>> u = s.decode('utf-8') # looks nicer
>
> It is sometimes non-obvious which constructs are faster than others in
> Python. I also regularly have these questions, but it's pretty easy to
> run quick (albeit naive) benchmarks to see.
>
> The first thing to try is to have a look at the bytecode for each:
[snip]
> The presence of LOAD_ATTR in the first form hints that this is
> probably going to be slower. Next, actually try it:
>
> >>> import timeit
> >>> timeit.timeit('"foobarbaz".decode("utf-8")')
> 1.698289155960083
> >>> timeit.timeit('unicode("foobarbaz", "utf-8")')
> 0.53305888175964355
>
> So indeed, unicode(s, 'utf-8') is faster by a fair margin.

Faster by an enormous margin; attributing this to the cost of attribute
lookup seems implausible.

Suggested further avenues of investigation:

(1) Try the timing again with "cp1252" and "utf8" and "utf_8"

(2) grep "utf-8" /Objects/unicodeobject.c

HTH,
John
Re: unicode() vs. s.decode()
unicode() has LOAD_GLOBAL which s.decode() does not. Is it generally the
case that LOAD_ATTR is slower than LOAD_GLOBAL that led to your
intuition that the former would probably be slower? Or some other
intuition?

Of course, the results from timeit are a different thing - I ask about
the intuition in the disassembler output.

Thanks.

> The presence of LOAD_ATTR in the first form hints that this is
> probably going to be slower. Next, actually try it:
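[Editor's note: the opcode question can be checked directly with the dis module. A Python 3 sketch — the exact opcode names vary between CPython versions (newer releases may show LOAD_METHOD or specialized variants instead of a plain LOAD_ATTR), and str stands in for the 2.x unicode builtin:]

```python
import dis

def via_method(b):
    return b.decode("utf-8")   # attribute access on the object

def via_builtin(b):
    return str(b, "utf-8")     # lookup of a global/builtin name

dis.dis(via_method)   # contains a LOAD_ATTR (or LOAD_METHOD) for .decode
dis.dis(via_builtin)  # contains a LOAD_GLOBAL for str
```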
Re: unicode() vs. s.decode()
On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:
> Both these expressions are equivalent but which is faster or should be
> used for any reason?
>
> u = unicode(s,'utf-8')
>
> u = s.decode('utf-8') # looks nicer

It is sometimes non-obvious which constructs are faster than others in
Python. I also regularly have these questions, but it's pretty easy to
run quick (albeit naive) benchmarks to see.

The first thing to try is to have a look at the bytecode for each:

>>> import dis
>>> dis.dis(lambda s: s.decode('utf-8'))
  1           0 LOAD_FAST                0 (s)
              3 LOAD_ATTR                0 (decode)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            1
             12 RETURN_VALUE
>>> dis.dis(lambda s: unicode(s, 'utf-8'))
  1           0 LOAD_GLOBAL              0 (unicode)
              3 LOAD_FAST                0 (s)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            2
             12 RETURN_VALUE

The presence of LOAD_ATTR in the first form hints that this is probably
going to be slower. Next, actually try it:

>>> import timeit
>>> timeit.timeit('"foobarbaz".decode("utf-8")')
1.698289155960083
>>> timeit.timeit('unicode("foobarbaz", "utf-8")')
0.53305888175964355

So indeed, unicode(s, 'utf-8') is faster by a fair margin.

On the other hand, unless you need to do this in a tight loop several
tens of thousands of times, I'd prefer the slower form s.decode('utf-8')
because it's, as you pointed out, cleaner and more readable code.

Cheers,
Jason.