Re: unicode() vs. s.decode()

2009-08-09 Thread Steven D'Aprano
On Sat, 08 Aug 2009 19:00:11 +0200, Thorsten Kampe wrote:

>> I was running it one million times to mitigate influences on the timing
>> by other background processes, which is a common technique when
>> benchmarking.
> 
> Err, no. That is what "repeat" is for and it defaults to 3 ("This means
> that other processes running on the same computer may interfere with the
> timing. The best thing to do when accurate timing is necessary is to
> repeat the timing a few times and use the best time. [...] the default
> of 3 repetitions is probably enough in most cases.")


It's useful to look at the timeit module to see what the author(s) think.

Let's start with the repeat() method. In the Timer docstring:

"The repeat() method is a convenience to call timeit() multiple times and 
return a list of results."

and the repeat() method's own docstring:

"This is a convenience function that calls the timeit() repeatedly, 
returning a list of results.  The first argument specifies how many times 
to call timeit(), defaulting to 3; the second argument specifies the 
timer argument, defaulting to one million."

So it's quite obvious that the module author(s), and possibly even Tim 
Peters himself, consider repeat() to be a mere convenience method. 
There's nothing you can do with repeat() that can't be done with the 
timeit() method itself.

Notice that both repeat() and timeit() methods take an argument to 
specify how many times to execute the code snippet. Why not just execute 
it once? The module doesn't say, but the answer is a basic measurement 
technique: if your clock is accurate to (say) a millisecond, and you 
measure a single event as taking a millisecond, then your relative error 
is roughly 100%. But if you time 1000 events, and measure the total time 
as 1 second, the relative error is now 0.1%.
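
To see the effect in code, here's a quick sketch (illustrative only; the 
absolute numbers will vary by machine):

    import timeit

    t = timeit.Timer("'abc'.decode('utf-8')")
    # a single execution is mostly clock granularity and noise
    print t.timeit(number=1)
    # batching one million executions amortises that measurement error
    print t.timeit(number=1000000) / 1000000.0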

The authors of the timeit module obviously considered this an important 
factor: not only did they allow you to specify the number of times to 
execute the code snippet (defaulting to one million, not to one) but they 
had this to say:

[quote]
Command line usage:
python timeit.py [-n N] [-r N] [-s S] [-t] [-c] [-h] [statement]

Options:
  -n/--number N: how many times to execute 'statement'
 [...]

If -n is not given, a suitable number of loops is calculated by trying
successive powers of 10 until the total time is at least 0.2 seconds.
[end quote]

In other words, when calling the timeit module from the command line, by 
default it will choose a value for n that gives a sufficiently small 
relative error.
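
In code, that default is roughly the following sketch (the documented 
behaviour, not the module's actual source):

    import timeit

    t = timeit.Timer("'abc'.decode('utf-8')")
    for i in range(1, 10):
        number = 10 ** i
        if t.timeit(number) >= 0.2:
            break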

It's not an accident that timeit gives you two "count" parameters: the 
number of times to execute the code snippet per timing, and the number of 
timings. They control (partly) for different sources of error.
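
Putting both knobs together looks something like this (a sketch):

    import timeit

    t = timeit.Timer("unicode('abc', 'utf-8')")
    # number amortises clock granularity; repeat guards against other
    # processes stealing CPU time - take the best (minimum) run
    results = t.repeat(repeat=3, number=1000000)
    print min(results) / 1000000.0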



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-09 Thread Jeroen Ruigrok van der Werven
-On [20090808 20:07], Thorsten Kampe (thors...@thorstenkampe.de) wrote:
>In real life people won't even notice whether an application takes one or
>two minutes to complete.

I think you are quite wrong here.

I have worked with optical engineers who needed to calculate grating numbers
for their lenses. If they can have a calculation program that runs in 1
minute instead of 2, they can effectively double their output during the day
(since they run calculations hundreds to thousands of times a day to get
optimal results with minor tweaks).

I think you are hand-waving a bit too easily here in claiming that runtimes
of mere minutes are not noticeable.

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
When we have not what we like, we must like what we have...
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-08 Thread Michael Ströder
Michael Fötsch wrote:
> If speed is your primary concern, this will give you even better
> performance than unicode():
> 
>   decoder = codecs.lookup("utf-8").decode
>   for i in xrange(1000000):
>       decoder("äöüÄÖÜß")[0]

Hmm, that could be interesting. I will give it a try.

> However, there's also a functional difference between unicode() and
> str.decode():
> 
> unicode() always raises an exception when you try to decode a unicode
> object. str.decode() will first try to encode a unicode object using the
> default encoding (usually "ascii"), which might or might not work.

Thanks for pointing that out. So in my case I'd consider that also a plus for
using unicode().

Ciao, Michael.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-08 Thread Thorsten Kampe
* Michael Ströder (Fri, 07 Aug 2009 03:25:03 +0200)
> Thorsten Kampe wrote:
> > * Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
> > timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000)
> >> 17.23644495010376
> > timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000)
> >> 72.087096929550171
> >>
> >> That is significant! So the winner is:
> >>
> >> unicode('äöüÄÖÜß','utf-8')
> > 
> > Unless you are planning to write a loop that decodes "äöüÄÖÜß" one 
> > million times, these benchmarks are meaningless.
> 
> Well, I can tell you I would not have posted this here and checked it if it
> were meaningless to me. You don't have to read and answer this thread if
> it's meaningless to you.

Again: if you think decoding "äöüÄÖÜß" one million times is a real world 
use case for your module then go for unicode(). Otherwise the time you 
spent benchmarking artificial cases like this is just wasted time. In 
real life people won't even notice whether an application takes one or 
two minutes to complete.

Use whatever you prefer (decode() or unicode()). If you experience 
performance bottlenecks when you're done, test whether changing decode() 
to unicode() makes a difference. /That/ is relevant.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-08 Thread garabik-news-2005-05
Thorsten Kampe  wrote:

> Another guy claims he gets times between 2.9 and 6.2 seconds when 
> running decode/unicode in various manifestations over "18 million 
> words"

over a sample of 6e5 words (sorry for not being able to explain
myself clearly enough so that everyone understands)
while my current project is 18e6 words, so the overall running time
will be 87 vs. 186 seconds, which is fairly noticeable.

> (or is it 600 million?) and says "the differences are pretty 
> significant". 

600 million is the size of the whole corpus, which translates to
48 minutes vs. 1h43min. That is already a huge difference (going to
lunch at noon or waiting another hour until it finishes - and 
you can bet it is _very_ noticeable when I am hungry :-)).

With 9 different versions of the corpus (that is, what we are really
using now) that goes to 7.2 hours (or even less with python3.1!) vs. 15
hours. Being able to re-run the whole corpus generation in one working
day (and then go on with the next issues) vs. working overtime or
delivering the corpus one day later is a huge difference. Like being
one day behind schedule.

> I think I don't have to comment on that.

Indeed, the numbers are self-explanatory.

> 
> If you increase the number of loops to one million or one billion or 
> whatever even the slightest completely negligible difference will occur. 
> The same thing will happen if you just increase the corpus of words to a 
> million, trillion or whatever. The performance implications of that are 
> exactly none.
> 

I am not sure I understood that. Must be my English :-)

-- 
 ---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__garabik @ kassiopeia.juls.savba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-08 Thread garabik-news-2005-05
Thorsten Kampe  wrote:
> * garabik-news-2005...@kassiopeia.juls.savba.sk (Fri, 7 Aug 2009 
> 17:41:38 +0000 (UTC))
>> Thorsten Kampe  wrote:
>> > If you increase the number of loops to one million or one billion or
>> > whatever even the slightest completely negligible difference will
>> > occur. The same thing will happen if you just increase the corpus of
>> > words to a million, trillion or whatever. The performance
>> > implications of that are exactly none.
>> 
>> I am not sure I understood that. Must be my English :-)
> 
> I guess you understand me very well and I understand you very well. If 

I did not. Really. But then it has been explained to me, so I think I do
now :-)

> the performance gain you want to prove doesn't show with 600,000 words, 
> you test again with 18,000,000 words and if that is not impressive 
> enough with 600,000,000 words. Great.
> 

Huh? 
18e6 words is what I am working with _now_. Most of the data is already
collected; there are going to be a few more books, but that's all. And the
optimization I was talking about means going home from work one hour
later or earlier. Quite noticeable for me.
600e6 words is the main corpus. The data is already there and waits to be
processed once we finish our current project. That is 
real life, not a thought experiment.


> Or if a million repetitions of your "improved" code don't show the 
> expected "performance advantage" you run it a billion times. Even 
> greater. Keep on optimizing.

No, we do not have one billion words (yet - I assume you are talking
about an American billion; if you are talking about a European billion, we
would be masters of the world with a billion word corpus!).
However, that might change once we start collecting www data (which is a
separate project, to be started in a year or two).
Then we'll do some more optimization, because the time differences will
be more noticeable. Easy as that.


-- 
 ---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__garabik @ kassiopeia.juls.savba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-08 Thread Thorsten Kampe
* alex23 (Fri, 7 Aug 2009 06:53:22 -0700 (PDT))
> Thorsten Kampe  wrote:
> > Bollocks. No one will even notice whether a code sequence runs 2.7 or
> > 5.7 seconds. That's completely artificial benchmarking.
> 
> But that's not what you first claimed:
> 
> > I don't think any measurable speed increase will be
> > noticeable between those two.
> 
> But please, keep changing your argument so you don't have to admit you
> were wrong.

Bollocks. Please note the word "noticeable". "Noticeable" as in 
recognisable, as in reasonably experienceable, or as in whatever.

One guy claims he has times between 2.7 and 5.7 seconds when 
benchmarking more or less randomly generated "one million different 
lines". That *is* *exactly* nothing.

Another guy claims he gets times between 2.9 and 6.2 seconds when 
running decode/unicode in various manifestations over "18 million 
words" (or is it 600 million?) and says "the differences are pretty 
significant". I think I don't have to comment on that.

If you increase the number of loops to one million or one billion or 
whatever even the slightest completely negligible difference will occur. 
The same thing will happen if you just increase the corpus of words to a 
million, trillion or whatever. The performance implications of that are 
exactly none.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-08 Thread Thorsten Kampe
* Michael Ströder (Sat, 08 Aug 2009 15:09:23 +0200)
> Thorsten Kampe wrote:
> > * Steven D'Aprano (08 Aug 2009 03:29:43 GMT)
> >> But why assume that the program takes 8 minutes to run? Perhaps it takes 
> >> 8 seconds to run, and 6 seconds of that is the decoding. Then halving 
> >> that reduces the total runtime from 8 seconds to 5, which is a noticeable 
> >> speed increase to the user, and significant if you then run that program 
> >> tens of thousands of times.
> > 
> > Exactly. That's why it doesn't make sense to benchmark decode()/unicode() 
> > in isolation - meaning out of the context of your actual program.
> 
> Thorsten, the point is you're too arrogant to admit that making such a general
> statement as you did, without knowing *anything* about the context, is simply
> false.

I made a general statement in reply to a very general question ("Both 
expressions are equivalent, but which one is faster or should be preferred 
for any reason?"). If you have specific needs or reasons then you obviously 
failed to provide that specific "context" in your question.
 
> >> By all means remind people that premature optimization is a 
> >> waste of time, but it's possible to take that attitude too far, to Planet 
> >> Bizarro. At the point that you start insisting, and emphasising, that a 
> >> three second time difference is "*exactly*" zero,
> > 
> > Exactly. Because it was not generated in a real world use case but by 
> > running a simple loop one million times. Why one million times? Because 
> > by running it "only" one hundred thousand times the difference would 
> > have seemed even less relevant.
> 
> I was running it one million times to mitigate influences on the timing by
> other background processes, which is a common technique when benchmarking.

Err, no. That is what "repeat" is for and it defaults to 3 ("This means 
that other processes running on the same computer may interfere with the 
timing. The best thing to do when accurate timing is necessary is to 
repeat the timing a few times and use the best time. [...] the default 
of 3 repetitions is probably enough in most cases.")

Three times - not one million times. You choose one million times (for 
the loop) when the thing you're testing is very fast (like decoding) and 
you don't want results in the 0.0n range. Which is what you asked 
for and what you got.

> > I already gave good advice:
> > 1. don't benchmark
> > 2. don't benchmark until you have an actual performance issue
> > 3. if you benchmark, then benchmark the whole application, not single commands
> 
> You don't know anything about what I'm doing and what my aim is. So your
> general rules don't apply.

See above. You asked a general question, you got a general answer.
 
> > It's really easy: Michael has working code. With that he can easily 
> > write two versions - one that uses decode() and one that uses unicode().
> 
> Yes, I have working code which was originally written before .decode() was
> added in Python 2.2. Therefore I wondered whether it would be nice for
> readability to replace unicode() by s.decode(), since the software does not
> support Python versions prior to 2.3 anymore anyway. But one aspect is also
> performance and hence my question and testing.

You haven't done any testing yet. Running decode/unicode one million 
times in a loop is not testing. If you don't believe me, then at least 
read Martelli's Optimization chapter in Python in a Nutshell (the 
chapter is available via Google Books).

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-08 Thread Michael Fötsch

Michael Ströder wrote:
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000)
> 17.23644495010376
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000)
> 72.087096929550171
>
> That is significant! So the winner is:
>
> unicode('äöüÄÖÜß','utf-8')

Which proves that benchmark results can be misleading sometimes. :-)

unicode() becomes *slower* when you try "UTF-8" in uppercase, or an 
entirely different codec, say "cp1252":


  >>> timeit.Timer("unicode('äöüÄÖÜß','UTF-8')").timeit(100)
  2.5777881145477295
  >>> timeit.Timer("'äöüÄÖÜß'.decode('UTF-8')").timeit(100)
  1.8430399894714355
  >>> timeit.Timer("unicode('äöüÄÖÜß','cp1252')").timeit(100)
  2.3622498512268066
  >>> timeit.Timer("'äöüÄÖÜß'.decode('cp1252')").timeit(100)
  1.7812771797180176

The reason seems to be that unicode() bypasses codecs.lookup() if the 
encoding is one of "utf-8", "latin-1", "mbcs", or "ascii". OTOH, 
str.decode() always calls codecs.lookup().


If speed is your primary concern, this will give you even better 
performance than unicode():


  import codecs

  decoder = codecs.lookup("utf-8").decode
  for i in xrange(1000000):
      decoder("äöüÄÖÜß")[0]


However, there's also a functional difference between unicode() and 
str.decode():


unicode() always raises an exception when you try to decode a unicode 
object. str.decode() will first try to encode a unicode object using the 
default encoding (usually "ascii"), which might or might not work.
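
For example (a sketch of the behaviour just described, assuming the usual 
"ascii" default encoding; exact messages may vary by version):

  >>> unicode(u'abc', 'utf-8')
  Traceback (most recent call last):
    ...
  TypeError: decoding Unicode is not supported
  >>> u'abc'.decode('utf-8')   # implicitly encodes via ASCII first - works
  u'abc'
  >>> u'äöü'.decode('utf-8')   # the implicit ASCII encode fails here
  Traceback (most recent call last):
    ...
  UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)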


Kind Regards,
M.F.

--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-08 Thread Michael Ströder
Thorsten Kampe wrote:
> * Steven D'Aprano (08 Aug 2009 03:29:43 GMT)
>> But why assume that the program takes 8 minutes to run? Perhaps it takes 
>> 8 seconds to run, and 6 seconds of that is the decoding. Then halving 
>> that reduces the total runtime from 8 seconds to 5, which is a noticeable 
>> speed increase to the user, and significant if you then run that program 
>> tens of thousands of times.
> 
> Exactly. That's why it doesn't make sense to benchmark decode()/unicode() 
> in isolation - meaning out of the context of your actual program.

Thorsten, the point is you're too arrogant to admit that making such a general
statement as you did, without knowing *anything* about the context, is simply
false. So this is not a technical matter. It's mainly an issue with your 
attitude.

>> By all means remind people that premature optimization is a 
>> waste of time, but it's possible to take that attitude too far, to Planet 
>> Bizarro. At the point that you start insisting, and emphasising, that a 
>> three second time difference is "*exactly*" zero,
> 
> Exactly. Because it was not generated in a real world use case but by 
> running a simple loop one million times. Why one million times? Because 
> by running it "only" one hundred thousand times the difference would 
> have seemed even less relevant.

I was running it one million times to mitigate influences on the timing by
other background processes, which is a common technique when benchmarking. I
was mainly interested in the percentage, which is indeed significant. The
absolute times also strongly depend on the hardware the software is running
on. So your comments about the absolute times are complete nonsense. I want
this software to run with acceptable response times on hardware much slower
than my development machine.

> I already gave good advice:
> 1. don't benchmark
> 2. don't benchmark until you have an actual performance issue
> 3. if you benchmark, then benchmark the whole application, not single commands

You don't know anything about what I'm doing and what my aim is. So your
general rules don't apply.

> It's really easy: Michael has working code. With that he can easily 
> write two versions - one that uses decode() and one that uses unicode().

Yes, I have working code which was originally written before .decode() was
added in Python 2.2. Therefore I wondered whether it would be nice for
readability to replace unicode() by s.decode(), since the software does not
support Python versions prior to 2.3 anymore anyway. But one aspect is also
performance and hence my question and testing.

Ciao, Michael.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-08 Thread Thorsten Kampe
* garabik-news-2005...@kassiopeia.juls.savba.sk (Fri, 7 Aug 2009 
17:41:38 +0000 (UTC))
> Thorsten Kampe  wrote:
> > If you increase the number of loops to one million or one billion or
> > whatever even the slightest completely negligible difference will
> > occur. The same thing will happen if you just increase the corpus of
> > words to a million, trillion or whatever. The performance
> > implications of that are exactly none.
> 
> I am not sure I understood that. Must be my English :-)

I guess you understand me very well and I understand you very well. If 
the performance gain you want to prove doesn't show with 600,000 words, 
you test again with 18,000,000 words and if that is not impressive 
enough with 600,000,000 words. Great.

Or if a million repetitions of your "improved" code don't show the 
expected "performance advantage" you run it a billion times. Even 
greater. Keep on optimizing.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-08 Thread Thorsten Kampe
* alex23 (Fri, 7 Aug 2009 10:45:29 -0700 (PDT))
> garabik-news-2005...@kassiopeia.juls.savba.sk wrote:
> > I am not sure I understood that. Must be my English :-)
> 
> I just parsed it as "blah blah blah I won't admit I'm wrong" and
> didn't miss anything substantive.

Alex, there are still a number of performance optimizations that require 
a thorough optimizer like you. Like using short identifiers instead of 
long ones. I guess you could easily prove that by comparing "a = 0" to 
"a_long_identifier = 0" and running it one hundred trillion times. The 
performance gain could easily add up to *days*. Keep us updated.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-08 Thread Thorsten Kampe
* Steven D'Aprano (08 Aug 2009 03:29:43 GMT)
> On Fri, 07 Aug 2009 17:13:07 +0200, Thorsten Kampe wrote:
> > One guy claims he has times between 2.7 and 5.7 seconds when
> > benchmarking more or less randomly generated "one million different
> > lines". That *is* *exactly* nothing.
> 
> We agree that in the grand scheme of things, a difference of 2.7 seconds 
> versus 5.7 seconds is a trivial difference if your entire program takes 
> (say) 8 minutes to run. You won't even notice it.

Exactly.

> But why assume that the program takes 8 minutes to run? Perhaps it takes 
> 8 seconds to run, and 6 seconds of that is the decoding. Then halving 
> that reduces the total runtime from 8 seconds to 5, which is a noticeable 
> speed increase to the user, and significant if you then run that program 
> tens of thousands of times.

Exactly. That's why it doesn't make sense to benchmark decode()/unicode() 
in isolation - meaning out of the context of your actual program.

> By all means remind people that premature optimization is a 
> waste of time, but it's possible to take that attitude too far, to Planet 
> Bizarro. At the point that you start insisting, and emphasising, that a 
> three second time difference is "*exactly*" zero,

Exactly. Because it was not generated in a real world use case but by 
running a simple loop one million times. Why one million times? Because 
by running it "only" one hundred thousand times the difference would 
have seemed even less relevant.

> it seems to me that this is about you winning rather than you giving
> good advice.

I already gave good advice:
1. don't benchmark
2. don't benchmark until you have an actual performance issue
3. if you benchmark, then benchmark the whole application, not single commands

It's really easy: Michael has working code. With that he can easily 
write two versions - one that uses decode() and one that uses unicode(). 
He can benchmark these with some real world input he often uses by 
running it a hundred or a thousand times (even a million if he likes). 
Then he can compare the results. I doubt that there will be any 
noticeable difference.
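
Something along these lines (a sketch only; the input file name and the 
helper are placeholders, not Michael's actual code):

    import time

    def bench(convert, lines):
        # time one full pass over real-world data, not a synthetic loop
        start = time.time()
        for line in lines:
            convert(line)
        return time.time() - start

    lines = open('real_input.txt').readlines()  # hypothetical real input
    print bench(lambda s: s.decode('utf-8'), lines)
    print bench(lambda s: unicode(s, 'utf-8'), lines)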

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-07 Thread Steven D'Aprano
On Fri, 07 Aug 2009 17:13:07 +0200, Thorsten Kampe wrote:

> One guy claims he has times between 2.7 and 5.7 seconds when
> benchmarking more or less randomly generated "one million different
> lines". That *is* *exactly* nothing.


We agree that in the grand scheme of things, a difference of 2.7 seconds 
versus 5.7 seconds is a trivial difference if your entire program takes 
(say) 8 minutes to run. You won't even notice it.

But why assume that the program takes 8 minutes to run? Perhaps it takes 
8 seconds to run, and 6 seconds of that is the decoding. Then halving 
that reduces the total runtime from 8 seconds to 5, which is a noticeable 
speed increase to the user, and significant if you then run that program 
tens of thousands of times.

The Python dev team spend significant time and effort to get improvements 
of the order of 10%, and you're pooh-poohing an improvement of the order 
of 100%. By all means remind people that premature optimization is a 
waste of time, but it's possible to take that attitude too far, to Planet 
Bizarro. At the point that you start insisting, and emphasising, that a 
three second time difference is "*exactly*" zero, it seems to me that 
this is about you winning rather than you giving good advice.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-07 Thread Steven D'Aprano
On Fri, 07 Aug 2009 12:00:42 +0200, Thorsten Kampe wrote:

> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.

You think users won't notice a doubling of execution time? Well, that 
explains some of the apps I'm forced to use...

A two-second running time for (say) a command-line tool is already 
noticeable. A five-second one is *very* noticeable -- long enough to be a 
drag, short enough that you aren't tempted to go off and do something 
else while you're waiting for it to finish.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-07 Thread alex23
garabik-news-2005...@kassiopeia.juls.savba.sk wrote:
> I am not sure I understood that. Must be my English :-)

I just parsed it as "blah blah blah I won't admit I'm wrong" and
didn't miss anything substantive.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-07 Thread alex23
Thorsten Kampe  wrote:
> Bollocks. No one will even notice whether a code sequence runs 2.7 or
> 5.7 seconds. That's completely artificial benchmarking.

But that's not what you first claimed:

> I don't think any measurable speed increase will be
> noticeable between those two.

But please, keep changing your argument so you don't have to admit you
were wrong.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-07 Thread garabik-news-2005-05
Thorsten Kampe  wrote:
> * Steven D'Aprano (06 Aug 2009 19:17:30 GMT)
>> What if you're writing a loop which takes one million different lines of 
>> text and decodes them once each?
>> 
>> >>> setup = 'L = ["abc"*(n%100) for n in xrange(100)]'
>> >>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
>> >>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
>> >>> t1.timeit(number=1)
>> 5.6751680374145508
>> >>> t2.timeit(number=1)
>> 2.682251165771
>> 
>> Seems like a pretty meaningful difference to me.
> 
> Bollocks. No one will even notice whether a code sequence runs 2.7 or 
> 5.7 seconds. That's completely artificial benchmarking.
>

For a real-life example: I often have a file with one word per line, and
I run python scripts to apply some (sometimes fairly trivial)
transformation over it. REAL example: reading lines with word, lemma,
tag separated by tabs from stdin and writing word to stdout, unless it
starts with '<' (~6e5 lines, python2.5, user times, warm cache, I hope
the comments are self-explanatory)

no unicode
user 0m2.380s

decode('utf-8'), encode('utf-8')
user 0m3.560s

sys.stdout = codecs.getwriter('utf-8')(sys.stdout); sys.stdin = 
codecs.getreader('utf-8')(sys.stdin)
user 0m6.180s

unicode(line, 'utf8'), encode('utf-8')
user 0m3.820s

unicode(line, 'utf-8'), encode('utf-8')
user 0m2.880s

python3.1
user 0m1.560s
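
For reference, a minimal sketch of the decode('utf-8')/encode('utf-8') 
variant above (the exact field handling is my guess, not the original 
script):

    import sys

    for line in sys.stdin:
        # columns: word, lemma, tag - separated by tabs
        word = line.split('\t')[0].decode('utf-8')
        if not word.startswith(u'<'):
            sys.stdout.write(word.encode('utf-8') + '\n')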

Since I have something like 18 million words in my current project (and
more than 600 million overall) and I often tweak some parameters and re-run
the transformations, the differences are pretty significant.

Personally, I have been surprised by:
1) the bad performance of the codecs wrapper (I expected it to be on par with
   unicode(x,'utf-8'), maybe slightly better due to fewer function calls)
2) the good performance of python3.1 (utf-8 locale)


-- 
 ---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__garabik @ kassiopeia.juls.savba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-07 Thread Thorsten Kampe
* Steven D'Aprano (06 Aug 2009 19:17:30 GMT)
> On Thu, 06 Aug 2009 20:05:52 +0200, Thorsten Kampe wrote:
> > > That is significant! So the winner is:
> > > 
> > > unicode('äöüÄÖÜß','utf-8')
> > 
> > Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> > million times, these benchmarks are meaningless.
> 
> What if you're writing a loop which takes one million different lines of 
> text and decodes them once each?
> 
> >>> setup = 'L = ["abc"*(n%100) for n in xrange(100)]'
> >>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
> >>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
> >>> t1.timeit(number=1)
> 5.6751680374145508
> >>> t2.timeit(number=1)
> 2.682251165771
> 
> Seems like a pretty meaningful difference to me.

Bollocks. No one will even notice whether a code sequence runs 2.7 or 
5.7 seconds. That's completely artificial benchmarking.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-07 Thread Steven D'Aprano
On Fri, 07 Aug 2009 08:04:51 +0100, Mark Lawrence wrote:

> I believe that the comment "these benchmarks are meaningless" refers to
> the length of the strings being used in the tests.  Surely something
> involving thousands or millions of characters is more meaningful? Or to
> go the other way, you are unlikely to write
>
> for c in 'äöüÄÖÜß':
>     u = unicode(c, 'utf-8')
>     ...
>
> Yes?

There are all sorts of potential use-cases. A day or two ago, somebody 
posted a question involving tens of thousands of lines of tens of 
thousands of characters each (don't quote me, I'm going by memory). On 
the other hand, it doesn't require much imagination to think of a use-
case where there are millions of lines each of a dozen or so characters, 
and you want to process it line by line:


noun: cat
noun: dog
verb: café
...


As always, before optimizing, you should profile to be sure you are 
actually optimizing and not wasting your time.
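
For instance, a quick way to check whether the decoding even shows up near 
the top of the profile (assuming your script has some main() entry point -- 
a hypothetical name here):

    import cProfile
    cProfile.run('main()', sort='cumulative')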



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-07 Thread Mark Lawrence

Michael Ströder wrote:
> Thorsten Kampe wrote:
>> * Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
>>> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
>>> 17.23644495010376
>>> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
>>> 72.087096929550171
>>>
>>> That is significant! So the winner is:
>>>
>>> unicode('äöüÄÖÜß','utf-8')
>>
>> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one 
>> million times, these benchmarks are meaningless.
>
> Well, I can tell you I would not have posted this here and checked it if it
> were meaningless to me. You don't have to read and answer this thread if
> it's meaningless to you.
>
> Ciao, Michael.

I believe that the comment "these benchmarks are meaningless" refers to 
the length of the strings being used in the tests.  Surely something 
involving thousands or millions of characters is more meaningful? Or to 
go the other way, you are unlikely to write

for c in 'äöüÄÖÜß':
    u = unicode(c, 'utf-8')
    ...

Yes?

--
Kindest regards.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-06 Thread John Machin
Jason Tackaberry  urandom.ca> writes:

> On Thu, 2009-08-06 at 01:31 +0000, John Machin wrote:

> > Suggested further avenues of investigation:
> > 
> > (1) Try the timing again with "cp1252" and "utf8" and "utf_8"
> > 
> > (2) grep "utf-8" /Objects/unicodeobject.c
> 
> Very pedagogical of you. :)  Indeed, it looks like the bigger player in the
> performance difference is the fact that the code path for unicode(s,
> enc) short-circuits the codec registry for common encodings (which
> includes 'utf-8' specifically), whereas s.decode('utf-8') necessarily
> consults the codec registry.

So the next question (the answer to which may benefit all users
of .encode() and .decode()) is:

Why does consulting the codec registry take so long,
and can this be improved?
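
A starting point for measuring the registry in isolation might be (a 
sketch; note that the registry caches search results after the first hit, 
so this mostly measures the call and cache-lookup overhead):

>>> import timeit
>>> timeit.Timer("codecs.lookup('utf-8')", "import codecs").timeit(1000000)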



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-06 Thread Michael Ströder
Thorsten Kampe wrote:
> * Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000)
>> 17.23644495010376
> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000)
>> 72.087096929550171
>>
>> That is significant! So the winner is:
>>
>> unicode('äöüÄÖÜß','utf-8')
> 
> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one 
> million times, these benchmarks are meaningless.

Well, I can tell you I would not have posted this here and checked it if it
were meaningless to me. You don't have to read and answer this thread if
it's meaningless to you.

Ciao, Michael.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-06 Thread Steven D'Aprano
On Thu, 06 Aug 2009 20:05:52 +0200, Thorsten Kampe wrote:

> > That is significant! So the winner is:
> > 
> > unicode('äöüÄÖÜß','utf-8')
> 
> Unless you are planning to write a loop that decodes "äöüÄÖÜß" one
> million times, these benchmarks are meaningless.

What if you're writing a loop which takes one million different lines of 
text and decodes them once each?


>>> setup = 'L = ["abc"*(n%100) for n in xrange(100)]'
>>> t1 = timeit.Timer('for line in L: line.decode("utf-8")', setup)
>>> t2 = timeit.Timer('for line in L: unicode(line, "utf-8")', setup)
>>> t1.timeit(number=1)
5.6751680374145508
>>> t2.timeit(number=1)
2.682251165771


Seems like a pretty meaningful difference to me.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-06 Thread Thorsten Kampe
* Michael Ströder (Thu, 06 Aug 2009 18:26:09 +0200)
> Thorsten Kampe wrote:
> > * Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
> > I don't think any measurable speed increase will be noticeable
> > between those two.
> 
> Well, it seems not to be true. Try it yourself. I did (my console has UTF-8 
> as charset):
> 
> Python 2.6 (r26:66714, Feb  3 2009, 20:52:03)
> [GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import timeit
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf-8')").timeit(100)
> 7.2721178531646729
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(100)
> 7.1302499771118164
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(100)
> 8.3726329803466797
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(100)
> 1.8622009754180908
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(100)
> 8.651669979095459
> >>>
> 
> Comparing again the two best combinations:
> 
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000)
> 17.23644495010376
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000)
> 72.087096929550171
> 
> That is significant! So the winner is:
> 
> unicode('äöüÄÖÜß','utf-8')

Unless you are planning to write a loop that decodes "äöüÄÖÜß" one 
million times, these benchmarks are meaningless.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-06 Thread Michael Ströder
Thorsten Kampe wrote:
> * Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
>> Both expressions are equivalent, but which one is faster or should be
>> preferred for any reason?
>>
>> u = unicode(s,'utf-8')
>>
>> u = s.decode('utf-8') # looks nicer
> 
> "decode" was added in Python 2.2 for the sake of symmetry to encode(). 

Yes, and I like the style. But...

> It's essentially the same as unicode() and I wouldn't be surprised if it 
> is exactly the same.

Did you try?

> I don't think any measurable speed increase will be noticeable between
> those two.

Well, it seems not to be true. Try it yourself. I did (my console has UTF-8 as 
charset):

Python 2.6 (r26:66714, Feb  3 2009, 20:52:03)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf-8')").timeit(100)
7.2721178531646729
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(100)
7.1302499771118164
>>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(100)
8.3726329803466797
>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(100)
1.8622009754180908
>>> timeit.Timer("unicode('äöüÄÖÜß','utf8')").timeit(100)
8.651669979095459
>>>

Comparing again the two best combinations:

>>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(1000)
17.23644495010376
>>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(1000)
72.087096929550171

That is significant! So the winner is:

unicode('äöüÄÖÜß','utf-8')

Ciao, Michael.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-06 Thread Thorsten Kampe
* Michael Ströder (Wed, 05 Aug 2009 16:43:09 +0200)
> Both expressions are equivalent, but which one is faster or should be
> preferred for any reason?
> 
> u = unicode(s,'utf-8')
> 
> u = s.decode('utf-8') # looks nicer

"decode" was added in Python 2.2 for the sake of symmetry to encode(). 
It's essentially the same as unicode() and I wouldn't be surprised if it 
is exactly the same. I don't think any measurable speed increase will be 
noticeable between those two.

Thorsten
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-06 Thread Jason Tackaberry
On Thu, 2009-08-06 at 01:31 +0000, John Machin wrote:
> Faster by an enormous margin; attributing this to the cost of attribute lookup
> seems implausible.

Ok, fair point.  I don't think the time difference fully registered when
I composed that message.

Testing a global access (LOAD_GLOBAL) versus an attribute access on a
global object (LOAD_GLOBAL + LOAD_ATTR) shows that the latter is about
40% slower than the former.  So that certainly doesn't account for the
difference.
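
Roughly how that comparison can be reproduced, for the curious (a sketch; 
the exact ratio will vary by interpreter and machine):

>>> import timeit
>>> timeit.timeit('unicode')      # LOAD_GLOBAL only
>>> timeit.timeit('str.decode')   # LOAD_GLOBAL + LOAD_ATTR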


> Suggested further avenues of investigation:
> 
> (1) Try the timing again with "cp1252" and "utf8" and "utf_8"
> 
> (2) grep "utf-8" /Objects/unicodeobject.c

Very pedagogical of you. :)  Indeed, it looks like the bigger player in the
performance difference is the fact that the code path for unicode(s,
enc) short-circuits the codec registry for common encodings (which
includes 'utf-8' specifically), whereas s.decode('utf-8') necessarily
consults the codec registry.

Cheers,
Jason.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-05 Thread John Machin
Jason Tackaberry  urandom.ca> writes:
> On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:
> > Both expressions are equivalent, but which one is faster or should be
> > preferred for any reason?
> > u = unicode(s,'utf-8')
> > u = s.decode('utf-8') # looks nicer
> 
> It is sometimes non-obvious which constructs are faster than others in
> Python.  I also regularly have these questions, but it's pretty easy to
> run quick (albeit naive) benchmarks to see.
> 
> The first thing to try is to have a look at the bytecode for each:
[snip] 
> The presence of LOAD_ATTR in the first form hints that this is probably
> going to be slower.   Next, actually try it:
> 
> >>> import timeit
> >>> timeit.timeit('"foobarbaz".decode("utf-8")')
> 1.698289155960083
> >>> timeit.timeit('unicode("foobarbaz", "utf-8")')
> 0.53305888175964355
> 
> So indeed, unicode(s, 'utf-8') is faster by a fair margin.

Faster by an enormous margin; attributing this to the cost of attribute lookup
seems implausible.

Suggested further avenues of investigation:

(1) Try the timing again with "cp1252" and "utf8" and "utf_8"

(2) grep "utf-8" /Objects/unicodeobject.c

HTH,
John

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-05 Thread 1x7y2z9
unicode() has a LOAD_GLOBAL which s.decode() does not.  Is it generally
the case that LOAD_ATTR is slower than LOAD_GLOBAL, and is that what led
to your intuition that the first form would probably be slower?  Or some
other intuition?
Of course, the results from timeit are a different thing - I am asking
about the intuition behind the disassembler output.
Thanks.

>
> The presence of LOAD_ATTR in the first form hints that this is probably
> going to be slower.   Next, actually try it:
>
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode() vs. s.decode()

2009-08-05 Thread Jason Tackaberry
On Wed, 2009-08-05 at 16:43 +0200, Michael Ströder wrote:
> Both expressions are equivalent, but which one is faster or should be
> preferred for any reason?
>
> u = unicode(s,'utf-8')
> 
> u = s.decode('utf-8') # looks nicer

It is sometimes non-obvious which constructs are faster than others in
Python.  I also regularly have these questions, but it's pretty easy to
run quick (albeit naive) benchmarks to see.

The first thing to try is to have a look at the bytecode for each:

>>> import dis
>>> dis.dis(lambda s: s.decode('utf-8'))
  1           0 LOAD_FAST                0 (s)
              3 LOAD_ATTR                0 (decode)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            1
             12 RETURN_VALUE
>>> dis.dis(lambda s: unicode(s, 'utf-8'))
  1           0 LOAD_GLOBAL              0 (unicode)
              3 LOAD_FAST                0 (s)
              6 LOAD_CONST               0 ('utf-8')
              9 CALL_FUNCTION            2
             12 RETURN_VALUE

The presence of LOAD_ATTR in the first form hints that this is probably
going to be slower.   Next, actually try it:

>>> import timeit
>>> timeit.timeit('"foobarbaz".decode("utf-8")')
1.698289155960083
>>> timeit.timeit('unicode("foobarbaz", "utf-8")')
0.53305888175964355

So indeed, unicode(s, 'utf-8') is faster by a fair margin.

On the other hand, unless you need to do this in a tight loop several
tens of thousands of times, I'd prefer the slower form s.decode('utf-8')
because it's, as you pointed out, cleaner and more readable code.

Cheers,
Jason.

-- 
http://mail.python.org/mailman/listinfo/python-list