Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Victor Stinner
I added a _PyUnicodeWriter internal API to optimize str%args and
str.format(args). It uses a buffer which is overallocated, so it's
basically like CPython str += str optimization. I still don't know how
efficient it is on Windows, since realloc() is slow on Windows (at
least on old Windows versions).

We should add an official and public API to concatenate strings. I
know that PyPy already has its own API. Example:

writer = UnicodeWriter()
for item in data:
    writer += item   # I guess that it's faster than writer.append(item)
return str(writer) # or writer.getvalue() ?

I don't care about the exact implementation of UnicodeWriter; it just
has to be as fast as or faster than ''.join(data).
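
A minimal sketch of how such an API could look from the Python side, assuming a list-backed accumulator. The class name follows the example in this message; the list backing, `getvalue()` and the `concat()` helper are assumptions for illustration only, and this is not the `_PyUnicodeWriter` C implementation.

```python
class UnicodeWriter:
    """Hypothetical pure-Python accumulator; not an existing stdlib class."""

    def __init__(self):
        self._parts = []

    def __iadd__(self, text):
        # Appending to a list is amortized O(1); no string is copied here.
        self._parts.append(text)
        return self

    def getvalue(self):
        # A single linear pass builds the final string.
        return ''.join(self._parts)

    # str(writer) and writer.getvalue() return the same result.
    __str__ = getvalue


def concat(data):
    # Hypothetical helper wrapping the loop from the message in a function.
    writer = UnicodeWriter()
    for item in data:
        writer += item
    return str(writer)
```

Because this wrapper only defers to ''.join(), it cannot beat it; a real implementation would have to live in C to meet the speed requirement.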

I don't remember if _PyUnicodeWriter is faster than StringIO or
slower. I created an issue for that:
http://bugs.python.org/issue15612

Victor

2013/2/12 Maciej Fijalkowski:
> Hi
>
> We recently encountered a performance issue in stdlib for pypy. It
> turned out that someone committed a performance "fix" that uses += for
> strings instead of "".join() that was there before.
>
> Now this hurts pypy (we can mitigate it to some degree though) and
> possibly Jython and IronPython too.
>
> How do people feel about generally not having += on long strings in
> stdlib (since the refcount = 1 thing is a hack)?
>
> What about other performance improvements in stdlib that are
> problematic for pypy or others?
>
> Personally I would like cleaner code in stdlib vs speeding up CPython.
> Typically that also helps pypy so I'm not unbiased.
>
> Cheers,
> fijal
> ___
> Python-Dev mailing list
> [email protected]
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: 
> http://mail.python.org/mailman/options/python-dev/victor.stinner%40gmail.com
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Maciej Fijalkowski
On Wed, Feb 13, 2013 at 10:02 AM, Victor Stinner wrote:
> I added a _PyUnicodeWriter internal API to optimize str%args and
> str.format(args). It uses a buffer which is overallocated, so it's
> basically like CPython str += str optimization. I still don't know how
> efficient it is on Windows, since realloc() is slow on Windows (at
> least on old Windows versions).
>
> We should add an official and public API to concatenate strings. I
> know that PyPy already has its own API. Example:
>
> writer = UnicodeWriter()
> for item in data:
>     writer += item   # I guess that it's faster than writer.append(item)
> return str(writer) # or writer.getvalue() ?
>
> I don't care about the exact implementation of UnicodeWriter; it just
> has to be as fast as or faster than ''.join(data).
>
> I don't remember if _PyUnicodeWriter is faster than StringIO or
> slower. I created an issue for that:
> http://bugs.python.org/issue15612
>
> Victor

It's in __pypy__.builders (StringBuilder and UnicodeBuilder). The API
does not really matter, as long as there is a way to preallocate a
certain size (which I don't think there is in StringIO, for example).
bytearray comes close but has a relatively inconvenient API, and any
pure-Python bytearray wrapper will not be fast on CPython.


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Lennart Regebro
On Tue, Feb 12, 2013 at 10:03 PM, Maciej Fijalkowski  wrote:
> Hi
>
> We recently encountered a performance issue in stdlib for pypy. It
> turned out that someone committed a performance "fix" that uses += for
> strings instead of "".join() that was there before.

Can someone show the actual diff? Of this?

I'm making a talk about outdated patterns in Python at DjangoCon EU,
prompted by this question, and obsessive avoidance of string
concatenation. But all the tests I've done show that ''.join() still
is faster or as fast, except when you are joining very few strings,
like for example two strings, in which case concatenation is faster or
as fast. Both under PyPy and CPython. So I'd like to know in which
case ''.join() is faster on PyPy and += faster on CPython.

Code with times

x = 10
s1 = 'X'* x
s2 = 'X'* x

for i in xrange(500):
 s1 += s2

Python 3.3: 0.049 seconds
PyPy 1.9: 24.217 seconds

PyPy indeed is much much slower than CPython here.
But let's look at the join case:

x = 10
s1 = 'X'* x
s2 = 'X'* x

for i in xrange(500):
 s1 = ''.join((s1, s2))

Python 3.3: 18.969 seconds
PyPy 1.9: 62.539 seconds

Here PyPy needs about 2.6 times as long, and CPython needs 387 times as
long. Both are slower.

The best case is of course to make a long list of strings and join them:

x = 10
s1 = 'X'* x
s2 = 'X'* x

l = [s1]
for i in xrange(500):
 l.append(s2)

s1 = ''.join(l)

Python 3.3: 0.052 seconds
PyPy 1.9: 0.117 seconds

That's not always feasible though.
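
For anyone who wants to repeat the comparison, here is a hedged sketch of the three variants as functions with a small timing harness. It is written for Python 3 (range instead of xrange); the function names, the measure() helper and the default sizes are assumptions for illustration, not the script used for the numbers above.

```python
import timeit


def concat_inplace(s1, s2, n):
    # Repeated += on a string.
    for _ in range(n):
        s1 += s2
    return s1


def concat_join_pairs(s1, s2, n):
    # ''.join() on a two-item tuple in every iteration.
    for _ in range(n):
        s1 = ''.join((s1, s2))
    return s1


def concat_join_list(s1, s2, n):
    # Collect the pieces in a list and join once at the end.
    parts = [s1]
    for _ in range(n):
        parts.append(s2)
    return ''.join(parts)


def measure(func, size=10, n=500, repeat=3):
    # size and n are placeholders; raise them to make the differences visible.
    s1 = 'X' * size
    s2 = 'X' * size
    return min(timeit.repeat(lambda: func(s1, s2, n), number=1, repeat=repeat))


if __name__ == '__main__':
    for func in (concat_inplace, concat_join_pairs, concat_join_list):
        print(func.__name__, measure(func))
```

All three functions return the same string, so only the running time differs between interpreters.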


//Lennart


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Larry Hastings

On 02/12/2013 05:25 PM, Christian Tismer wrote:

Ropes have been implemented by Carl-Friedrich Bolz in 2007 as I remember.
No idea what the impact was, if any at all.
Would ropes be an answer (and a simple way to cope with string mutation
patterns) as an alternative implementation, and therefore still justify
the usage of that pattern?


I've always hated the "".join(array) idiom for "fast" string 
concatenation--it's ugly and it flies in the face of TOOWTDI.  I think 
everyone should use "x = a + b + c + d" for string concatenation, and we 
should just make that fast.


In 2006 I proposed "lazy string concatenation", a sort of rope that hid 
the details inside the string object.  If a and b are strings, a+b 
returned a string object that internally lazily contained references to 
a and b, and only computed its value if you asked for it.  Here's the 
Unicode version:


   http://bugs.python.org/issue1629305

Why didn't it get accepted?  I lumped in lazy slicing, a bad move as it 
was more controversial.  That and the possibility that macros like 
PyUnicode_AS_UNICODE could now possibly fail, which would have meant 
checking 400+ call sites to ensure they handle the possibility of 
failure.  This latter work has already happened with the new efficient 
Unicode representation patch.


I keep thinking it's time to revive the lazy string concatenation patch.
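
To illustrate the idea only, here is a hedged Python-level model of lazy concatenation. The LazyConcat class and its attributes are hypothetical; the patch in the issue above works inside the C string object and is not reproduced here.

```python
class LazyConcat:
    """Hypothetical model: a + b keeps references and renders on demand."""

    def __init__(self, left, right):
        self._left = left
        self._right = right
        self._value = None  # filled in the first time the value is requested

    def __add__(self, other):
        # O(1): no characters are copied when concatenating.
        return LazyConcat(self, other)

    def __str__(self):
        if self._value is None:
            parts = []
            stack = [self]
            # Iterative left-to-right walk, so long chains do not hit the
            # recursion limit.
            while stack:
                node = stack.pop()
                if isinstance(node, LazyConcat):
                    if node._value is not None:
                        parts.append(node._value)
                    else:
                        stack.append(node._right)
                        stack.append(node._left)
                else:
                    parts.append(node)  # assumed to be a plain str leaf
            self._value = ''.join(parts)
            # Drop the references once the value has been computed.
            self._left = self._right = None
        return self._value
```

In this model, x = LazyConcat(a, b) + c + d copies no characters until str(x) is called, and the render is a single linear join.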


//arry/


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Chris Withers

On 12/02/2013 21:03, Maciej Fijalkowski wrote:

We recently encountered a performance issue in stdlib for pypy. It
turned out that someone committed a performance "fix" that uses += for
strings instead of "".join() that was there before.


That's... interesting.

I fixed a performance bug in httplib some years ago by doing the exact 
opposite; += -> ''.join(). In that case, it changed downloading a file 
from 20 minutes to 3 seconds. That was likely on Python 2.5.



How do people feel about generally not having += on long strings in
stdlib (since the refcount = 1 thing is a hack)?


+1 from me.

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Antoine Pitrou
Le Wed, 13 Feb 2013 09:02:07 +0100,
Victor Stinner  a écrit :
> I added a _PyUnicodeWriter internal API to optimize str%args and
> str.format(args). It uses a buffer which is overallocated, so it's
> basically like CPython str += str optimization. I still don't know how
> efficient it is on Windows, since realloc() is slow on Windows (at
> least on old Windows versions).
> 
> We should add an official and public API to concatenate strings.

There's io.StringIO already.
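
For reference, a small sketch of the io.StringIO pattern; the build() helper name is an assumption for illustration.

```python
import io


def build(parts):
    # Write each piece into an in-memory text buffer, then read it back once.
    buf = io.StringIO()
    for part in parts:
        buf.write(part)
    return buf.getvalue()
```

This gives the same result as ''.join(parts), with a file-like API that also suits code that already writes to streams.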

Regards

Antoine.




Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Serhiy Storchaka

On 12.02.13 23:03, Maciej Fijalkowski wrote:

How do people feel about generally not having += on long strings in
stdlib (since the refcount = 1 thing is a hack)?


Sometimes the use of += for strings or bytes is appropriate. For
example, I deliberately used += for bytes instead of b''.join() (note that
there is not even such a hack for bytes) in the zipfile module, where in most
cases one of the components is empty, and the concatenation of nonempty
components only happens once. b''.join() was noticeably slower here.




Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Steven D'Aprano

On 13/02/13 19:52, Larry Hastings wrote:


I've always hated the "".join(array) idiom for "fast" string concatenation
--it's ugly and it flies in the face of TOOWTDI. I think everyone should
use "x = a + b + c + d" for string concatenation, and we should just make
 that fast.



"".join(array) is much nicer looking than:

# ridiculous and impractical for more than a few items
array[0] + array[1] + array[2] + ... + array[N]

or:

# not an expression
result = ""
for s in array:
    result += s

or even:

# currently prohibited, and not obvious
sum(array, "")

although I will admit to a certain fondness towards

# even less obvious than sum
map(operator.add, array)


and join has been the obvious way to do repeated concatenation of many substrings since at
least Python 1.5, when it was spelled string.join(array [, sep=" "]).




--
Steven


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Serhiy Storchaka

On 13.02.13 09:52, Nick Coghlan wrote:

On Wed, Feb 13, 2013 at 5:42 PM, Alexandre Vassalotti wrote:

I don't think so. Ropes are really useful when you work with gigabytes of
data, but unfortunately they don't make good general-purpose strings.
Monolithic arrays are much more efficient and simple for the typical
use-cases we have in Python.


If I recall correctly, io.StringIO and io.BytesIO have been updated to
use ropes internally in 3.3.


io.BytesIO has not yet. But it will be in 3.4 (issue #15381).

On the other hand, there is a plan to rewrite StringIO to use a more
efficient contiguous buffer implementation (issue #15612).




Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Xavier Morel
On 2013-02-13, at 12:37 , Steven D'Aprano wrote:
> 
># even less obvious than sum
>map(operator.add, array)

That one does not work, it'll try to call the binary `add` with each
item of the array when the map iterator is reified, erroring out.

functools.reduce(operator.add, array, '')

would work though, it's another way to spell `sum` without the
string prohibition.
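
A short sketch of the difference, using a made-up sample list. Note that reduce with operator.add still builds every intermediate string, so it is no substitute for ''.join() on long inputs.

```python
import functools
import operator

array = ['spam', 'ham', 'eggs']  # sample data, for illustration only

# reduce folds the list left to right: (('' + 'spam') + 'ham') + 'eggs'
joined = functools.reduce(operator.add, array, '')

# sum() refuses a str start value and raises TypeError.
try:
    sum(array, '')
except TypeError as exc:
    sum_error = str(exc)
```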


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Steven D'Aprano

On 13/02/13 20:09, Chris Withers wrote:

On 12/02/2013 21:03, Maciej Fijalkowski wrote:

We recently encountered a performance issue in stdlib for pypy. It
turned out that someone committed a performance "fix" that uses += for
strings instead of "".join() that was there before.


That's... interesting.

I fixed a performance bug in httplib some years ago by doing the exact opposite; 
+= -> ''.join(). In that case, it changed downloading a file from 20 minutes to 
3 seconds. That was likely on Python 2.5.



I remember it well.

http://mail.python.org/pipermail/python-dev/2009-August/091125.html

I frequently link to this thread as an example of just how bad repeated string 
concatenation can be, how painful it can be to debug, and how even when the 
optimization is fast on one system, it may fail and be slow on another system.



--
Steven


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Steven D'Aprano

On 13/02/13 22:46, Xavier Morel wrote:

On 2013-02-13, at 12:37 , Steven D'Aprano wrote:


# even less obvious than sum
map(operator.add, array)


That one does not work, it'll try to call the binary `add` with each
item of the array when the map iterator is reified, erroring out.

 functools.reduce(operator.add, array, '')

would work though, it's another way to spell `sum` without the
string prohibition.


Oops, you are right of course, I was thinking reduce but it came out map.
Thanks for the correction.


--
Steven


Re: [Python-Dev] efficient string concatenation (yep, from 2004)

2013-02-13 Thread Christian Tismer

On 13.02.13 08:42, Lennart Regebro wrote:

Something is needed - a patch for PyPy or for the documentation I guess.

Not arguing that it wouldn't be good, but I disagree that it is needed.

This is only an issue when you, as in your proof, have a loop that
does concatenation. This is usually when looping over a list of
strings that should be concatenated together. Doing so in a loop with
concatenation may be the natural way for people new to Python, but the
"natural" way to do it in Python is with a ''.join() call.

This:

 s = ''.join(('X' for x in xrange(x)))

Is more than twice as fast in Python 2.7 than your example. It is in
fact also slower in PyPy 1.9 than Python 2.7, but only with a factor
of two:

Python 2.7:
time for 1000 concats = 0.887
Pypy 1.9:
time for 1000 concats = 1.600

(And of course s = 'X'* x takes only about a hundredth of the time,
but that's cheating. ;-)



This is not about how to write efficient concatenation, and it is not
about me. It is also not about a constant factor, which I don't really
care about except in situations where speed matters.

This is about a possible algorithmic trap, where code written for
CPython may behave well with some roughly O(n) behavior,
and by switching to PyPy you get a surprise when the same
code now has O(n**2) behavior. Such runtime explosions can damage
the trust in PyPy, with code sitting in some module which you did not
even write but merely "pip install"-ed.
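
The size of the gap can be sketched with a little arithmetic, assuming that every concatenation without the in-place optimization allocates a new string and copies the whole result. Appending n pieces of length k then copies k * (1 + 2 + ... + n) = k * n * (n + 1) / 2 characters, against n * k for a single join. The function names below are hypothetical.

```python
def chars_copied_naive(n, k):
    # Model: each append writes out the complete new string.
    total = 0
    length = 0
    for _ in range(n):
        length += k
        total += length
    return total


def chars_copied_join(n, k):
    # Model: one pass over all pieces when joining at the end.
    return n * k


# Closed form for the naive case: k * n * (n + 1) // 2
ratio = chars_copied_naive(1000, 10) / chars_copied_join(1000, 10)
```

Under this model the ratio between the two grows linearly with n, which is why code that looks fine on small inputs can explode on large ones.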

So this is important to know, especially for newcomers, and for people
who are giving advice to them.
For algorithmic compatibility, there should no longer
be a feature with this drastic side effect, if that cannot be supported by
all other dialects.

To avoid such hidden traps in larger code bases, documentation is
needed that clearly gives a warning saying "don't do that", like CS
students learn for most other languages.

cheers - chris

--
Christian Tismer :^)   
Software Consulting  : Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121 :*Starship* http://starship.python.net/
14482 Potsdam: PGP key -> http://pgp.uni-mainz.de
phone +49 173 24 18 776  fax +49 (30) 700143-0023
PGP 0x57F3BF04   9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
  whom do you want to sponsor today?   http://www.stackless.com/



Re: [Python-Dev] efficient string concatenation (yep, from 2004)

2013-02-13 Thread Steven D'Aprano

On 13/02/13 10:53, Christian Tismer wrote:

Hi friends,

_efficient string concatenation_ has been a topic in 2004.
Armin Rigo proposed a patch with the name of the subject,
more precisely:

[Patches] [ python-Patches-980695 ] efficient string concatenation
on sourceforge.net, on 2004-06-28.

This patch was finally added to Python 2.4 on 2004-11-30.

Some people might remember the larger discussion if such a patch should be
accepted at all, because it changes the programming style for many of us
from "don't do that, stupid" to "well, you may do it in CPython", which has
quite some impact on other implementations (is it fast on Jython, now?).


I disagree. If you look at the archives on the python-list@ and [email protected]
mailing lists, you will see that whenever string concatenation comes up, the
common advice given is to use join.

The documentation for strings is also clear that you should not rely on this
optimization:

http://docs.python.org/2/library/stdtypes.html#typesseq

And quadratic performance for repeated concatenation is not unique to Python:
it applies to pretty much any language with immutable strings, including Java,
C++, Lua and Javascript.



It changed for instance my programming and teaching style a lot, of course!


Why do you say, "Of course"? It should not have changed anything.

Best practice remains the same:

- we should still use join for repeated concatenations;

- we should still avoid + except for small cases which are not
  performance critical;

- we should still teach beginners to use join;

- while this optimization is nice to have, we cannot rely on it being there
  when it matters.

It's not just Jython and IronPython that can't make use of this optimization. It
can, and does, fail on CPython as well, as it is sensitive to memory
allocation details. See for example:

http://utcc.utoronto.ca/~cks/space/blog/python/ExaminingStringConcatOpt

and here for a cautionary tale about what can happen when the optimization fails
under CPython:

http://mail.python.org/pipermail/python-dev/2009-August/091125.html



--
Steven


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Serhiy Storchaka

On 13.02.13 10:52, Larry Hastings wrote:

I've always hated the "".join(array) idiom for "fast" string
concatenation--it's ugly and it flies in the face of TOOWTDI.  I think
everyone should use "x = a + b + c + d" for string concatenation, and we
should just make that fast.


I prefer "x = '%s%s%s%s' % (a, b, c, d)" when the number of strings is more
than 3 and some of them are literal strings.





Re: [Python-Dev] efficient string concatenation (yep, from 2004)

2013-02-13 Thread Christian Tismer

On 13.02.13 13:10, Steven D'Aprano wrote:

On 13/02/13 10:53, Christian Tismer wrote:

Hi friends,

_efficient string concatenation_ has been a topic in 2004.
Armin Rigo proposed a patch with the name of the subject,
more precisely:

[Patches] [ python-Patches-980695 ] efficient string concatenation
on sourceforge.net, on 2004-06-28.

This patch was finally added to Python 2.4 on 2004-11-30.

Some people might remember the larger discussion if such a patch should be
accepted at all, because it changes the programming style for many of us
from "don't do that, stupid" to "well, you may do it in CPython", which has
quite some impact on other implementations (is it fast on Jython, now?).


I disagree. If you look at the archives on the python-list@ and [email protected]
mailing lists, you will see that whenever string concatenation comes up, the
common advice given is to use join.

The documentation for strings is also clear that you should not rely on this
optimization:

http://docs.python.org/2/library/stdtypes.html#typesseq

And quadratic performance for repeated concatenation is not unique to Python:
it applies to pretty much any language with immutable strings, including Java,
C++, Lua and Javascript.


It changed for instance my programming and teaching style a lot, of course!


Why do you say, "Of course"? It should not have changed anything.


You are right, I was actually over the top with my rant and never recommend
string concatenation when working with real amounts of data.
The surprise was just so big.

I tend to use whatever fits best for small initialization of some modules,
where the fact that concat is cheap lets me stop thinking of big Oh.
Although it probably does not matter much, it makes me feel uncomfortable
to do something with potentially bad asymptotics.



Best practice remains the same:

- we should still use join for repeated concatenations;

- we should still avoid + except for small cases which are not
  performance critical;

- we should still teach beginners to use join;

- while this optimization is nice to have, we cannot rely on it being there
  when it matters.


I agree that CPython does say this clearly.
Actually I was complaining about the PyPy documentation which does not
mention this, and because PyPy is so very compatible already.

2004 when this stuff came up was the time where PyPy already was
quite active, but the Psyco mindset was still around, too.
Maybe my slightly shocked reaction originates from there, and my
implicit assumption was never corrected ;-)

cheers - chris



Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Daniel Holth
On Wed, Feb 13, 2013 at 7:10 AM, Serhiy Storchaka wrote:

> On 13.02.13 10:52, Larry Hastings wrote:
>
>> I've always hated the "".join(array) idiom for "fast" string
>> concatenation--it's ugly and it flies in the face of TOOWTDI.  I think
>> everyone should use "x = a + b + c + d" for string concatenation, and we
>> should just make that fast.
>>
>
> I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more than
> 3 and some of them are literal strings.
>

Fixed: x = ('%s' *  len(abcd)) % abcd


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Lennart Regebro
On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka  wrote:
> I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more than 3
> and some of them are literal strings.

This has the benefit of being slow both on CPython and PyPy. Although
using .format() is even slower. :-)

//Lennart


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Christian Tismer

On 13.02.13 14:17, Daniel Holth wrote:
On Wed, Feb 13, 2013 at 7:10 AM, Serhiy Storchaka wrote:


On 13.02.13 10:52, Larry Hastings wrote:

I've always hated the "".join(array) idiom for "fast" string
concatenation--it's ugly and it flies in the face of TOOWTDI.  I think
everyone should use "x = a + b + c + d" for string concatenation, and we
should just make that fast.


I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is
more than 3 and some of them are literal strings.


Fixed: x = ('%s' *  len(abcd)) % abcd



Which becomes in the new formatting style

x = ('{}' *  len(abcd)).format(*abcd)

hmm, hmm, not soo nice
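
A quick sketch comparing the spellings discussed here, with a sample tuple standing in for abcd. One difference worth noting: %s and {} convert each item with str(), while ''.join() accepts only strings.

```python
abcd = ('spam', ' = ', 'ham', '\n')  # sample data; must be a tuple for %

via_mod = ('%s' * len(abcd)) % abcd
via_format = ('{}' * len(abcd)).format(*abcd)
via_join = ''.join(abcd)

# Non-string items are accepted by the formatting variants only.
mixed = ('n = ', 3)
mixed_mod = ('%s' * len(mixed)) % mixed
```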



Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Chris Withers

On 13/02/2013 11:53, Steven D'Aprano wrote:

I fixed a performance bug in httplib some years ago by doing the exact
opposite; += -> ''.join(). In that case, it changed downloading a file
from 20 minutes to 3 seconds. That was likely on Python 2.5.



I remember it well.

http://mail.python.org/pipermail/python-dev/2009-August/091125.html

I frequently link to this thread as an example of just how bad repeated
string concatenation can be, how painful it can be to debug, and how
even when the optimization is fast on one system, it may fail and be
slow on another system.


Amusing is that 
http://mail.python.org/pipermail/python-dev/2009-August/thread.html#91125 doesn't 
even list the email where I found the problem...


Chris



Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Amaury Forgeot d'Arc
2013/2/13 Lennart Regebro 

> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka 
> wrote:
> > I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more
> than 3
> > and some of them are literal strings.
>
> This has the benefit of being slow both on CPython and PyPy. Although
> using .format() is even slower. :-)


Did you really try it?
PyPy is really fast with str.__mod__, when the format string is a constant.
Yes, it's jitted.

-- 
Amaury Forgeot d'Arc


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Christian Tismer

On 13.02.13 15:27, Amaury Forgeot d'Arc wrote:


2013/2/13 Lennart Regebro

On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka wrote:
> I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more than 3
> and some of them are literal strings.

This has the benefit of being slow both on CPython and PyPy. Although
using .format() is even slower. :-)


Did you really try it?
PyPy is really fast with str.__mod__, when the format string is a constant.
Yes, it's jitted.


How about the .format() style: Is that jitted as well?
In order to get people to prefer .format over __mod__,
it would be nice if PyPy made this actually _faster_ :-)



Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Serhiy Storchaka
On 13.02.13 15:23, Lennart Regebro wrote:
> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka  wrote:
>> I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more than 3
>> and some of them are literal strings.
> 
> This has the benefit of being slow both on CPython and PyPy. Although
> using .format() is even slower. :-)

Only slightly.

$ ./python -m timeit -s "spam = 'spam'; ham = 'ham'"  "spam + ' = ' + ham + '\n'"
100 loops, best of 3: 0.501 usec per loop
$ ./python -m timeit -s "spam = 'spam'; ham = 'ham'"  "''.join([spam, ' = ', ham, '\n'])"
100 loops, best of 3: 0.504 usec per loop
$ ./python -m timeit -s "spam = 'spam'; ham = 'ham'"  "'%s = %s\n' % (spam, ham)"
100 loops, best of 3: 0.524 usec per loop

But the last variant looks better for me.




Re: [Python-Dev] efficient string concatenation (yep, from 2004)

2013-02-13 Thread Nick Coghlan
On Wed, Feb 13, 2013 at 10:06 PM, Christian Tismer  wrote:
> To avoid such hidden traps in larger code bases, documentation is
> needed that clearly gives a warning saying "don't do that", like CS
> students learn for most other languages.

How much more explicit do you want us to be?

"""6. CPython implementation detail: If s and t are both strings, some
Python implementations such as CPython can usually perform an in-place
optimization for assignments of the form s = s + t or s += t. When
applicable, this optimization makes quadratic run-time much less
likely. This optimization is both version and implementation
dependent. For performance sensitive code, it is preferable to use the
str.join() method which assures consistent linear concatenation
performance across versions and implementations."""

from http://docs.python.org/2/library/stdtypes.html#typesseq

So please don't blame us for people not reading a warning that is already there.

Since my rewrite of the sequence docs, Python 3 doesn't even
acknowledge the hack's existence and is quite explicit about what you
need to do to get reliably linear behaviour:

"""6. Concatenating immutable sequences always results in a new
object. This means that building up a sequence by repeated
concatenation will have a quadratic runtime cost in the total sequence
length. To get a linear runtime cost, you must switch to one of the
alternatives below:

- if concatenating str objects, you can build a list and use
  str.join() at the end or else write to an io.StringIO instance and
  retrieve its value when complete
- if concatenating bytes objects, you can similarly use bytes.join()
  or io.BytesIO, or you can do in-place concatenation with a bytearray
  object. bytearray objects are mutable and have an efficient
  overallocation mechanism
- if concatenating tuple objects, extend a list instead
- for other types, investigate the relevant class documentation"""

from http://docs.python.org/3/library/stdtypes.html#common-sequence-operations

Deliberately *relying* on the += hack to avoid quadratic runtime is
just plain wrong, and our documentation already says so.
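
A minimal sketch of the alternatives named in the Python 3 note, using only the standard library. The helper names (join_parts, stringio_parts, bytearray_parts) are made up for illustration and appear nowhere in the docs.

```python
import io


def join_parts(parts):
    """Collect the pieces in a list and join once at the end."""
    chunks = []
    for part in parts:
        chunks.append(part)
    return "".join(chunks)


def stringio_parts(parts):
    """Write each piece to an in-memory text buffer and read it back."""
    buf = io.StringIO()
    for part in parts:
        buf.write(part)
    return buf.getvalue()


def bytearray_parts(parts):
    """For bytes input: extend a mutable bytearray in place."""
    buf = bytearray()
    for part in parts:
        buf += part
    return bytes(buf)
```

None of the three depends on the interpreter's handling of s += t, which is the property the documentation asks for.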

If anyone really thinks it will help, I can add a CPython
implementation note back in to the Python 3 docs as well, pointing out
that CPython performance measurements may hide broken algorithmic
complexity related to string concatenation, but the corresponding note
in Python 2 doesn't seem to have done much good :P

Regards,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Amaury Forgeot d'Arc
2013/2/13 Christian Tismer 

> On 13.02.13 15:27, Amaury Forgeot d'Arc wrote:
>
>
> 2013/2/13 Lennart Regebro 
>
>> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka 
>> wrote:
>> > I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more
>> than 3
>> > and some of them are literal strings.
>>
>>  This has the benefit of being slow both on CPython and PyPy. Although
>> using .format() is even slower. :-)
>
>
> Did you really try it?
> PyPy is really fast with str.__mod__, when the format string is a constant.
> Yes, it's jitted.
>
>
> How about the .format() style: Is that jitted as well?
> In order to get people to prefer .format over __mod__,
> it would be nice if PyPy made this actually _faster_ :-)


.format() is jitted as well.
But it's still slower than str.__mod__ (about 25%)
I suppose it can be further optimized.

-- 
Amaury Forgeot d'Arc


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Lennart Regebro
On Wed, Feb 13, 2013 at 3:27 PM, Amaury Forgeot d'Arc
 wrote:
>
> 2013/2/13 Lennart Regebro 
>>
>> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka 
>> wrote:
>> > I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more
>> > than 3
>> > and some of them are literal strings.
>>
>> This has the benefit of being slow both on CPython and PyPy. Although
>> using .format() is even slower. :-)
>
>
> Did you really try it?

Yes.

> PyPy is really fast with str.__mod__, when the format string is a constant.
> Yes, it's jitted.

Simple concatenation: s1 = s1 + s2
PyPy-1.9 time for 100 concats of 10000 length strings = 7.133
CPython time for 100 concats of 10000 length strings = 0.005

Making a list of strings and joining after the loop: s1 = ''.join(l)
PyPy-1.9 time for 100 concats of 10000 length strings = 0.005
CPython time for 100 concats of 10000 length strings = 0.003

Old formatting: s1 = '%s%s' % (s1, s2)
PyPy-1.9 time for 100 concats of 10000 length strings = 20.924
CPython time for 100 concats of 10000 length strings = 3.787

New formatting: s1 = '{0}{1}'.format(s1, s2)
PyPy-1.9 time for 100 concats of 10000 length strings = 13.249
CPython time for 100 concats of 10000 length strings = 3.751


I have, by the way, yet to find a usecase where the fastest method in
CPython is not also the fastest in PyPy.
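
The test script itself was not posted, so the following is only a sketch of how such a comparison could be set up with timeit. The function names, the bench() helper and its default sizes are assumptions for illustration, and the timings depend on the interpreter and machine.

```python
import timeit


def concat_plus(n, piece):
    s = ""
    for _ in range(n):
        s = s + piece
    return s


def concat_join(n, piece):
    parts = []
    for _ in range(n):
        parts.append(piece)
    return "".join(parts)


def concat_percent(n, piece):
    s = ""
    for _ in range(n):
        s = "%s%s" % (s, piece)
    return s


def concat_format(n, piece):
    s = ""
    for _ in range(n):
        s = "{0}{1}".format(s, piece)
    return s


METHODS = (concat_plus, concat_join, concat_percent, concat_format)


def bench(n=1000, length=100, repeat=3):
    """Return the best wall-clock time in seconds for each method."""
    piece = "x" * length
    results = {}
    for func in METHODS:
        timings = timeit.repeat(lambda: func(n, piece), number=1, repeat=repeat)
        results[func.__name__] = min(timings)
    return results


if __name__ == "__main__":
    for name, seconds in sorted(bench().items(), key=lambda kv: kv[1]):
        print("%-15s %.6f s" % (name, seconds))
```

All four functions build the same string, so any difference in the reported times comes from the concatenation method alone.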

//Lennart


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Lennart Regebro
On Wed, Feb 13, 2013 at 3:27 PM, Amaury Forgeot d'Arc
 wrote:
> Yes, it's jitted.

Admittedly, I have no idea in which cases the JIT kicks in, and what I
should do to make that happen to make sure I have the best possible
real-life test cases.

//Lennart


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Serhiy Storchaka

On 13.02.13 15:17, Daniel Holth wrote:
> On Wed, Feb 13, 2013 at 7:10 AM, Serhiy Storchaka wrote:
>> I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is
>> more than 3 and some of them are literal strings.
>
> Fixed: x = ('%s' *  len(abcd)) % abcd

No, you don't need this for a constant number of strings. Because
almost certainly some of the strings will be literals, you can write
this in a nicer way. Compare:


'config[' + key + '] = ' + value + '\n'
''.join(['config[', key, '] = ', value, '\n'])
'config[%s] = %s\n' % (key, value)
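
As a small runnable sketch of the comparison above: the three spellings wrapped in functions so they can be checked against each other. The function names are made up for illustration. One behavioural difference worth knowing is that %s converts non-string values with str(), while + and join() raise TypeError for them.

```python
def render_plus(key, value):
    return 'config[' + key + '] = ' + value + '\n'


def render_join(key, value):
    return ''.join(['config[', key, '] = ', value, '\n'])


def render_percent(key, value):
    # %s calls str() on its argument, so non-string values are accepted here.
    return 'config[%s] = %s\n' % (key, value)
```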




Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Amaury Forgeot d'Arc
2013/2/13 Lennart Regebro 

> On Wed, Feb 13, 2013 at 3:27 PM, Amaury Forgeot d'Arc
>  wrote:
> > Yes, it's jitted.
>
> Admittedly, I have no idea in which cases the JIT kicks in, and what I
> should do to make that happen to make sure I have the best possible
> real-life test cases.
>

PyPy JIT kicks in only after 1000 iterations.
I usually use timeit.
It's funny to see how the "1000 loops" line is 5 times faster than the "100
loops":

$ ./pypy-c -m timeit -v -s "a,b,c,d='1234'" "'{}{}{}{}'.format(a,b,c,d)"
10 loops -> 2.19e-05 secs
100 loops -> 0.000122 secs
1000 loops -> 0.00601 secs
10000 loops -> 0.000363 secs
100000 loops -> 0.00528 secs
1000000 loops -> 0.0533 secs
10000000 loops -> 0.528 secs
raw times: 0.521 0.52 0.51
10000000 loops, best of 3: 0.051 usec per loop
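
The same kind of measurement can be sketched from Python with the timeit module, timing one statement at several loop counts so that warm-up effects show up as a change in the per-loop cost. The calibrate() helper and its default counts are assumptions for illustration, not part of timeit.

```python
import timeit


def calibrate(stmt, setup="pass", counts=(10, 100, 1000, 10000)):
    """Time `stmt` once for each loop count.

    Returns a list of (loops, total_seconds, usec_per_loop) tuples.
    """
    timer = timeit.Timer(stmt, setup=setup)
    rows = []
    for number in counts:
        total = timer.timeit(number)  # seconds for `number` executions
        rows.append((number, total, total / number * 1e6))
    return rows


if __name__ == "__main__":
    rows = calibrate("'{}{}{}{}'.format(a,b,c,d)", setup="a,b,c,d='1234'")
    for number, total, usec in rows:
        print("%d loops -> %.3g secs (%.3f usec per loop)" % (number, total, usec))
```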


-- 
Amaury Forgeot d'Arc


Re: [Python-Dev] efficient string concatenation (yep, from 2004)

2013-02-13 Thread Christian Tismer

Hey Nick,

On 13.02.13 15:44, Nick Coghlan wrote:
> On Wed, Feb 13, 2013 at 10:06 PM, Christian Tismer wrote:
>> To avoid such hidden traps in larger code bases, documentation is
>> needed that clearly gives a warning saying "don't do that", like CS
>> students learn for most other languages.
>
> How much more explicit do you want us to be?
>
> """6. CPython implementation detail: If s and t are both strings, some
> Python implementations such as CPython can usually perform an in-place
> optimization for assignments of the form s = s + t or s += t. When
> applicable, this optimization makes quadratic run-time much less
> likely. This optimization is both version and implementation
> dependent. For performance sensitive code, it is preferable to use the
> str.join() method which assures consistent linear concatenation
> performance across versions and implementations."""
>
> from http://docs.python.org/2/library/stdtypes.html#typesseq
>
> So please don't blame us for people not reading a warning that is already there.


I don't, really not. This was a cross-posting effect.
I was using the PyPy documentation, only, and there a lot of things
are mentioned, but this behavioral difference was missing.
Python-dev was not addressed at all.


> ...
> Deliberately *relying* on the += hack to avoid quadratic runtime is
> just plain wrong, and our documentation already says so.
>
> If anyone really thinks it will help, I can add a CPython
> implementation note back in to the Python 3 docs as well, pointing out
> that CPython performance measurements may hide broken algorithmic
> complexity related to string concatenation, but the corresponding note
> in Python 2 doesn't seem to have done much good :P



Well, while we are at it:
Yes, it says so as a note at the end of
http://docs.python.org/2/library/stdtypes.html#typesseq

I doubt that many people read that far, and they do not search
documentation about sequence types when they are adding some strings
together. People seem to have a tendency to just try something out
instead and see if it works. That even seems to get worse the better
and bigger the Python documentation grows. ;-)

Maybe it would be a good idea to remove that concat optimization completely?
Then people will wake up and read the docs to find out what's wrong ;-)
No, does not help, because their test cases will not cover the reality.

-
Thinking a bit more about it.

If you think about docs improvement, I don't believe it helps to make
the very complete reference documentation even more complete.
Completeness is great, don't get me wrong! But what people read
is what pops right into their face, and I think that could be added.

I think before getting people to work through long and
complete documentation, it is probably easier to wake their interest
by something like
"Hey, are you doing things this way?"
And then there is a short, concise list of bad and good things, maybe
even dynamic as in WingWare's "Wing Tips" or any better approach.

From that easily reachable tabular collection of short hints and
caveats, only a few pages long, there could be links to the existing,
real documentation that explains things in more detail.
Maybe that could be a way to get people to actually read.

Just an idea.

cheers - Chris


p.s.:
Other nice attempts that don't seem to really work:

Some hints like
http://docs.python.org/2/howto/doanddont.html
are not bad, although that is hidden in the HowTO section, does only
address a few things,
and also the sub-title "in-depth documents on specific topics" is not
what they seek in the first place while hacking on some code.

Looking things up in a quick ref like
http://rgruet.free.fr/PQR27/PQR2.7.html
is very concise but does also _not_ mention what to avoid.
Others exist, like
http://infohost.nmt.edu/tcc/help/pubs/python/web/

By the way, the first thing I find via google is:
http://www.python.org/doc/QuickRef.html
which is quite funny (v1.3)

--
Christian Tismer :^)   
Software Consulting  : Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121 :*Starship* http://starship.python.net/
14482 Potsdam: PGP key -> http://pgp.uni-mainz.de
phone +49 173 24 18 776  fax +49 (30) 700143-0023
PGP 0x57F3BF04   9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
  whom do you want to sponsor today?   http://www.stackless.com/



Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread MRAB

On 2013-02-13 13:23, Lennart Regebro wrote:
> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka wrote:
>> I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more than 3
>> and some of them are literal strings.
>
> This has the benefit of being slow both on CPython and PyPy. Although
> using .format() is even slower. :-)


How about adding a class method for catenation:

str.cat(a, b, c, d)
str.cat([a, b, c, d]) # Equivalent to "".join([a, b, c, d])

Each argument could be a string or a list of strings.
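
str.cat() does not exist; it is only a proposal in this message. A plain function with the described behaviour could be sketched as follows, where the name cat and the rule "anything that is not a str is treated as an iterable of strings" are assumptions for illustration.

```python
def cat(*args):
    """Hypothetical stand-in for the proposed str.cat().

    Each argument may be a string or an iterable of strings.
    """
    pieces = []
    for arg in args:
        if isinstance(arg, str):
            pieces.append(arg)
        else:
            pieces.extend(arg)
    return "".join(pieces)
```

Both call styles from the proposal then give the same result: cat(a, b, c, d) and cat([a, b, c, d]).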



Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Maciej Fijalkowski
On Wed, Feb 13, 2013 at 7:33 PM, MRAB  wrote:
> On 2013-02-13 13:23, Lennart Regebro wrote:
>>
>> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka 
>> wrote:
>>>
>>> I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more
>>> than 3
>>> and some of them are literal strings.
>>
>>
>> This has the benefit of being slow both on CPython and PyPy. Although
>> using .format() is even slower. :-)
>>
> How about adding a class method for catenation:
>
> str.cat(a, b, c, d)
> str.cat([a, b, c, d]) # Equivalent to "".join([a, b, c, d])
>
> Each argument could be a string or a list of strings.
>
>

I actually wonder.

There seems to be the consensus to avoid += (to some extent). Can
someone commit the change to urllib then? I'm talking about reverting
http://bugs.python.org/issue1285086 specifically


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Brett Cannon
On Wed, Feb 13, 2013 at 1:06 PM, Maciej Fijalkowski wrote:

> On Wed, Feb 13, 2013 at 7:33 PM, MRAB  wrote:
> > On 2013-02-13 13:23, Lennart Regebro wrote:
> >>
> >> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka 
> >> wrote:
> >>>
> >>> I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more
> >>> than 3
> >>> and some of them are literal strings.
> >>
> >>
> >> This has the benefit of being slow both on CPython and PyPy. Although
> >> using .format() is even slower. :-)
> >>
> > How about adding a class method for catenation:
> >
> > str.cat(a, b, c, d)
> > str.cat([a, b, c, d]) # Equivalent to "".join([a, b, c, d])
> >
> > Each argument could be a string or a list of strings.
> >
> >
>
> I actually wonder.
>
> There seems to be the consensus to avoid += (to some extent). Can
> someone commit the change to urllib then? I'm talking about reverting
> http://bugs.python.org/issue1285086 specifically


Please re-open the bug with a comment as to why and I'm sure someone will
get to it.


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Maciej Fijalkowski
On Wed, Feb 13, 2013 at 8:24 PM, Brett Cannon  wrote:
>
>
>
> On Wed, Feb 13, 2013 at 1:06 PM, Maciej Fijalkowski 
> wrote:
>>
>> On Wed, Feb 13, 2013 at 7:33 PM, MRAB  wrote:
>> > On 2013-02-13 13:23, Lennart Regebro wrote:
>> >>
>> >> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka 
>> >> wrote:
>> >>>
>> >>> I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more
>> >>> than 3
>> >>> and some of them are literal strings.
>> >>
>> >>
>> >> This has the benefit of being slow both on CPython and PyPy. Although
>> >> using .format() is even slower. :-)
>> >>
>> > How about adding a class method for catenation:
>> >
>> > str.cat(a, b, c, d)
>> > str.cat([a, b, c, d]) # Equivalent to "".join([a, b, c, d])
>> >
>> > Each argument could be a string or a list of strings.
>> >
>> >
>>
>> I actually wonder.
>>
>> There seems to be the consensus to avoid += (to some extent). Can
>> someone commit the change to urllib then? I'm talking about reverting
>> http://bugs.python.org/issue1285086 specifically
>
>
> Please re-open the bug with a comment as to why and I'm sure someone will
> get to it.

I can't re-open the bug, my account is kind of lame (and seriously,
why do you guys *do* have multiple layers of bug tracker accounts?)

Cheers,
fijal


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Brett Cannon
On Wed, Feb 13, 2013 at 1:27 PM, Maciej Fijalkowski wrote:

> On Wed, Feb 13, 2013 at 8:24 PM, Brett Cannon  wrote:
> >
> >
> >
> > On Wed, Feb 13, 2013 at 1:06 PM, Maciej Fijalkowski 
> > wrote:
> >>
> >> On Wed, Feb 13, 2013 at 7:33 PM, MRAB 
> wrote:
> >> > On 2013-02-13 13:23, Lennart Regebro wrote:
> >> >>
> >> >> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka <
> [email protected]>
> >> >> wrote:
> >> >>>
> >> >>> I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is
> more
> >> >>> than 3
> >> >>> and some of them are literal strings.
> >> >>
> >> >>
> >> >> This has the benefit of being slow both on CPython and PyPy. Although
> >> >> using .format() is even slower. :-)
> >> >>
> >> > How about adding a class method for catenation:
> >> >
> >> > str.cat(a, b, c, d)
> >> > str.cat([a, b, c, d]) # Equivalent to "".join([a, b, c, d])
> >> >
> >> > Each argument could be a string or a list of strings.
> >> >
> >> >
> >>
> >> I actually wonder.
> >>
> >> There seems to be the consensus to avoid += (to some extent). Can
> >> someone commit the change to urllib then? I'm talking about reverting
> >> http://bugs.python.org/issue1285086 specifically
> >
> >
> > Please re-open the bug with a comment as to why and I'm sure someone will
> > get to it.
>
> I can't re-open the bug, my account is kind of lame


Then leave a comment and I will re-open it.


> (and seriously,
> why do you guys *do* have multiple layers of bug tracker accounts?)
>

You obviously have not had users argue with your decision by constantly
flipping a bug back open. =)


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Christian Tismer

On 13.02.13 19:06, Maciej Fijalkowski wrote:
> On Wed, Feb 13, 2013 at 7:33 PM, MRAB wrote:
>> On 2013-02-13 13:23, Lennart Regebro wrote:
>>> On Wed, Feb 13, 2013 at 1:10 PM, Serhiy Storchaka wrote:
>>>> I prefer "x = '%s%s%s%s' % (a, b, c, d)" when string's number is more
>>>> than 3 and some of them are literal strings.
>>>
>>> This has the benefit of being slow both on CPython and PyPy. Although
>>> using .format() is even slower. :-)
>>
>> How about adding a class method for catenation:
>>
>> str.cat(a, b, c, d)
>> str.cat([a, b, c, d]) # Equivalent to "".join([a, b, c, d])
>>
>> Each argument could be a string or a list of strings.
>
> I actually wonder.
>
> There seems to be the consensus to avoid += (to some extent). Can
> someone commit the change to urllib then? I'm talking about reverting
> http://bugs.python.org/issue1285086 specifically


So _is_ += faster in certain library funcs than ''.join() ?
If that's the case, the behavior of string concat could be something
that might be added to some implementation info, if speed really
matters.

The library function then could take this info and use the appropriate code
path to always be fast, during module initialisation.
This is also quite explicit, since it tells the reader not to use in-place
add when it is not optimized.

If += is anyway a bit slower than other ways, forget it.
I would then maybe add a comment somewhere that says
"avoiding '+=' because it is not reliable" or something.

cheers - chris

--
Christian Tismer :^)   
Software Consulting  : Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121 :*Starship* http://starship.python.net/
14482 Potsdam: PGP key -> http://pgp.uni-mainz.de
phone +49 173 24 18 776  fax +49 (30) 700143-0023
PGP 0x57F3BF04   9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
  whom do you want to sponsor today?   http://www.stackless.com/



[Python-Dev] Marking GC details as CPython-only

2013-02-13 Thread Maciej Fijalkowski
Hi

I've tried (and failed) to find what GC details (especially finalizer
semantics) are CPython only and which ones are not. The best I could
find was the documentation of __del__ here:
http://docs.python.org/2/reference/datamodel.html

Things where pypy differs:

* finalizers in pypy will be called only once, even if the object is
resurrected. I'm not sure if this is detail or we're just plain
incompatible.

* pypy breaks cycles and runs finalizers in random order (but
topologically correct), hence gc.garbage is always empty. I *think*
this part is really just an implementation detail

* we're discussing right now about running multiple finalizers. We
want to run them in order, but if there is a link a -> b and a becomes
unreachable, we want to reserve the right to call finalizer a then
finalizer b, even if a.__del__ resurrects a. What do you think?

Overall, the __del__ is baaad.
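
The first point can be probed with a small script that counts how often __del__ runs for an object that resurrects itself. This is only a sketch: the class and function names are made up, and the count it returns is exactly the implementation- and version-dependent detail under discussion (CPython 3.4 and later, following PEP 442, also run the finalizer at most once).

```python
import gc


class Phoenix(object):
    calls = 0        # how many times __del__ has run
    survivors = []   # references that keep resurrected instances alive

    def __del__(self):
        Phoenix.calls += 1
        Phoenix.survivors.append(self)  # resurrect the object


def count_finalizer_calls():
    Phoenix.calls = 0
    del Phoenix.survivors[:]
    Phoenix()                  # unreachable right away
    gc.collect()               # needed where there is no refcounting
    del Phoenix.survivors[:]   # drop the resurrected object again
    gc.collect()
    return Phoenix.calls
```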

Cheers,
fijal


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Serhiy Storchaka

On 13.02.13 20:40, Christian Tismer wrote:
> If += is anyway a bit slower than other ways, forget it.
> I would then maybe add a comment somewhere that says
> "avoiding '+=' because it is not reliable" or something.

+= is the fastest way (in any implementation) if you concatenate only two
strings.





Re: [Python-Dev] Marking GC details as CPython-only

2013-02-13 Thread Xavier Morel
On 2013-02-13, at 19:48 , Maciej Fijalkowski wrote:

> Hi
> 
> I've tried (and failed) to find what GC details (especially finalizer
> semantics) are CPython only and which ones are not. The best I could
> find was the documentation of __del__ here:
> http://docs.python.org/2/reference/datamodel.html
> 
> Things where pypy differs:
> 
> * finalizers in pypy will be called only once, even if the object is
> resurrected. I'm not sure if this is detail or we're just plain
> incompatible.
> 
> * pypy breaks cycles and runs finalizers in random order (but
> topologically correct), hence gc.garbage is always empty. I *think*
> this part is really just an implementation detail
> 
> * we're discussing right now about running multiple finalizers. We
> want to run them in order, but if there is a link a -> b and a becomes
> unreachable, we want to reserve the right to call finalizer a then
> finalizer b, even if a.__del__ resurrects a. What do you think?
> 
> Overall, the __del__ is baaad.
> 
> Cheers,
> fijal

There may be one more, although I'm not sure whether it's a GC artifact
or something completely unspecified: if a context manager is part of a
suspended stack (because it's in a generator) when the program
terminates, cpython will run __exit__ but pypy will not

--
# -*- encoding: utf-8 -*-
class C(object):
    def __enter__(self):
        print ("entering")
    def __exit__(self, *args):
        print ("exiting")

def gen():
    with C():
        yield

r = gen()
next(r)
--
$ python2 test.py
entering
exiting
$ python3 test.py
entering
exiting
$ pypy test.py
entering
$
--
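
A variant of the script above that does not depend on what the interpreter does at exit: closing the generator explicitly raises GeneratorExit at the suspended yield, so the with block is left and __exit__ runs at that point. The events list is an addition for illustration, used instead of print so the order can be checked.

```python
events = []


class C(object):
    def __enter__(self):
        events.append("entering")

    def __exit__(self, *args):
        events.append("exiting")


def gen():
    with C():
        yield


r = gen()
next(r)     # runs up to the yield, so __enter__ has been called
r.close()   # raises GeneratorExit inside gen(), so __exit__ runs now
```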



Re: [Python-Dev] efficient string concatenation (yep, from 2004)

2013-02-13 Thread R. David Murray
On Wed, 13 Feb 2013 18:07:22 +0100, Christian Tismer  
wrote:
> I think before getting people to work through long and
> complete documentation, it is probably easier to wake their interest
> by something like
> "Hey, are you doing things this way?"
> And then there is a short, concise list of bad and good things, maybe
> even dynamic as in WingWare's "Wing Tips" or any better approach.
> 
>  From that easily reachable, only a few pages long tabular
> collection of short hints and caveats there could be linkage to the 
> existing, real
> documentation that explains things in more detail.
> Maybe that could be a way to get people to actually read.

There used to be a HOWTO with this goal, but its opinions were
considered outdated and/or contentious, and it was deleted:

http://docs.python.org/2.6/howto/doanddont.html

--David


Re: [Python-Dev] Marking GC details as CPython-only

2013-02-13 Thread Maciej Fijalkowski
On Wed, Feb 13, 2013 at 9:09 PM, Xavier Morel  wrote:
> On 2013-02-13, at 19:48 , Maciej Fijalkowski wrote:
>
>> Hi
>>
>> I've tried (and failed) to find what GC details (especially finalizer
>> semantics) are CPython only and which ones are not. The best I could
>> find was the documentation of __del__ here:
>> http://docs.python.org/2/reference/datamodel.html
>>
>> Things where pypy differs:
>>
>> * finalizers in pypy will be called only once, even if the object is
>> resurrected. I'm not sure if this is detail or we're just plain
>> incompatible.
>>
>> * pypy breaks cycles and runs finalizers in random order (but
>> topologically correct), hence gc.garbage is always empty. I *think*
>> this part is really just an implementation detail
>>
>> * we're discussing right now about running multiple finalizers. We
>> want to run them in order, but if there is a link a -> b and a becomes
>> unreachable, we want to reserve the right to call finalizer a then
>> finalizer b, even if a.__del__ resurrects a. What do you think?
>>
>> Overall, the __del__ is baaad.
>>
>> Cheers,
>> fijal
>
> There may be one more, although I'm not sure whether it's a GC artifact
> or something completely unspecified: if a context manager is part of a
> suspended stack (because it's in a generator) when the program
> terminates, cpython will run __exit__ but pypy will not
>
> --
> # -*- encoding: utf-8 -*-
> class C(object):
>     def __enter__(self):
>         print ("entering")
>     def __exit__(self, *args):
>         print ("exiting")
>
> def gen():
>     with C():
>         yield
>
> r = gen()
> next(r)
> --
> $ python2 test.py
> entering
> exiting
> $ python3 test.py
> entering
> exiting
> $ pypy test.py
> entering
> $
> --
>

I think it's well documented you should not rely on stuff like that
being run at the exit of the interpreter. I think we'll try harder to
run finalizers at the end of the interpreter (right now we only flush
files). File the issue.


Re: [Python-Dev] Marking GC details as CPython-only

2013-02-13 Thread Antoine Pitrou
On Wed, 13 Feb 2013 20:48:08 +0200
Maciej Fijalkowski  wrote:
> 
> Things where pypy differs:
> 
> * finalizers in pypy will be called only once, even if the object is
> resurrected. I'm not sure if this is detail or we're just plain
> incompatible.

I think this should be a detail.

> * pypy breaks cycles and runs finalizers in random order (but
> topologically correct), hence gc.garbage is always empty. I *think*
> this part is really just an implementation detail

Agreed.

> * we're discussing right now about running multiple finalizers. We
> want to run them in order, but if there is a link a -> b and a becomes
> unreachable, we want to reserve the right to call finalizer a then
> finalizer b, even if a.__del__ resurrects a. What do you think?

I think resurrecting objects from __del__ is crazy, so IMO what you
suggest is fine.

Regards

Antoine.




Re: [Python-Dev] Marking GC details as CPython-only

2013-02-13 Thread Armin Rigo
Hi,

On Wed, Feb 13, 2013 at 8:22 PM, Maciej Fijalkowski  wrote:
> I think it's well documented you should not rely on stuff like that
> being run at the exit of the interpreter.

Actually right now, at the exit of the interpreter, we just leave the
program without caring about running any __del__.  This might mean
that in a short-running script no __del__ is ever run.  I'd add this
question to your original list: is it good enough, or should we try
harder to run destructors at the exit?


A bientôt,

Armin.


Re: [Python-Dev] Marking GC details as CPython-only

2013-02-13 Thread Antoine Pitrou
On Wed, 13 Feb 2013 20:30:18 +0100
Armin Rigo  wrote:
> Hi,
> 
> On Wed, Feb 13, 2013 at 8:22 PM, Maciej Fijalkowski  wrote:
> > I think it's well documented you should not rely on stuff like that
> > being run at the exit of the interpreter.
> 
> Actually right now, at the exit of the interpreter, we just leave the
> program without caring about running any __del__.  This might mean
> that in a short-running script no __del__ is ever run.  I'd add this
> question to your original list: is it good enough, or should we try
> harder to run destructors at the exit?

Destructors should be run at exit like they would be in any other
finalization situation. Anything else is dangerous, since important
resources may not be finalized, committed, or released.

(and by destructors I also mean weakref callbacks)
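
For code that must release a resource regardless of how an implementation treats __del__ at exit, one option is an explicit close() backed by weakref.finalize. That class was added in Python 3.4, after this thread, so its use here is an assumption about the reader's Python version. The Resource class and the log list are made up for illustration.

```python
import weakref


class Resource(object):
    def __init__(self, log):
        # The callback must not reference `self`, or the object never dies.
        # finalize objects have atexit=True by default, so callbacks still
        # pending at interpreter exit are run then.
        self._finalizer = weakref.finalize(self, log.append, "released")

    def close(self):
        # Calling the finalizer runs the callback at most once.
        self._finalizer()
```

Calling close() gives a deterministic release point, and the finalizer covers the case where close() is never called.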

Regards

Antoine.




Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Lennart Regebro
On Wed, Feb 13, 2013 at 7:06 PM, Maciej Fijalkowski  wrote:
> I actually wonder.
>
> There seems to be the consensus to avoid += (to some extent). Can
> someone commit the change to urllib then? I'm talking about reverting
> http://bugs.python.org/issue1285086 specifically

That's unquoting of URLs, strings that aren't particularly long,
normally. And it's not in any tight loops. I'm astonished that any
change makes any noticeable speed difference here at all.

//Lennart


Re: [Python-Dev] Marking GC details as CPython-only

2013-02-13 Thread Maciej Fijalkowski
On Wed, Feb 13, 2013 at 9:40 PM, Antoine Pitrou  wrote:
> On Wed, 13 Feb 2013 20:30:18 +0100
> Armin Rigo  wrote:
>> Hi,
>>
>> On Wed, Feb 13, 2013 at 8:22 PM, Maciej Fijalkowski  wrote:
>> > I think it's well documented you should not rely on stuff like that
>> > being run at the exit of the interpreter.
>>
>> Actually right now, at the exit of the interpreter, we just leave the
>> program without caring about running any __del__.  This might mean
>> that in a short-running script no __del__ is ever run.  I'd add this
>> question to your original list: is it good enough, or should we try
>> harder to run destructors at the exit?
>
> Destructors should be run at exit like they would be in any other
> finalization situation. Anything else is dangerous, since important
> resources may not be finalized, committed, or released.
>
> (and by destructors I also mean weakref callbacks)
>
> Regards
>
> Antoine.

I think Antoine is right (even though the CPython docs clearly state
that __del__ methods might not be run at interpreter exit).

Cheers,
fijal


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Lennart Regebro
On Wed, Feb 13, 2013 at 4:02 PM, Amaury Forgeot d'Arc
 wrote:
> 2013/2/13 Lennart Regebro 
>>
>> On Wed, Feb 13, 2013 at 3:27 PM, Amaury Forgeot d'Arc
>>  wrote:
>> > Yes, it's jitted.
>>
>> Admittedly, I have no idea in which cases the JIT kicks in, and what I
>> should do to make that happen to make sure I have the best possible
>> real-life test cases.
>
>
> PyPy JIT kicks in only after 1000 iterations.

Actually, my test code mixed up iterations and string length when
printing the results, so the tests I showed were not 100 iterations
with 10,000-character strings, but 10,000 iterations with
100-character strings.

No matter what the iteration count or string length is, .format() is
the slowest or second slowest of all the string concatenation methods
I've tried, and '%s%s' % is just marginally faster. This holds on both
PyPy and CPython, irrespective of string length.

I'll stick my neck out and say that using formatting for concatenation
is probably an anti-pattern.
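
A rough way to check this claim on a given interpreter is sketched
below. This is not the test code referred to above (which is not shown
in the thread); the function names, string lengths and iteration counts
are illustrative assumptions, and the ranking may differ between
versions and implementations.

```python
import timeit


def concat_plus(a, b):
    # Plain concatenation with the + operator.
    return a + b


def concat_percent(a, b):
    # Concatenation through %-formatting.
    return '%s%s' % (a, b)


def concat_format(a, b):
    # Concatenation through str.format().
    return '{}{}'.format(a, b)


def concat_join(a, b):
    # Concatenation through str.join() on a tuple.
    return ''.join((a, b))


# Hypothetical selection of methods to compare.
METHODS = (concat_plus, concat_percent, concat_format, concat_join)


def compare(length=100, number=10000):
    """Return {method name: seconds} for `number` calls on two strings."""
    a = 'a' * length
    b = 'b' * length
    results = {}
    for func in METHODS:
        results[func.__name__] = timeit.timeit(lambda: func(a, b),
                                               number=number)
    return results


if __name__ == '__main__':
    # Print the methods from fastest to slowest on this interpreter.
    for name, seconds in sorted(compare().items(), key=lambda kv: kv[1]):
        print('%-16s %.6f s' % (name, seconds))
```

Raising `number` above the JIT warm-up threshold mentioned earlier in
the thread matters when running this under PyPy.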

//Lennart


Re: [Python-Dev] Marking GC details as CPython-only

2013-02-13 Thread Barry Warsaw
On Feb 13, 2013, at 08:30 PM, Armin Rigo wrote:

>Actually right now, at the exit of the interpreter, we just leave the
>program without caring about running any __del__.  This might mean
>that in a short-running script no __del__ is ever run.  I'd add this
>question to your original list: is it good enough, or should we try
>harder to run destructors at the exit?

I've seen *tons* of small Python scripts that don't care about what happens,
if anything, at program exit.  Some have comments making that quite explicit.
Sometimes, they even do so as performance improvements!  When you care about
start up costs, you often also care about tear down costs.

Such scripts just expect that all their resources will get freed when the
process exits.  Of course, they're not always right (e.g. clean up tmp files),
but it's pretty common, and I'm sure not just in Python.  OTOH, relying on
__del__ to clean up your tmp files seems rather dubious (well, frankly, so
do most uses of __del__).
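
One way to avoid depending on __del__ for temporary files is to scope
them with a context manager, so cleanup happens at a known point rather
than at interpreter exit. The sketch below uses the standard
tempfile.TemporaryDirectory; the helper name and the file name are made
up for illustration.

```python
import os
import tempfile


def write_scratch_data(payload):
    """Write payload to a scratch file, read it back, and clean up.

    Returns (path, contents). The path no longer exists on return.
    """
    # The directory and everything in it are removed when the with
    # block exits, whether or not an exception was raised inside it.
    with tempfile.TemporaryDirectory() as scratch_dir:
        path = os.path.join(scratch_dir, 'scratch.txt')  # hypothetical name
        with open(path, 'w') as handle:
            handle.write(payload)
        with open(path) as handle:
            contents = handle.read()
    return path, contents


if __name__ == '__main__':
    path, contents = write_scratch_data('hello')
    print(contents, os.path.exists(path))
```

This does not help a script that is killed outright, but it removes the
dependency on destructors running during normal shutdown.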

Cheers,
-Barry


Re: [Python-Dev] Marking GC details as CPython-only

2013-02-13 Thread Richard Oudkerk

On 13/02/2013 7:25pm, Antoine Pitrou wrote:
> I think resurrecting objects from __del__ is crazy, so IMO what you
> suggest is fine.

You mean like subprocess.Popen.__del__?  I quite agree.

--
Richard



Re: [Python-Dev] efficient string concatenation (yep, from 2004)

2013-02-13 Thread Greg Ewing

Steven D'Aprano wrote:
> The documentation for strings is also clear that you should not rely on
> this optimization:
>
> ...
>
> It can, and does, fail on CPython as well, as it is sensitive to memory
> allocation details.

If it's that unreliable, why was it ever implemented
in the first place?

--
Greg


Re: [Python-Dev] efficient string concatenation (yep, from 2004)

2013-02-13 Thread Maciej Fijalkowski
On Wed, Feb 13, 2013 at 11:17 PM, Greg Ewing
 wrote:
> Steven D'Aprano wrote:
>>
>> The documentation for strings is also clear that you should not rely on
>> this
>> optimization:
>>
>> ...
>
>>
>>
>> It
>> can, and does, fail on CPython as well, as it is sensitive to memory
>> allocation details.
>
>
> If it's that unreliable, why was it ever implemented
> in the first place?
>

Probably because someone thought it was a good idea, and the other
people asked for a review said +1 :)


Re: [Python-Dev] efficient string concatenation (yep, from 2004)

2013-02-13 Thread Christian Tismer
Hi Lennart,

Sent from my Ei4Steve

On Feb 13, 2013, at 8:42, Lennart Regebro  wrote:

>> Something is needed - a patch for PyPy or for the documentation I guess.
> 
> Not arguing that it wouldn't be good, but I disagree that it is needed.
> 
> This is only an issue when you, as in your proof, have a loop that
> does concatenation. This is usually when looping over a list of
> strings that should be concatenated together. Doing so in a loop with
> concatenation may be the natural way for people new to Python, but the
> "natural" way to do it in Python is with a ''.join() call.
> 
> This:
> 
>s = ''.join(('X' for x in xrange(x)))
> 
> Is more than twice as fast in Python 2.7 than your example. It is in
> fact also slower in PyPy 1.9 than Python 2.7, but only with a factor
> of two:
> 
> Python 2.7:
> time for 1000 concats = 0.887
> Pypy 1.9:
> time for 1000 concats = 1.600
> 
> (And of course s = 'X' * x takes only about a hundredth of the time,
> but that's cheating. ;-)
> 
> //Lennart

This all does not really concern me, as long as it roughly has the same order 
of magnitude, or better the same big Oh. 
I'm not concerned by a constant factor. 
I'm concerned by a freezing machine that suddenly gets 1 times slower
because the algorithms never explicitly state their algorithmic complexity. 
( I think I said this too often, today?)

As a side note:
Something similar happened to me when somebody used "range" in Python 3.3
and then ran the same code on Python 2.7, with the crazy effect of having
to reboot: range() on 2.7, with arguments from some arbitrary input file.
A newbie error that was hard to understand, because he had been taught to
think 'xrange' when writing 'range'. Hard for me to understand because I am
no longer able to make these errors at all, or even expect them.
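
The difference behind this anecdote can be seen in a small sketch,
assuming Python 3: there, range() returns a lazy sequence whose size
does not grow with its length, whereas on Python 2.7 the same call
builds a full list (xrange being the lazy variant). The helper name is
made up for illustration.

```python
import sys


def describe_range(stop):
    """Return (length, size in bytes, last item) of range(stop).

    On Python 3 the range object stores only start, stop and step,
    so nothing proportional to `stop` is allocated here.
    """
    r = range(stop)
    return len(r), sys.getsizeof(r), r[-1]


if __name__ == '__main__':
    # A billion-element range is cheap on Python 3; running the same
    # call on Python 2.7 would try to allocate a billion-item list.
    print(describe_range(10**9))
```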

Cheers - Chris


Re: [Python-Dev] efficient string concatenation (yep, from 2004)

2013-02-13 Thread Christian Tismer

On 13.02.13 22:52, Maciej Fijalkowski wrote:
> On Wed, Feb 13, 2013 at 11:17 PM, Greg Ewing wrote:
>> Steven D'Aprano wrote:
>>> The documentation for strings is also clear that you should not rely on
>>> this optimization:
>>>
>>> ...
>>>
>>> It can, and does, fail on CPython as well, as it is sensitive to memory
>>> allocation details.
>>
>> If it's that unreliable, why was it ever implemented
>> in the first place?



The _trick_ was very good, the idea was - uhm - arguable.
I wished I had objected, but at that time I was only fascinated.

-- chris

(There are parallels on a larger scale, but I'm shutting up, intentionally.)

--
Christian Tismer :^)   
Software Consulting  : Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121 :*Starship* http://starship.python.net/
14482 Potsdam: PGP key -> http://pgp.uni-mainz.de
phone +49 173 24 18 776  fax +49 (30) 700143-0023
PGP 0x57F3BF04   9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
  whom do you want to sponsor today?   http://www.stackless.com/



Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Victor Stinner
Hi,

I wrote a quick hack to expose _PyUnicodeWriter as _string.UnicodeWriter:
http://www.haypocalc.com/tmp/string_unicode_writer.patch

And I wrote a (micro-)benchmark:
http://www.haypocalc.com/tmp/bench_join.py
(The benchmark uses only ASCII strings; it would be interesting to
test latin1, BMP and non-BMP characters too.)

UnicodeWriter (using the "writer += str" API) is the fastest method in
most cases, except for data = ['a'*10**4] * 10**2 (in this case, it's
8x slower!). I guess that the overhead comes from the overallocation,
which then requires shrinking the buffer (shrinking may copy the whole
string). The overallocation factor could be adapted depending on the
size.

If computing the final length is cheap (eg. if it's always the same),
it's always faster to use UnicodeWriter with a preallocated buffer.
The "UnicodeWriter +=; preallocate" test uses a precomputed length
(ok, it's cheating!).

I also implemented a UnicodeWriter.append method to measure the overhead
of a method lookup: it's expensive :-)
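
For readers without the patch, the shape of the API being benchmarked
can be approximated in pure Python. The sketch below is only a stand-in:
the class name ListWriter is hypothetical, and it collects pieces in a
list and joins them at the end, whereas the C-level _PyUnicodeWriter
described above writes into an overallocated buffer. It says nothing
about the performance of the real thing.

```python
class ListWriter:
    """Hypothetical pure-Python stand-in for the proposed writer API."""

    def __init__(self):
        self._parts = []

    def __iadd__(self, text):
        # Supports "writer += item"; must return self so that the name
        # stays bound to the writer after the augmented assignment.
        self._parts.append(text)
        return self

    def append(self, text):
        # Method-call spelling of the same operation.
        self._parts.append(text)

    def getvalue(self):
        # A single join at the end keeps the total work linear.
        return ''.join(self._parts)

    # str(writer) gives the same result as writer.getvalue().
    __str__ = getvalue


def build(data):
    """Concatenate an iterable of strings using the += spelling."""
    writer = ListWriter()
    for item in data:
        writer += item
    return writer.getvalue()


if __name__ == '__main__':
    print(build(['abc'] * 3))
```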

--

Platform: Linux-3.6.10-2.fc16.x86_64-x86_64-with-fedora-16-Verne
Python unicode implementation: PEP 393
Date: 2013-02-14 01:00:06
CFLAGS: -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
SCM: hg revision=659ef9d360ae+ tag=tip branch=default date="2013-02-13
15:25 +"
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Python version: 3.4.0a0 (default:659ef9d360ae+, Feb 14 2013, 00:35:19)
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)]
Bits: int=32, long=64, long long=64, pointer=64

[ data = ['a'] * 10**2 ]

4.21 us: UnicodeWriter +=; preallocate
4.86 us (+15%): UnicodeWriter append; lookup attr once
4.99 us (+18%): UnicodeWriter +=

6.35 us (+51%): str += str
6.45 us (+53%): io.StringIO; lookup attr once
7.02 us (+67%): "".join(list)
7.46 us (+77%): UnicodeWriter append
8.77 us (+108%): io.StringIO

[ data = ['abc'] * 10**4 ]

356 us: UnicodeWriter append; lookup attr once
375 us (+5%): UnicodeWriter +=; preallocate
376 us (+6%): UnicodeWriter +=

495 us (+39%): io.StringIO; lookup attr once
614 us (+73%): "".join(list)
629 us (+77%): UnicodeWriter append
716 us (+101%): str += str
737 us (+107%): io.StringIO

[ data = ['a'*10**4] * 10**1 ]

3.67 us: str += str
3.76 us: UnicodeWriter +=; preallocate

3.95 us (+8%): UnicodeWriter +=
4.01 us (+9%): UnicodeWriter append; lookup attr once
4.06 us (+11%): "".join(list)
4.24 us (+15%): UnicodeWriter append
4.59 us (+25%): io.StringIO; lookup attr once
4.77 us (+30%): io.StringIO

[ data = ['a'*10**4] * 10**2 ]

41.2 us: UnicodeWriter +=; preallocate
43.8 us (+6%): str += str
45.4 us (+10%): "".join(list)
45.9 us (+11%): io.StringIO; lookup attr once
48.3 us (+17%): io.StringIO

370 us (+797%): UnicodeWriter +=
370 us (+798%): UnicodeWriter append; lookup attr once
377 us (+816%): UnicodeWriter append

[ data = ['a'*10**4] * 10**4 ]

38.9 ms: UnicodeWriter +=; preallocate
39 ms: "".join(list)
39.1 ms: io.StringIO; lookup attr once
39.4 ms: UnicodeWriter append; lookup attr once
39.5 ms: io.StringIO
39.6 ms: UnicodeWriter +=
40.1 ms: str += str
40.1 ms: UnicodeWriter append

Victor

2013/2/13 Antoine Pitrou :
> Le Wed, 13 Feb 2013 09:02:07 +0100,
> Victor Stinner  a écrit :
>> I added a _PyUnicodeWriter internal API to optimize str%args and
>> str.format(args). It uses a buffer which is overallocated, so it's
>> basically like CPython str += str optimization. I still don't know how
>> efficient it is on Windows, since realloc() is slow on Windows (at
>> least on old Windows versions).
>>
>> We should add an official and public API to concatenate strings.
>
> There's io.StringIO already.
>
> Regards
>
> Antoine.


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Steven D'Aprano

On 14/02/13 01:18, Chris Withers wrote:
> On 13/02/2013 11:53, Steven D'Aprano wrote:
>>> I fixed a performance bug in httplib some years ago by doing the exact
>>> opposite; += -> ''.join(). In that case, it changed downloading a file
>>> from 20 minutes to 3 seconds. That was likely on Python 2.5.
>>
>> I remember it well.
>>
>> http://mail.python.org/pipermail/python-dev/2009-August/091125.html
>>
>> I frequently link to this thread as an example of just how bad repeated
>> string concatenation can be, how painful it can be to debug, and how
>> even when the optimization is fast on one system, it may fail and be
>> slow on another system.
>
> Amusing is that
> http://mail.python.org/pipermail/python-dev/2009-August/thread.html#91125
> doesn't even list the email where I found the problem...

That's because it wasn't solved until the following month.

http://mail.python.org/pipermail/python-dev/2009-September/thread.html#91581



--
Steven


Re: [Python-Dev] [pypy-dev] efficient string concatenation (yep, from 2004)

2013-02-13 Thread Steven D'Aprano

On 14/02/13 01:44, Nick Coghlan wrote:
> Deliberately *relying* on the += hack to avoid quadratic runtime is
> just plain wrong, and our documentation already says so.

+1

I'm not sure that there's any evidence that people in general are *relying* on
the += hack. More likely they write the first code they think of, which is +=,
and never consider the consequences or test it thoroughly. Even if they test
it, they only test it on one version of one implementation on one platform, and
likely only with small N.

Besides, if you know that N will always be small, then using += is not wrong.

I think we have a tendency to sometimes overreact in cases like this. I don't
think we need to do any more than we are already doing: the tutor@ and
python-list@ mailing lists already try to educate users to use join, the docs
recommend using join, the Internet is filled with code that correctly uses
join. What more can we do? I see no evidence that the Python community is awash
with coders who write code with atrocious performance characteristics, or at
least no more than any other language.



--
Steven


Re: [Python-Dev] Usage of += on strings in loops in stdlib

2013-02-13 Thread Antoine Pitrou
On Thu, 14 Feb 2013 01:21:40 +0100
Victor Stinner  wrote:
> 
> UnicodeWriter (using the "writer += str" API) is the fastest method in
> most cases, except for data = ['a'*10**4] * 10**2 (in this case, it's
> 8x slower!). I guess that the overhead comes for the overallocation
> which then require to shrink the buffer (shrinking may copy the whole
> string). The overallocation factor may be adapted depending on the
> size.

How about testing on Windows?

> If computing the final length is cheap (eg. if it's always the same),
> it's always faster to use UnicodeWriter with a preallocated buffer.

That's not a particularly surprising discovery, is it? ;-)

Regards

Antoine.