On Mon, Mar 30, 2020 at 10:07:30AM -0700, Andrew Barnert via Python-ideas wrote:
> Why? What’s the benefit of building a mutable string around a virtual
> file object wrapped around a buffer (with all the extra complexities
> and performance costs that involves, like incremental Unicode encoding
> and decoding) instead of just building it around a buffer directly?
The quote about adding another abstraction layer solving every problem
except the problem of having too many abstraction layers comes to mind.
But let's please not hijack this proposal by making it about a full-
blown mutable string object. Paul's proposal is simple: add `+=` as an
alias to `.write` to StringIO and BytesIO.
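To make that concrete, here is a minimal sketch of what the proposal
amounts to, written as a subclass so that it runs today. The name
AppendableStringIO is made up for illustration; it is not in the stdlib
and not part of Paul's proposal, which would presumably put the method on
io.StringIO and io.BytesIO themselves.

```python
import io


class AppendableStringIO(io.StringIO):
    """Hypothetical: io.StringIO plus `+=` as an alias for .write()."""

    def __iadd__(self, s):
        self.write(s)
        # Return self so that `buf += s` leaves buf bound to the same object.
        return self


buf = AppendableStringIO()
for s in ["spam", " and ", "eggs"]:
    buf += s
result = buf.getvalue()
```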
We have the str concat optimization to cater for people who want to
concatenate strings using `buf += str`. You are absolutely right that
the correct cross-platform way of doing it is to accumulate a list then
join it, but that's an idiom that doesn't come easily to many people.
Hence even people who know better sometimes prefer the `buf += str`
idiom, and hence the repeated arguments about making join a list method.
(But you must accumulate the list with append, not with list
concatenation, or you are back to quadratic behaviour.)
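A small illustration of that caveat, with made-up input data. Both loops
build the same string; the difference is only in how the list grows.

```python
strings = ["a", "b", "c"] * 1000  # illustrative data only

# Linear: append mutates the list in place, amortised O(1) per step.
parts = []
for s in strings:
    parts.append(s)
joined = "".join(parts)

# Quadratic: `slow_parts + [s]` builds a brand new list on every
# iteration, copying everything accumulated so far.
# (`slow_parts += [s]` would be an in-place extend and avoid the copy.)
slow_parts = []
for s in strings:
    slow_parts = slow_parts + [s]
slow_joined = "".join(slow_parts)
```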
It seems to me that the least invasive change to write efficient, good
looking code is Paul's suggestion to use StringIO or BytesIO with the
proposed `+=` operator. Side by side:
# best read using a fixed-width font
buf = ''              buf = []              buf = io.StringIO()
for s in strings:     for s in strings:     for s in strings:
    buf += s              buf.append(s)         buf += s
                      buf = ''.join(buf)    buf = buf.getvalue()
Clearly the first is prettiest, which is why people use it. (It goes
without saying that *pretty* is a matter of opinion.) It needs no extra
conversion at the end, which is nice. But it's not cross-platform, since
it relies on a CPython-specific optimization, and even in CPython it's a
bit risky.
The middle is the most correct, but honestly, it's not that pretty. Many
people *really* hate the fact that join is a string method and would
rather write `buf.join('')`.
The third is, in my opinion, quite nice. With the status quo
`buf.write(s)`, it's much less nice.
Paul's point about refactoring should be treated more seriously. If you
have code that currently has a bunch of `buf += s` scattered around in
many places, changing to the middle idiom is difficult:
1. you have to change the buffer initialisation;
2. you have to add an extra conversion to the end;
3. and you have to change every single `buf += s` to `buf.append(s)`.
With Paul's proposal, 1 and 2 still apply, but that's just two lines.
Three if you include the `import io`. But step 3 is gone. You don't have
to change any of the buffer concatenations to appends.
Now that's not such a big deal when all of the concatenations are right
there in one little loop, but if they are scattered around dozens of
methods or functions it can be a significant refactoring step.
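A rough sketch of the scattered case, assuming the proposed `+=` existed.
The Report class and the AppendableStringIO stand-in are both
hypothetical, invented here only to show which lines would change.

```python
import io


class AppendableStringIO(io.StringIO):
    # Hypothetical stand-in for the proposal: `+=` as an alias for .write().
    def __iadd__(self, s):
        self.write(s)
        return self


class Report:
    def __init__(self):
        # Step 1: this line used to be `self.buf = ''`.
        self.buf = AppendableStringIO()

    def header(self, title):
        self.buf += title + "\n"          # unchanged by the refactoring

    def row(self, name, value):
        self.buf += f"{name}: {value}\n"  # unchanged by the refactoring

    def render(self):
        # Step 2: this line used to be `return self.buf`.
        return self.buf.getvalue()


report = Report()
report.header("Totals")
report.row("spam", 3)
text = report.render()
```

With the list idiom, header() and row() would each need editing as well.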
> More generally, a StringIO is neither the obvious way
If I were new to Python, and wanted to build a string, and knew that
repeated concatenation was slow, I'd probably look for some sort of
String Builder or String IO class before thinking of *list append*.
Especially if I came from a Java background.
> nor the fastest way
It's pretty close though.
On my test, accumulating 500,000 strings into a list versus a StringIO
buffer, then building a string, took 27.5 versus 31.6 ms. Using a string
took 36.4 ms. So it's faster than the optimized string concat, and
within arm's reach of list+join.
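For anyone who wants to try this themselves, something along the
following lines would do. This is not the script behind the numbers
above; the function names and the input data are assumptions, and
timings will differ by machine, Python version and implementation.

```python
import io
import timeit


def with_list(strings):
    buf = []
    for s in strings:
        buf.append(s)
    return "".join(buf)


def with_stringio(strings):
    buf = io.StringIO()
    for s in strings:
        buf.write(s)
    return buf.getvalue()


def with_str(strings):
    buf = ""
    for s in strings:
        buf += s
    return buf


def compare(n=500_000, repeat=5):
    # Best-of-`repeat` wall-clock time in milliseconds for each idiom.
    strings = ["x"] * n  # assumed input; the original test data isn't shown
    results = {}
    for func in (with_list, with_stringio, with_str):
        best = min(timeit.repeat(lambda: func(strings), number=1, repeat=repeat))
        results[func.__name__] = best * 1000
    return results
```

Calling compare() returns a dict mapping each function name to its best
time in milliseconds.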
Replacing `buf.write` with `+=` might, theoretically, shave off a bit of
the overhead of attribute lookup. That would close the distance a
fraction. And maybe there are other future optimizations that could
follow. Or maybe not.
> nor the recommended way to build strings on the fly in Python, so
> why do you agree with the OP that we need to make it better for that
> purpose? Just to benefit people who want to write C++ instead of
> Python?
If writing `buf += s` is writing C++ instead of Python, then you have
spent much of this thread defending the optimization added in version
2.4 to allow people to write C++ instead of Python. So why are you
suddenly against it now when the underlying buffer changes from str to
StringIO?
When I was younger and still smarting from being on the losing side of
the Pascal vs C holy wars, I really hated the idea of adding `+=` to
Python because it would encourage people to write C instead of Python. I
got over it :-)
--
Steven