On Mon, Mar 30, 2020 at 10:07:30AM -0700, Andrew Barnert via Python-ideas wrote:

> Why? What’s the benefit of building a mutable string around a virtual 
> file object wrapped around a buffer (with all the extra complexities 
> and performance costs that involves, like incremental Unicode encoding 
> and decoding) instead of just building it around a buffer directly?

The quote about adding another abstraction layer solving every problem
except the problem of having too many abstraction layers comes to mind.

But let's please not hijack this proposal by making it about a 
full-blown mutable string object. Paul's proposal is simple: add `+=` 
as an alias for `.write` on StringIO and BytesIO.

We have the str concat optimization to cater for people who want to 
concatenate strings using `buf += str`. You are absolutely right that 
the correct cross-platform way of doing it is to accumulate a list then 
join it, but that's an idiom that doesn't come easily to many people. 
Hence even people who know better sometimes prefer the `buf += str` 
idiom, and hence the repeated arguments about making join a list method.

(But you must accumulate the list with append, not with list 
concatenation, or you are back to quadratic behaviour.)
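
A minimal sketch of that difference, with function names made up for 
illustration. Note that it is `parts = parts + [s]` that copies the 
list on every step; `list.append` is the one to use.

```python
def build_with_append(strings):
    """Accumulate with list.append, then join once at the end."""
    parts = []
    for s in strings:
        parts.append(s)        # amortized constant time per item
    return ''.join(parts)


def build_with_list_concat(strings):
    """Same result, but each step builds a brand new list."""
    parts = []
    for s in strings:
        parts = parts + [s]    # copies every existing item: quadratic overall
    return ''.join(parts)
```

Both return the same string; only the amount of copying differs.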

It seems to me that the least invasive change to write efficient, good 
looking code is Paul's suggestion to use StringIO or BytesIO with the 
proposed `+=` operator. Side by side:

    # best read using a fixed-width font
    buf = ''            buf = []              buf = io.StringIO()
    for s in strings:   for s in strings:     for s in strings:
        buf += s            buf.append(s)         buf += s
                        buf = ''.join(buf)    buf = buf.getvalue()

Clearly the first is prettiest, which is why people use it. (It goes 
without saying that *pretty* is a matter of opinion.) It needs no extra 
conversion at the end, which is nice. But it's not cross-platform, and 
even in CPython it's a bit risky.

The middle is the most correct, but honestly, it's not that pretty. Many 
people *really* hate the fact that join is a string method and would 
rather write `buf.join('')`.

The third is, in my opinion, quite nice. With the status quo 
`buf.write(s)`, it's much less nice.
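
For anyone who wants to try the third column today, here is a rough 
sketch using a subclass. The name `AppendableStringIO` is hypothetical 
and is not part of the io module; the actual proposal is to put `+=` on 
StringIO and BytesIO themselves.

```python
import io


class AppendableStringIO(io.StringIO):
    """Hypothetical StringIO subclass where `buf += s` means `buf.write(s)`."""

    def __iadd__(self, s):
        self.write(s)
        return self            # `+=` rebinds the name to whatever we return


buf = AppendableStringIO()
for s in ['spam', ' ', 'eggs']:
    buf += s
result = buf.getvalue()
```

Returning `self` from `__iadd__` is what keeps `buf` bound to the same 
buffer after each `+=`.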

Paul's point about refactoring should be treated more seriously. If you 
have code that currently has a bunch of `buf += s` scattered around in 
many places, changing to the middle idiom is difficult:

1. you have to change the buffer initialisation;
2. you have to add an extra conversion to the end;
3. and you have to change every single `buf += s` to `buf.append(s)`.

With Paul's proposal, 1 and 2 still apply, but that's just two lines. 
Three if you include the `import io`. But step 3 is gone. You don't have 
to change any of the buffer concatenations to appends.

Now that's not such a big deal when all of the concatenations are right 
there in one little loop, but if they are scattered around dozens of 
methods or functions it can be a significant refactoring step.
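
To illustrate, here is a sketch of a class whose methods each do 
`self.buf += ...`. The `Report` class and the `AppendableStringIO` 
subclass are both made up for this example; the subclass only stands in 
for the proposed `+=` support.

```python
import io


class AppendableStringIO(io.StringIO):
    """Hypothetical stand-in for a StringIO that supports `+=`."""

    def __iadd__(self, s):
        self.write(s)
        return self


class Report:
    def __init__(self):
        self.buf = AppendableStringIO()    # step 1: was `self.buf = ''`

    def add_title(self, title):
        self.buf += title + '\n'           # untouched by the refactoring

    def add_row(self, key, value):
        self.buf += f'{key}: {value}\n'    # untouched by the refactoring

    def render(self):
        return self.buf.getvalue()         # step 2: was `return self.buf`
```

Only the initialisation and the final conversion change; every 
`self.buf += ...` line stays exactly as it was in the str version.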

> More generally, a StringIO is neither the obvious way 

If I were new to Python, and wanted to build a string, and knew that 
repeated concatenation was slow, I'd probably look for some sort of 
String Builder or String IO class before thinking of *list append*. 
Especially if I came from a Java background.

> nor the fastest way 

It's pretty close though.

On my test, accumulating 500,000 strings into a list versus a StringIO 
buffer, then building a string, took 27.5 versus 31.6 ms. Plain string 
`+=` took 36.4 ms. So StringIO is faster than the optimized string 
concat, and within arm's reach of list+join.
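
A sketch of how such a comparison could be run with timeit. This is not 
the script behind the numbers above: the test strings, the function 
names and the repeat count are all assumptions, and timings will vary 
by machine and interpreter.

```python
import io
import timeit

STRINGS = ['abc'] * 500_000    # assumed workload; the original data is unknown


def with_list(strings=STRINGS):
    buf = []
    for s in strings:
        buf.append(s)
    return ''.join(buf)


def with_stringio(strings=STRINGS):
    buf = io.StringIO()
    for s in strings:
        buf.write(s)
    return buf.getvalue()


def with_str_concat(strings=STRINGS):
    buf = ''
    for s in strings:
        buf += s               # relies on CPython's in-place concat optimization
    return buf


def report(repeat=5):
    """Print the best-of-`repeat` time for each approach, in milliseconds."""
    for fn in (with_list, with_stringio, with_str_concat):
        best = min(timeit.repeat(fn, number=1, repeat=repeat))
        print(f'{fn.__name__}: {best * 1000:.1f} ms')


if __name__ == '__main__':
    report()
```

Taking the minimum over several repeats is the usual way to reduce 
noise from other processes.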

Replacing `buf.write` with `+=` might, theoretically, shave off a bit of 
the overhead of attribute lookup. That would close the distance a 
fraction. And maybe there are other future optimizations that could 
follow. Or maybe not.


> nor the recommended way to build strings on the fly in Python, so 
> why do you agree with the OP that we need to make it better for that 
> purpose? Just to benefit people who want to write C++ instead of 
> Python?

If writing `buf += s` is writing C++ instead of Python, then you have 
spent much of this thread defending the optimization added in version 
2.4 to allow people to write C++ instead of Python. So why are you 
suddenly against it now when the underlying buffer changes from str to 
StringIO?

When I was younger and still smarting from being on the losing side of 
the Pascal vs C holy wars, I really hated the idea of adding `+=` to 
Python because it would encourage people to write C instead of Python. I 
got over it :-)


-- 
Steven