[Python-ideas] Re: Explicitly defining a string buffer object (aka StringIO += operator)

Andrew Barnert via Python-ideas Sun, 29 Mar 2020 12:32:01 -0700

> On Mar 29, 2020, at 10:57, Paul Sokolovsky <pmis...@gmail.com> wrote:
> 
> 
> It is a well-known anti-pattern to use a string as a string buffer, to
> construct a long (perhaps very long) string piece-wise. A running
> example is:
> 
> buf = ""
> for i in range(50000):
>   buf += "foo"
> print(buf)
> 
> An alternative is to use a buffer-like object explicitly designed for
> incremental updates, which for Python is io.StringIO:


It’s usually an even better alternative to just put the strings into a list of 
strings (or to write a generator that yields them), and then pass that to the 
join method. This is recommended in the official Python FAQ. It’s usually 
about 40% faster than using StringIO or relying on the string-concat 
optimization in CPython, it’s efficient across all implementations of Python, 
and it’s obvious _why_ it’s efficient. It can sometimes take more memory, but 
the tradeoff is usually worth it.
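
As a minimal sketch of the two approaches being compared here (the helper names 
build_with_join, build_with_stringio and foo_pieces are made up for 
illustration, and no timings are claimed), both produce the same string as the 
quoted loop:

```python
import io


def foo_pieces(n):
    # Generator yielding the pieces one at a time, as in the quoted loop.
    for _ in range(n):
        yield "foo"


def build_with_join(pieces):
    # str.join accepts any iterable of strings and builds the result once.
    return "".join(pieces)


def build_with_stringio(pieces):
    # The StringIO alternative: write each piece, then take the value.
    buf = io.StringIO()
    for piece in pieces:
        buf.write(piece)
    return buf.getvalue()


joined = build_with_join(foo_pieces(50000))
written = build_with_stringio(foo_pieces(50000))
```

To compare speed on a particular interpreter, both functions can be wrapped in 
timeit; the results will vary by implementation and version.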

This has been well known in the Python community for decades. People coming 
from C++ look for something like stringstream and find StringIO; people coming 
from Java look for something like StringBuilder and build their own version 
around StringIO; people who are comfortable with Python use str.join. So 
third-party libraries that don’t do that are likely either (a) not expecting 
large amounts of data (and therefore probably suboptimal in other areas), or 
(b) written by someone who doesn’t really get Python.

So what is StringIO for? For being a file object, but in memory rather than 
representing a file. Its API is exactly the same as every other file object, 
because that’s the whole point of it.
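
A small sketch of that use, assuming a hypothetical write_report function that 
only knows it was handed some text file object:

```python
import io


def write_report(out, rows):
    # `out` can be any text file object: an open file, sys.stdout,
    # or an in-memory io.StringIO.
    for name, value in rows:
        print(f"{name}: {value}", file=out)


memory_file = io.StringIO()
write_report(memory_file, [("spam", 1), ("eggs", 2)])
report_text = memory_file.getvalue()
```

The function never needs to know whether it is writing to disk or to memory, 
which is why StringIO shares the file API instead of having its own.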

> As can be seen, this requires changing the way buffer is constructed
> (usually in one place), the way buffer value is taken (usually in one
> place), but more importantly, it requires changing each line which
> adds content to a buffer, and there can be many of those for more
> complex algorithms, leading to a code less clear than the original code,
> requiring noise-like changes, and complicating updates to 3rd-party code
> which needs optimization.
> 
> To address this, this RFC proposes to add an __iadd__ method (i.e.
> implementing "+=" operator) to io.StringIO and io.BytesIO objects,
> making it the exact alias of .write() method. This will allow for
> the code very parallel to the original str-using code:

So your goal is to allow people to use badly-written third-party libs designed 
around the string-concat antipattern, without fixing those libs, by feeding 
them StringIO objects when they expected str objects?

This seems like a solution to a theoretical problem that might work for some 
instances of that problem. But do you have any actual examples of third-party 
libs that have this problem, and that (obviously) break if you give them 
StringIO objects, but would not break when passed a StringIO with __iadd__?
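
To make the question concrete, here is a rough sketch of the scenario. The 
AppendableStringIO subclass and the concat_style function are hypothetical 
stand-ins, not anything in the stdlib or in a real library. Note that this 
__iadd__ returns self rather than being a literal alias of write(), which is 
an assumption about how the proposal would have to work in practice, since 
write() returns a character count:

```python
import io


class AppendableStringIO(io.StringIO):
    # Hypothetical subclass approximating the proposed behavior.
    def __iadd__(self, text):
        self.write(text)
        # Return self so the name stays bound to the buffer after "+=".
        return self


def concat_style(buf):
    # Stand-in for library code written around the string-concat pattern.
    for _ in range(3):
        buf += "foo"
    return buf


# A plain StringIO does not support "+=" at all.
plain_error = None
try:
    concat_style(io.StringIO())
except TypeError as exc:
    plain_error = exc

# With __iadd__ defined, the same code runs.
result = concat_style(AppendableStringIO())
text = result.getvalue()

# Any other str operation the library performs on the value still fails.
other_str_use_fails = False
try:
    len(result)
except TypeError:
    other_str_use_fails = True
```

So the trick only helps code that touches the buffer exclusively through 
"+=" and never treats it as a str in any other way.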

> But it wasn't always like that, with CPython2.7.17:
> 
> $ python2.7 str_iadd-vs-StringIO_write.py 
> 2.10510993004
> 0.0399420261383
> 
> But Python2 is dead, right?

Yes. Not as in “nobody will ever run it again”, but definitely as in “no new 
feature you add to Python will be backported”. Python 2.7 the language and 
CPython 2.7 the implementation have been feature-frozen for years now, and now 
they’re not even supported by the Python organization at all. So, trying to 
improve the behavior of Python 2.7 code by making a proposal for Python won’t 
get you anywhere. Adding StringIO.__iadd__ to Python 3.10 will not help anyone 
using Python 2.7.

In fact, even if you somehow convinced everyone to make the extraordinary 
decision to re-open Python 2.7 and make a new 2.7.18 release with this feature 
backported, it still wouldn’t help the vast majority of people using Python 
2.7, because most people using Python 2.7 are using stable systems with stable 
versions that they don’t update for years. That’s why they’re still using 2.7 
in the first place: because 2.7.16 is what comes with the Linux LTS they’ve 
settled on for deployment, or it’s what comes with the macOS version they use 
for their dev boxes, or Jython doesn’t have a 3.x version yet, or whatever. So 
a new feature in 2.7.18 wouldn’t get to them for years, if ever.

It’s also worth noting that the io module is very slow in most Python 2.x 
implementations. There’s a separate (and older) StringIO module, and for 
CPython an accelerated cStringIO, and you almost certainly want to use those, 
not io, here. (Except, of course, that what you really want to use is join 
anyway.)

> Ok, let's see how Jython3 and IronPython3
> fare. To my surprise, there're no (public releases of) such. Both
> projects sit firmly in the Python2 territory.

The last IronPython release, 2.7.9, was in 2018. As the release notes for that 
version say, “With this release, we will shift the majority of work to 
IronPython3.” Of course IronPython3 isn’t ready for prime time yet, but it’s 
not because they’re still firmly in Python2 territory and still making major 
improvements to their 2.7 branch, it’s because it’s taking a long time to 
finish their 3.x branch (in part because they no longer have Microsoft and 
Unity throwing resources at the project). They’re not adding new features to 
2.7 any more than CPython is. (They are working on a 2.7.10; but it’s just 
2.7.9 with support for more .NET runtimes plus porting some security fixes from 
the last CPython 2.7 stdlib.) I don’t know the situation with Jython as well, 
but I believe it’s similar. 

> Consequently, other implementations have 2 choices:
> 
> 1. Succumb to applying the same mis-optimization for string type as
> CPython3. (With the understanding that for speed-optimized projects,
> implementing mis-optimizations will eat into performance budget, and
> for memory-optimized projects, it likely will lead to noticeable
> memory bloat.)
> 2. Struggle against inefficient-by-concept usage, and promote usage of
> the correct object types for incremental construction of string content.
> This would require improving ergonomics of existing string buffer
> object, to make its usage less painful for both writing new code and
> refactoring existing.

3. Recognize that Python and CPython have been promoting str.join for this 
problem for decades, and that most performance-critical code already uses it, 
so make sure that solution is efficient. Also recognize that poorly-written 
code is uncommon but does exist, and that optimizing it may take a bit more 
work than a 1-line change, but that’s acceptable, and helping with it is not 
the responsibility of any alternate Python implementation.

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/HAUQFGGDALRU3NFPGLQN2SSBNNWDBMKZ/
Code of Conduct: http://python.org/psf/codeofconduct/
