[issue25709] Problem with string concatenation and utf-8 cache.

2016-02-12 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

b9c8f1c80f47 added a new head. Should we merge 3.3 -> 3.4 -> 3.5 -> default?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25709] Problem with string concatenation and utf-8 cache.

2016-02-12 Thread Ned Deily

Changes by Ned Deily :


--
stage: patch review -> resolved




[issue25709] Problem with string concatenation and utf-8 cache.

2016-02-11 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
assignee: serhiy.storchaka -> georg.brandl




[issue25709] Problem with string concatenation and utf-8 cache.

2016-02-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> I wait only Greg's approving for 3.3. If I'll not get it in a day, I'll 
> commit the patch for 3.4+.

Maybe it was my fault. I made a mistake in Georg's name.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2016-02-11 Thread Georg Brandl

Georg Brandl added the comment:

Backpicked to 3.3. Sorry for the wait.

--
resolution:  -> fixed
status: open -> closed




[issue25709] Problem with string concatenation and utf-8 cache.

2016-02-11 Thread Roundup Robot

Roundup Robot added the comment:

New changeset b9c8f1c80f47 by Serhiy Storchaka in branch '3.3':
Issue #25709: Fixed problem with in-place string concatenation and utf-8 cache.
https://hg.python.org/cpython/rev/b9c8f1c80f47

--




[issue25709] Problem with string concatenation and utf-8 cache.

2016-02-11 Thread Georg Brandl

Georg Brandl added the comment:

Actually I prefer Greg to Gerg, so it's only half bad. :D

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-12-06 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 376b100107ba by Serhiy Storchaka in branch '3.5':
Issue #25709: Fixed problem with in-place string concatenation and utf-8 cache.
https://hg.python.org/cpython/rev/376b100107ba

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-12-05 Thread Larry Hastings

Larry Hastings added the comment:

I cherry-picked this for 3.5.1.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-12-03 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
versions:  -Python 3.4, Python 3.5, Python 3.6




[issue25709] Problem with string concatenation and utf-8 cache.

2015-12-02 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 67718032badb by Serhiy Storchaka in branch '3.4':
Issue #25709: Fixed problem with in-place string concatenation and utf-8 cache.
https://hg.python.org/cpython/rev/67718032badb

New changeset a0e2376768dc by Serhiy Storchaka in branch '3.5':
Issue #25709: Fixed problem with in-place string concatenation and utf-8 cache.
https://hg.python.org/cpython/rev/a0e2376768dc

New changeset 9e800b2aeeac by Serhiy Storchaka in branch 'default':
Issue #25709: Fixed problem with in-place string concatenation and utf-8 cache.
https://hg.python.org/cpython/rev/9e800b2aeeac

--
nosy: +python-dev




[issue25709] Problem with string concatenation and utf-8 cache.

2015-12-02 Thread STINNER Victor

STINNER Victor added the comment:

> New changeset 67718032badb by Serhiy Storchaka in branch '3.4':

Thanks.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-12-02 Thread Larry Hastings

Larry Hastings added the comment:

Is this going in soon?  I want to cherry-pick this for 3.5.1, which I tag in 
about 80 hours.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-12-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I wait only Greg's approving for 3.3. If I'll not get it in a day, I'll commit 
the patch for 3.4+.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-12-02 Thread STINNER Victor

STINNER Victor added the comment:

Please commit right now to 3.4+. Backport to 3.3 can be done later.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-24 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Fixed one more bug (thanks again, Victor). The test is improved: it no longer 
relies on an implementation detail of a particular builtin.

--
Added file: http://bugs.python.org/file41146/issue25709_4.patch




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 24.11.2015 02:30, Steven D'Aprano wrote:
> 
> Steven D'Aprano added the comment:
> 
> On Mon, Nov 23, 2015 at 09:48:46PM +, STINNER Victor wrote:
> 
>> * the string has a cached UTF-8 byte string (ex: int(s) was called before 
>> the resize)
> 
> Why do strings cache their UTF-8 encoding?
> 
> I presume that some of Python's internals rely on the UTF-8 encoding 
> rather than the internal Latin-1/UCS-2/UTF-32 representation (PEP 393). 
> E.g. I infer from the above that int(s) parses the UTF-8 representation 
> of s rather than the internal representation. Is that right?
> 
> Nevertheless, I wonder why the UTF-8 representation is cached. Is it 
> that expensive to generate that it can't be done on the fly, as needed? 
> As it stands now, non-ASCII strings may be up to twice as big as they 
> need be, once you include the UTF-8 cache. And, as this bug painfully 
> shows, the problem with caches is that you run the risk of the cache 
> being out of date.

The cache is needed because it's the only way to get a direct
C char* to the object's UTF-8 representation without having to
worry about memory management on the caller's side. Not having
access to this would break a lot of code using the Python
C API, since the cache is there per design. The speedup aspect
is secondary.
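
The point about callers getting a pointer they do not have to manage can be
observed from Python. The following is a minimal sketch, assuming CPython
(ctypes.pythonapi is CPython-specific); the helper name utf8_pointer is made
up for illustration. It asks for the UTF-8 form of the same string twice and
compares the addresses that come back.

```python
import ctypes

# PyUnicode_AsUTF8AndSize is a real CPython C API function; going through
# ctypes is just a convenient way to call it from Python for this sketch.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
as_utf8.restype = ctypes.c_void_p  # keep the raw address, do not copy bytes


def utf8_pointer(s):
    """Return (address, length) of the UTF-8 buffer CPython hands out for s."""
    size = ctypes.c_ssize_t()
    address = as_utf8(s, ctypes.byref(size))
    return address, size.value


# Build a non-ASCII string at run time ("αβγ").
text = "".join(map(chr, (0x3B1, 0x3B2, 0x3B3)))

first = utf8_pointer(text)
second = utf8_pointer(text)

# Both calls are expected to return the same buffer, owned by the str object.
same_buffer = first == second
cached_bytes = ctypes.string_at(first[0], first[1])
```

The caller never frees anything here: the buffer lives as long as the string
object does, which is the property C extension code depends on.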

Unicode objects are normally immutable, but there are a few
corner cases during the initialization of the objects where
they are in fact mutable for a short while, e.g. when
creating an empty object which is then filled with data and
resized to the final length before passing it back to
Python.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-24 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> Why do strings cache their UTF-8 encoding?

Mainly for compatibility with the existing C API. The common way to parse 
arguments in a function implemented in C is to use the argument parsing API: 
PyArg_ParseTuple, PyArg_ParseTupleAndKeywords, or PyArg_Parse. Most format 
codes for Unicode strings return a C pointer to a char array. For that, the 
encoded string has to be kept somewhere at least for the duration of the C 
function call. Since the PyArg_Parse* functions don't let the caller provide 
storage for the encoded string, it has to be saved in the Unicode object. This 
is not new to PEP 393 or Python 3; in Python 2, Unicode objects also keep a 
cached encoded version.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-24 Thread STINNER Victor

STINNER Victor added the comment:

issue25709_4.patch now looks good to me, but I added some minor comments on the 
review.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-24 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Georg, I'm asking for this fix to be applied to 3.3.

--
nosy: +georg.brandl
versions: +Python 3.3




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread Terry J. Reedy

Terry J. Reedy added the comment:

It would be good to get this in 3.4.4.

--
components: +Library (Lib) -IDLE
nosy: +benjamin.peterson, ezio.melotti, haypo, lemburg, pitrou -kbk, roger.serwy
title: greek alphabet bug it is very disturbing... -> Problem with string 
concatenation and utf-8 cache.




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Added a test that doesn't use pickle.

--
nosy: +larry
priority: high -> release blocker
Added file: http://bugs.python.org/file41141/issue25709_2.patch




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread Eryk Sun

Eryk Sun added the comment:

Serhiy, when does sharing UTF-8 data occur in a compact object? It has to be 
ASCII since non-ASCII UTF-8 isn't sharable, but PyASCIIObject doesn't have the 
utf8 field. So it has to be a PyCompactUnicodeObject. But isn't ASCII always 
allocated as a PyASCIIObject? I need a bit of help getting from point A to 
point B. Thanks.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The updated patch fixes a bug found by Victor and addresses his other comments. 
Many thanks, Victor!

--
Added file: http://bugs.python.org/file41142/issue25709_3.patch




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread Steven D'Aprano

Steven D'Aprano added the comment:

On Mon, Nov 23, 2015 at 09:48:46PM +, STINNER Victor wrote:

> * the string has a cached UTF-8 byte string (ex: int(s) was called before the 
> resize)

Why do strings cache their UTF-8 encoding?

I presume that some of Python's internals rely on the UTF-8 encoding 
rather than the internal Latin-1/UCS-2/UTF-32 representation (PEP 393). 
E.g. I infer from the above that int(s) parses the UTF-8 representation 
of s rather than the internal representation. Is that right?

Nevertheless, I wonder why the UTF-8 representation is cached. Is it 
that expensive to generate that it can't be done on the fly, as needed? 
As it stands now, non-ASCII strings may be up to twice as big as they 
need be, once you include the UTF-8 cache. And, as this bug painfully 
shows, the problem with caches is that you run the risk of the cache 
being out of date.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread STINNER Victor

STINNER Victor added the comment:

Steven D'Aprano added the comment:
> the problem with caches is that you run the risk of the cache being out
> of date.

Since strings are immutable, it's not a big deal. We control where strings
are modified (unicodeobject.c).

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread Eryk Sun

Eryk Sun added the comment:

> Why do strings cache their UTF-8 encoding?

Strings also cache the wide-string representation. For example:

from ctypes import *
s = '\241\242\243'
pythonapi.PyUnicode_AsUnicodeAndSize(py_object(s), None)
pythonapi.PyUnicode_AsUTF8AndSize(py_object(s), None)

>>> hex(id(s))
'0x769f8e98'

(gdb) p *(PyCompactUnicodeObject *)0x769f8e98
$1 = {_base = {ob_base = {_ob_next = 0x7697f890,
                          _ob_prev = 0x76a04d40,
                          ob_refcnt = 1,
                          ob_type = 0x89d860 },
               length = 3,
               hash = -5238559198920514942,
               state = {interned = 0,
                        kind = 1,
                        compact = 1,
                        ascii = 0,
                        ready = 1},
               wstr = 0x769690a0 L"¡¢£"},
      utf8_length = 6,
      utf8 = 0x7696b7e8 "¡¢£",
      wstr_length = 3}

(gdb) p (char *)((PyCompactUnicodeObject *)0x769f8e98 + 1)
$2 = 0x769f8ef0 "\241\242\243"

This object uses 4 bytes for the null-terminated Latin-1 string, which directly 
follows the PyCompactUnicodeObject struct. It uses 7 bytes for the UTF-8 
string. It uses 16 bytes for the wchar_t string (4 bytes per wchar_t).
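
The UTF-8 part of that accounting can also be checked without gdb. The sketch
below assumes CPython, where sys.getsizeof() on a str counts a separately
allocated UTF-8 cache; the helper name cache_cost is made up for illustration.
It measures the same three Latin-1 characters and, for contrast, a pure ASCII
string, whose UTF-8 form is the character data itself and needs no extra
buffer.

```python
import ctypes
import sys

# Real CPython C API function, called through ctypes for convenience.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
as_utf8.restype = ctypes.c_void_p


def cache_cost(s):
    """Bytes added to sys.getsizeof(s) by populating the UTF-8 cache."""
    before = sys.getsizeof(s)
    as_utf8(s, None)  # requesting the UTF-8 form fills the cache
    return sys.getsizeof(s) - before


# Strings are built at run time so each one starts without a cache.
latin1 = "".join(map(chr, (0xA1, 0xA2, 0xA3)))      # same as '\241\242\243'
ascii_only = "".join(map(chr, (0x61, 0x62, 0x63)))  # "abc"

latin1_cost = cache_cost(latin1)      # encoded bytes plus a terminating NUL
ascii_cost = cache_cost(ascii_only)   # ASCII data doubles as its own UTF-8
```

For the Latin-1 string the difference should come out as 7, the same figure as
above (6 encoded bytes plus the terminator), and as 0 for the ASCII string.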

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread STINNER Victor

STINNER Victor added the comment:

I reviewed issue25709_2.patch.

> It would be good to get this in 3.4.4.

Since it's a major bug in the Unicode implementation, it may be worth fixing in 
Python 3.3 as well. The bug was introduced in Python 3.3 by PEP 393.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread Antoine Pitrou

Antoine Pitrou added the comment:

3.3 is presumably in security mode. Anyone using it would have had to live with 
the bug for a long time already.

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread random832

random832 added the comment:

> unicode_modifiable in Objects/unicodeobject.c should return 0 if there's 
> cached PyUnicode_UTF8 data. In this case PyUnicode_Append won't operate in 
> place but instead concatenate a new string.

Shouldn't it still operate in place but clear it? Operating in place is only an 
option if the old string has no references and will be discarded, right?
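
The two options being weighed here can be modelled in a few lines. This is a
toy sketch only: the class CachedText and its methods are hypothetical, and
CPython's real logic is C code in Objects/unicodeobject.c. Option 1 treats an
object with a cache as not modifiable and builds a new one; option 2 resizes in
place and discards the stale cache.

```python
class CachedText:
    """Toy stand-in for a str object that keeps a derived UTF-8 cache.

    Hypothetical model for illustration; not CPython's implementation.
    """

    def __init__(self, text):
        self.chars = list(text)  # the "real" character data
        self._utf8 = None        # lazily built cache of the encoded form

    def utf8(self):
        if self._utf8 is None:
            self._utf8 = "".join(self.chars).encode("utf-8")
        return self._utf8

    def has_cache(self):
        return self._utf8 is not None

    def append_copying(self, more):
        """Option 1: a cached object is not modifiable, so build a new one."""
        if self.has_cache():
            return CachedText("".join(self.chars) + more)
        self.chars.extend(more)
        return self

    def append_clearing(self, more):
        """Option 2: always resize in place, but drop the stale cache."""
        self.chars.extend(more)
        self._utf8 = None
        return self


a = CachedText("αβ")
a.utf8()                    # populate the cache
b = a.append_copying("γ")   # a is left untouched, b is a new object

c = CachedText("αβ")
c.utf8()
d = c.append_clearing("γ")  # same object, cache rebuilt on demand
```

Either way the invariant is the same: after an append, no object may still
expose encoded bytes that were computed from the shorter string.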

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread Larry Hastings

Larry Hastings added the comment:

I read some comments here and on the patches.  Serhiy's patch adds some code 
and Victor says you can't call that macro on this object and wow this is badly 
broken.  Can someone explain in simpler terms what's so broken, exactly?

--




[issue25709] Problem with string concatenation and utf-8 cache.

2015-11-23 Thread STINNER Victor

STINNER Victor added the comment:

" and wow this is badly broken "

I mean the current code is badly broken.

The bug is that sometimes, when a string is resized (which doesn't make sense, 
strings are immutable, right? :-D), the cached UTF-8 string can become 
corrupted (old pointer not updated).

It occurs if

* the string is resized (ex: "s += s2")
* the string has a cached UTF-8 byte string (ex: int(s) was called before the 
resize)
* the resize moves the memory block to a new address

Ok, it's probably unlikely to get these 3 conditions, but from my point of 
view, it's a major bug because it's in a Python fundamental type (str), it's 
not a bug in user code and it cannot be worked around (easily).
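
The three conditions above can be turned into a small self-check. This is a
sketch under stated assumptions, not the test from the patch: it assumes
CPython, uses ctypes to populate the UTF-8 cache instead of int(s), and the
helper names cached_utf8 and grow_and_check are made up. Whether a given
resize actually moves the memory block (the third condition) is up to the
allocator, so the loop simply repeats the append many times.

```python
import ctypes

# Real CPython C API function, called through ctypes for convenience.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
as_utf8.restype = ctypes.c_void_p


def cached_utf8(s):
    """Copy out the UTF-8 bytes CPython caches on s, filling the cache if needed."""
    size = ctypes.c_ssize_t()
    address = as_utf8(s, ctypes.byref(size))
    return ctypes.string_at(address, size.value)


def grow_and_check(char, rounds=200):
    """Repeat 's += char' on a string with a populated cache; report correctness."""
    unit = char.encode("utf-8")  # expected bytes, computed independently of s
    s = ""
    for n in range(1, rounds + 1):
        s += char                       # may resize s in place
        if cached_utf8(s) != unit * n:  # reads (and fills) the UTF-8 cache
            return False
    return True


# One character per storage width: ASCII, Latin-1, BMP, and astral.
results = {
    hex(ord(c)): grow_and_check(c)
    for c in ("a", "\xe9", "\u20ac", "\U0010ffff")
}
```

On an interpreter that includes the fix, every entry in results is expected to
be True.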

--
