[issue7551] SystemError/MemoryError/OverflowErrors on encode() a unicode string

2009-12-20 Thread Andreas Jung
New submission from Andreas Jung: We encountered some pretty bizarre behavior in Python 2.4.6 while encoding a 600MB unicode string 'data':
Python 2.4.6 (8GB RAM, 64 bit)
(Pdb) type(data)
<type 'unicode'>
(Pdb) len(data)
601794657
(Pdb) data2=data.encode('utf-8')
*** SystemError: Negative size passed to

[issue7551] SystemError/MemoryError/OverflowErrors on encode() a unicode string

2009-12-20 Thread Mark Dickinson
Mark Dickinson added the comment: Is the first machine also a Linux machine? Perhaps the difference is that the first machine has a wide-unicode build (i.e., it uses UCS4 internally) and the other doesn't? Unfortunately there's not much that the python-devs can do about this unless the prob
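
A quick way to answer the narrow-versus-wide question is to look at sys.maxunicode. The sketch below is only an illustration: the helper name describe_unicode_build is made up for this example, and on Python 3.3 and later the distinction no longer exists, so the check always reports the wide value there.

```python
import sys

def describe_unicode_build(maxunicode=sys.maxunicode):
    """Classify a build from its sys.maxunicode value (hypothetical helper)."""
    # Narrow (UCS2) builds of Python 2 report 0xFFFF; wide (UCS4) builds
    # report 0x10FFFF. Python 3.3+ always reports 0x10FFFF.
    if maxunicode == 0xFFFF:
        return "narrow (UCS2)"
    return "wide (UCS4)"

print(describe_unicode_build())
```

Run on each of the two machines, this would show whether their builds really differ in internal character width.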

[issue7551] SystemError/MemoryError/OverflowErrors on encode() a unicode string

2009-12-20 Thread Andreas Jung
Andreas Jung added the comment: Both systems are Linux systems running a narrow Python build. The problem does not occur with Python 2.5 or 2.6. Unfortunately this error occurs with Zope 2, which is tied to Python 2.4 (at least in versions prior to Zope 2.12). -- status: pending -> o

[issue7551] SystemError/MemoryError/OverflowErrors on encode() a unicode string

2009-12-20 Thread Mark Dickinson
Mark Dickinson added the comment: Well, the signature of PyUnicode_Encode in Python 2.4 (see Objects/unicodeobject.c) is: PyObject *PyUnicode_Encode(const Py_UNICODE *s, int size, const char *encoding, const char

[issue7551] SystemError/MemoryError/OverflowErrors on encode() a unicode string

2009-12-20 Thread Martin v. Löwis
Martin v. Löwis added the comment: Just to support Mark's decision: Python 2.4 is no longer maintained; you are on your own with any problems you encounter with it. So closing it as "won't fix" would also have been appropriate. The same holds for 2.5, unless you can demonstrate this to cause se

[issue7551] SystemError/MemoryError/OverflowErrors on encode() a unicode string

2009-12-21 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: All string length calculations in Python 2.4 are done using C ints, which are 32-bit even on 64-bit platforms. Since UTF-8 can use up to 4 bytes per Unicode code point, the encoder overallocates the needed chunk of memory to len*4 bytes. This will go straight
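
The arithmetic behind the negative size can be checked with a few lines of Python. This is a sketch rather than the interpreter's actual code path: the function name is invented for the example, and it assumes the 32-bit signed multiplication wraps around in two's-complement fashion, which is what typical platforms do in practice even though C leaves signed overflow undefined.

```python
INT32_MAX = 2**31 - 1

def worst_case_utf8_size(length):
    """Return (exact, wrapped) sizes for a len*4 worst-case UTF-8 buffer.

    'wrapped' simulates storing the product in a 32-bit signed int,
    assuming two's-complement wraparound (an assumption of this sketch).
    """
    exact = length * 4
    wrapped = (exact + 2**31) % 2**32 - 2**31
    return exact, wrapped

# Length reported in the original submission.
exact, wrapped = worst_case_utf8_size(601794657)
print(exact > INT32_MAX)  # True: the request no longer fits in 32 bits
print(exact, wrapped)     # 2407178628 -1887788668
```

The length itself (601794657) fits comfortably in a 32-bit int; only the len*4 product exceeds 2**31 - 1, which would be consistent with the "Negative size passed to" error in the report.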