OverflowErrors on encode() a unicode string

Marc-Andre Lemburg Mon, 21 Dec 2009 01:24:41 -0800

Marc-Andre Lemburg <[email protected]> added the comment:

All string length calculations in Python 2.4 are done using C ints,
which are 32 bits wide even on 64-bit platforms.


Since UTF-8 can use up to 4 bytes per Unicode code point, the encoder
overallocates the needed chunk of memory to len*4 bytes. This
will go straight over the 2GB limit the 32-bit int imposes if
you try to encode a 512M code point Unicode string.
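
As a rough illustration of that arithmetic (a sketch only, not the actual
encoder code): the helper names below are made up for this example, and
INT_MAX assumes a signed 32-bit C int, as described above.

```python
# Largest value a signed 32-bit C int can hold (assumption: 32-bit int).
INT_MAX = 2**31 - 1


def worst_case_utf8_alloc(num_code_points):
    """Hypothetical helper: bytes reserved at 4 bytes per code point."""
    return num_code_points * 4


def fits_in_c_int(num_bytes):
    """Hypothetical helper: True if the size is representable as a C int."""
    return num_bytes <= INT_MAX


n = 512 * 1024 * 1024               # 512M code points
needed = worst_case_utf8_alloc(n)   # 4 bytes each -> 2**31 bytes
print(needed, fits_in_c_int(needed))
print(worst_case_utf8_alloc(n - 1), fits_in_c_int(worst_case_utf8_alloc(n - 1)))
```

With these assumptions, 512M code points need 2**31 bytes, one more than
INT_MAX, while a string one code point shorter still fits.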

The reason for using ints to represent string lengths is simple:
when the string API was designed, no one really expected that
anyone would work with 2GB strings in memory (large hard drives
held around 2GB at that time). Strings of such size are simply
not supported by Python 2.4.

BTW: I wouldn't really count on Python 2.4 working properly on
64-bit platforms. A lot of issues were fixed in Python 2.5
related to 32/64-bit differences.

----------
nosy: +lemburg
title: SystemError/MemoryError/OverflowErrors on encode() a unicode string -> 
SystemError/MemoryError/OverflowErrors on encode() a      unicode string

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue7551>
_______________________________________