Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Alexander Belopolsky
On Sat, Jan 29, 2011 at 12:03 PM, Stefan Behnel wrote: .. > What about the character property functions? > > http://docs.python.org/py3k/c-api/unicode.html#unicode-character-properties > > Will they be adapted to accept Py_UCS4 instead of Py_UNICODE? They have been already. See revision 84177.

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Stefan Behnel
"Martin v. Löwis", 24.01.2011 21:17: I'd like to propose PEP 393, which takes a different approach, addressing both problems simultaneously: by getting a flexible representation (one that can be either 1, 2, or 4 bytes), we can support the full range of Unicode on all systems, but still use only

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Antoine Pitrou
On Sat, 29 Jan 2011 11:00:48 +0100 Stefan Behnel wrote: > > I know, that's not what I meant. But this PEP would enable a C API that > provides direct access to the underlying buffer. Just as is currently > provided for the Py_UNICODE array, but with a stable ABI because the buffer > type won't

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Nick Coghlan
On Sat, Jan 29, 2011 at 8:00 PM, Stefan Behnel wrote: > OTOH, one could argue that this is already partly provided by the generic > buffer API. Which won't be part of the stable ABI until 3.3 - there are some discrepancies between PEP 3118 and the actual implementation that we need to sort out fi

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Stefan Behnel
"Martin v. Löwis", 29.01.2011 10:05: None of the functions in this PEP become part of the stable ABI. I think that's only part of the truth. This PEP can potentially have an impact on the stable ABI in the sense that the build-time size of Py_UNICODE may no longer be important for extensions th

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-29 Thread Martin v. Löwis
>> None of the functions in this PEP become part of the stable ABI. > > I think that's only part of the truth. This PEP can potentially have an > impact on the stable ABI in the sense that the build-time size of > Py_UNICODE may no longer be important for extensions that work on > unicode buffers

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Stefan Behnel
"Martin v. Löwis", 24.01.2011 21:17: I have been thinking about Unicode representation for some time now. This was triggered, on the one hand, by discussions with Glyph Lefkowitz (who complained that his server app consumes too much memory), and Carl Friedrich Bolz (who profiled Python applicatio

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Stefan Behnel
"Martin v. Löwis", 24.01.2011 21:17: I have been thinking about Unicode representation for some time now. This was triggered, on the one hand, by discussions with Glyph Lefkowitz (who complained that his server app consumes too much memory), and Carl Friedrich Bolz (who profiled Python applicatio

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Stefan Behnel
"Martin v. Löwis", 28.01.2011 22:49: And indeed, when Cython is updated to 3.3, it shouldn't access the UTF-8 representation for such a loop. Instead, it should access the str representation Sure. Regarding Cython specifically, the above will still be *possible* under the proposal, given tha

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Josiah Carlson
Pardon me for this drive-by posting, but this thread smells a lot like this old thread (don't be afraid to read it all, there are some good points in there; not directed at you Martin, but at all readers/posters in this thread)... http://mail.python.org/pipermail/python-3000/2006-September/003795.

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Martin v. Löwis
> The nice thing about Py_UNICODE is that it basically gives you native > Unicode code points directly, without needing to decode UTF-8 byte runs > and the like. In Cython, it allows you to do things like this: > > def test_for_those_characters(unicode s): > for c in s: > #

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Stefan Behnel
Florian Weimer, 28.01.2011 15:27: * Stefan Behnel: The nice thing about Py_UNICODE is that it basically gives you native Unicode code points directly, without needing to decode UTF-8 byte runs and the like. In Cython, it allows you to do things like this: def test_for_those_characters(uni

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Florian Weimer
* Stefan Behnel: > The nice thing about Py_UNICODE is that it basically gives you native > Unicode code points directly, without needing to decode UTF-8 byte > runs and the like. In Cython, it allows you to do things like this: > > def test_for_those_characters(unicode s): > for c in s

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Stefan Behnel
Florian Weimer, 28.01.2011 10:35: * Stefan Behnel: "Martin v. Löwis", 24.01.2011 21:17: The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation. It's too bad this isn't initialised

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-28 Thread Florian Weimer
* Stefan Behnel: > "Martin v. Löwis", 24.01.2011 21:17: >> The Py_UNICODE type is still supported but deprecated. It is always >> defined as a typedef for wchar_t, so the wstr representation can double >> as Py_UNICODE representation. > > It's too bad this isn't initialised by default, though. Py_

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Stefan Behnel
"Martin v. Löwis", 28.01.2011 01:02: Am 27.01.2011 23:53, schrieb Stefan Behnel: "Martin v. Löwis", 24.01.2011 21:17: If the string is created directly with the canonical representation (see below), this representation doesn't take a separate memory block, but is allocated right after the PyUni

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
On 27.01.2011 23:53, Stefan Behnel wrote: > "Martin v. Löwis", 24.01.2011 21:17: >> If the string is created directly with the canonical representation >> (see below), this representation doesn't take a separate memory block, >> but is allocated right after the PyUnicodeObject struct. > > Does t

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Stefan Behnel
"Martin v. Löwis", 24.01.2011 21:17: If the string is created directly with the canonical representation (see below), this representation doesn't take a separate memory block, but is allocated right after the PyUnicodeObject struct. Does this mean it's supposed to become a PyVarObject? Antoine

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Antoine Pitrou
> > Incidentally, to slightly reduce the overhead of the unicode objects, > > there's this proposal: http://bugs.python.org/issue1943 > > I wonder what aspects of this patch and discussion should be integrated > into the PEP. The notion of allocating the memory in the same block is > already conside

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Gregory P. Smith
BTW, has anyone looked at what other languages with a native unicode type do for their implementations if any of them attempt to conserve ram?

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
> I agree. After all, CPython is lucky to have it available. It wouldn't > be the first time that we duplicate looping code based on the input > type. However, like the looping code, it will also complicate all > indexing code at runtime as it always needs to test which of the > representations is

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
On 27.01.2011 20:06, Stefan Behnel wrote: > "Martin v. Löwis", 24.01.2011 21:17: >> The Py_UNICODE type is still supported but deprecated. It is always >> defined as a typedef for wchar_t, so the wstr representation can double >> as Py_UNICODE representation. > > It's too bad this isn't initiali

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
> From my first impression, I'm > not too thrilled by the prospect of making the Unicode implementation > more complicated by having three different representations on each > object. Thanks, added as a concern. > I also don't see how this could save a lot of memory. As an example > take a French

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
On 25.01.2011 12:08, Nick Coghlan wrote: > On Tue, Jan 25, 2011 at 6:17 AM, "Martin v. Löwis" wrote: >> A new function PyUnicode_AsUTF8 is provided to access the UTF-8 >> representation. It is thus identical to the existing >> _PyUnicode_AsString, which is removed. The function will compute the

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Stefan Behnel
James Y Knight, 27.01.2011 21:26: On Jan 27, 2011, at 2:06 PM, Stefan Behnel wrote: "Martin v. Löwis", 24.01.2011 21:17: The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation. It

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
> Repetition of "11"; I'm guessing that the 2byte/UCS-2 should read "10", > so that they give the width of the char representation. Thanks, fixed. >> 00 => null pointer > > Naturally this assumes that all pointers are at least 4-byte aligned (so > that they can be masked off). I assume that t
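
The idea discussed above, keeping representation flags in the low-order bits of an aligned pointer, can be illustrated with plain integers standing in for addresses. This is only a sketch: the tag values 00 (null pointer) and 10 (2-byte) follow the quoted exchange, while 01 and 11 and all function names are assumptions made for illustration.

```python
TAG_BITS = 2
TAG_MASK = (1 << TAG_BITS) - 1  # 0b11: the two low-order bits

# 0b00 (null pointer) and 0b10 (2-byte) come from the quoted discussion;
# 0b01 and 0b11 are assumed values for the remaining kinds.
KIND_NULL, KIND_1BYTE, KIND_2BYTE, KIND_4BYTE = 0b00, 0b01, 0b10, 0b11


def pack(address: int, kind: int) -> int:
    """Store a 2-bit kind in the low bits of a 4-byte-aligned address."""
    if address & TAG_MASK:
        raise ValueError("address must be at least 4-byte aligned")
    if not 0 <= kind <= TAG_MASK:
        raise ValueError("kind must fit in two bits")
    return address | kind


def unpack(word: int) -> tuple:
    """Split a tagged word back into (address, kind)."""
    return word & ~TAG_MASK, word & TAG_MASK


word = pack(0x1000, KIND_2BYTE)
address, kind = unpack(word)
print(hex(address), bin(kind))
```

The alignment check is the point raised in the reply: the scheme only works if the two low bits of every real pointer are guaranteed to be zero.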

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Glenn Linderman
On 1/27/2011 12:26 PM, James Y Knight wrote: On Jan 27, 2011, at 2:06 PM, Stefan Behnel wrote: "Martin v. Löwis", 24.01.2011 21:17: The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE represent

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
> I believe the intent this pep is aiming at is for the existing in > memory structure to be compatible with already compiled binary > extension modules without having to recompile them or change the APIs > they are using. No, binary compatibility is not achieved. ABI-conforming modules will conti

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Martin v. Löwis
> So, the only criticism I have, intuitively, is that the unicode > structure seems to become a bit too large. For example, I'm not sure you > need a generic (pointer, size) pair in addition to the > representation-specific ones. It's not really a generic pointer, but rather a variable-sized point

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread James Y Knight
On Jan 27, 2011, at 2:06 PM, Stefan Behnel wrote: > "Martin v. Löwis", 24.01.2011 21:17: >> The Py_UNICODE type is still supported but deprecated. It is always >> defined as a typedef for wchar_t, so the wstr representation can double >> as Py_UNICODE representation. > > It's too bad this isn't in

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Stefan Behnel
"Martin v. Löwis", 24.01.2011 21:17: The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation. It's too bad this isn't initialised by default, though. Py_UNICODE is the only represen

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-27 Thread Antoine Pitrou
On Wednesday 26 January 2011 at 21:50 -0800, Gregory P. Smith wrote: > > > > Incidentally, to slightly reduce the overhead of the unicode objects, > > there's this proposal: http://bugs.python.org/issue1943 > > Interesting. But that aims more at cpu performance than memory > overhead. What I see i

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-26 Thread Gregory P. Smith
On Mon, Jan 24, 2011 at 3:20 PM, Antoine Pitrou wrote: > On Tuesday 25 January 2011 at 00:07 +0100, "Martin v. Löwis" wrote: >> >> I'd like to propose PEP 393, which takes a different approach, >> >> addressing both problems simultaneously: by getting a flexible >> >> representation (one that can

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-26 Thread Paul Moore
On 26 January 2011 12:30, Nick Coghlan wrote: > The PEP actually does define that already: > > PyUnicode_AsUTF8 populates the utf8 field of the existing string, > while PyUnicode_AsUTF8String creates a *new* string with that field > populated. > > PyUnicode_AsUnicode will populate the wstr field (

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-26 Thread Nick Coghlan
On Wed, Jan 26, 2011 at 11:50 AM, Dj Gilcrease wrote: > On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg wrote: >> I also don't see how this could save a lot of memory. As an example >> take a French text with say 10mio code points. This would end up >> appearing in memory as 3 copies on Windows: o

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Dj Gilcrease
On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg wrote: > I also don't see how this could save a lot of memory. As an example > take a French text with say 10mio code points. This would end up > appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB), > one as Latin-1 (10MB) and o

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Antoine Pitrou
On Tue, 25 Jan 2011 21:08:01 +1000 Nick Coghlan wrote: > > One change I would propose is that rather than hiding flags in the low > order bits of the str pointer, we expand the use of the existing > "state" field to cover the representation information in addition to > the interning information.

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Antoine Pitrou
For the record: > I also don't see how this could save a lot of memory. As an example > take a French text with say 10mio code points. This would end up > appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB), > one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, de
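
The per-representation sizes quoted above can be checked for any given text by encoding it and counting bytes. The sketch below is an illustration only: the function name and the sample sentence are made up, and "utf-16-le" is used as a stand-in for UCS2, which matches it for text without non-BMP characters.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte count of text in each representation discussed."""
    return {
        "code_points": len(text),
        # utf-16-le equals UCS2 as long as the text has no non-BMP characters
        "ucs2_bytes": len(text.encode("utf-16-le")),
        "latin1_bytes": len(text.encode("latin-1")),
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Hypothetical sample; every character fits in Latin-1.
sample = "L'été dernier, nous avons visité la forêt."
for name, size in encoded_sizes(sample).items():
    print(name, size)
```

UCS2 always costs two bytes per code point and Latin-1 one, while the UTF-8 figure depends on how many accented characters the text contains, so it is worth measuring on representative data rather than estimating.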

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread M.-A. Lemburg
I'll comment more on this later this week... From my first impression, I'm not too thrilled by the prospect of making the Unicode implementation more complicated by having three different representations on each object. I also don't see how this could save a lot of memory. As an example take a F

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Nick Coghlan
On Tue, Jan 25, 2011 at 6:17 AM, "Martin v. Löwis" wrote: > A new function PyUnicode_AsUTF8 is provided to access the UTF-8 > representation. It is thus identical to the existing > _PyUnicode_AsString, which is removed. The function will compute the > utf8 representation when first called. Since t
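
The "compute the utf8 representation when first called" behaviour described in the quoted PEP text is a compute-once cache. A minimal Python sketch of that pattern follows; the class, its attributes, and the call counter are hypothetical and are not the C API or the actual object layout.

```python
class LazyUtf8String:
    """Hypothetical wrapper that encodes to UTF-8 once and then reuses it."""

    def __init__(self, text: str):
        self.text = text
        self._utf8 = None       # stays empty until first requested
        self.encode_calls = 0   # only here to make the caching visible

    def as_utf8(self) -> bytes:
        # Encode on the first call, return the stored bytes afterwards.
        if self._utf8 is None:
            self.encode_calls += 1
            self._utf8 = self.text.encode("utf-8")
        return self._utf8


s = LazyUtf8String("h\u00e9llo")
first = s.as_utf8()
second = s.as_utf8()
print(first is second, s.encode_calls)
```

The trade-off is the one debated in the thread: the cached copy avoids repeated encoding, but it stays in memory alongside the primary representation for as long as the string lives.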

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-24 Thread David Malcolm
On Mon, 2011-01-24 at 21:17 +0100, "Martin v. Löwis" wrote: ... snip ... > I'd like to propose PEP 393, which takes a different approach, > addressing both problems simultaneously: by getting a flexible > representation (one that can be either 1, 2, or 4 bytes), we can > support the full range of

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-24 Thread Antoine Pitrou
On Tuesday 25 January 2011 at 00:07 +0100, "Martin v. Löwis" wrote: > >> I'd like to propose PEP 393, which takes a different approach, > >> addressing both problems simultaneously: by getting a flexible > >> representation (one that can be either 1, 2, or 4 bytes), we can > >> support the full ran

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-24 Thread Martin v. Löwis
>> I'd like to propose PEP 393, which takes a different approach, >> addressing both problems simultaneously: by getting a flexible >> representation (one that can be either 1, 2, or 4 bytes), we can >> support the full range of Unicode on all systems, but still use >> only one byte per character f
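
As a rough illustration of the "1, 2, or 4 bytes" idea quoted above, the width a string needs can be derived from its highest code point. This is a sketch, not the PEP's implementation: the function name is hypothetical and the cutoffs (below 0x100 for one byte, below 0x10000 for two) are assumptions consistent with one-, two- and four-byte code units.

```python
def storage_width(text: str) -> int:
    """Return the bytes per character a 1/2/4-byte scheme would need."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:      # assumed cutoff: every character fits in one byte
        return 1
    if highest < 0x10000:    # assumed cutoff: every character fits in two bytes
        return 2
    return 4                 # at least one character outside the BMP


for sample in ("hello", "na\u00efve", "\u20ac100", "\U0001F600"):
    print(repr(sample), storage_width(sample))
```

A single wide character forces the wider unit for the whole string, which is why the savings depend on how common purely narrow strings are in a given application.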

Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-24 Thread Antoine Pitrou
On Mon, 24 Jan 2011 21:17:34 +0100 "Martin v. Löwis" wrote: > I have been thinking about Unicode representation for some time now. > This was triggered, on the one hand, by discussions with Glyph Lefkowitz > (who complained that his server app consumes too much memory), and Carl > Friedrich Bolz (

[Python-Dev] PEP 393: Flexible String Representation

2011-01-24 Thread Martin v. Löwis
I have been thinking about Unicode representation for some time now. This was triggered, on the one hand, by discussions with Glyph Lefkowitz (who complained that his server app consumes too much memory), and Carl Friedrich Bolz (who profiled Python applications to determine that Unicode strings ar