Re: [Python-3000] string C API

2006-10-03 Thread Josiah Carlson
"Jim Jewett" <[EMAIL PROTECTED]> wrote: > On 10/3/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > Jim Jewett schrieb: > > > By knowing that there is only one possible representation for a given > > > string, he skips the equivalency cache. On the other hand, he also > > > loses the equivalen

Re: [Python-3000] string C API

2006-10-03 Thread Jim Jewett
On 10/3/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > Jim Jewett schrieb: > > By knowing that there is only one possible representation for a given > > string, he skips the equivalency cache. On the other hand, he also > > loses the equivalency cache. > What is an equivalency cache, and why

Re: [Python-3000] string C API

2006-10-03 Thread Martin v. Löwis
Jim Jewett schrieb: > By knowing that there is only one possible representation for a given > string, he skips the equivalency cache. On the other hand, he also > loses the equivalency cache. What is an equivalency cache, and why would one like to have one? > When python 2.x chooses the unicode

Re: [Python-3000] string C API

2006-10-03 Thread Jim Jewett
On 10/3/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > Jim Jewett schrieb: > > In python 3, a string object might look like > > #define PyObject_str_HEAD \ > >PyObject_VAR_HEAD \ > >long ob_shash; \ > >PyObject *cache; > > with a typical concrete implementation looking like

Re: [Python-3000] string C API

2006-10-03 Thread Martin v. Löwis
Jim Jewett schrieb: > In python 3, a string object might look like > > #define PyObject_str_HEAD \ >PyObject_VAR_HEAD \ >long ob_shash; \ >PyObject *cache; > > with a typical concrete implementation looking like > > typedef struct { >PyObject_str_HEAD >PyObject *encodin

Re: [Python-3000] string C API

2006-10-03 Thread Jim Jewett
On 10/3/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > Jim Jewett schrieb: > > The problem isn't the hash; it is the equality. Which encoding do you > > keep interned? When I wrote this, I had been assuming that UCS4(string) and UCS2(string) would be completely unrelated objects. With more

Re: [Python-3000] string C API

2006-10-03 Thread Martin v. Löwis
Jim Jewett schrieb: >>> Interning may get awkward if multiple encodings are allowed within a >>> program, regardless of whether they're allowed for single strings. It >>> might make sense to intern only strings that are in the same encoding >>> as the source code. (Or whose values are limited to

Re: [Python-3000] string C API

2006-09-16 Thread Greg Ewing
Martin v. Löwis wrote: > Just try implementing comparison some time. You can end up implementing > the same algorithm six times at least, once for each pair (1,1), (1,2), > (1,4), (2,2), (2,4), (4,4). #define UnicodeStringComparisonFunction(TYPE1, TYPE2) \ /* code to implement it here */ Unico
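
The macro idea above can be sketched as follows. This is only an illustration under stated assumptions: the macro name `DEFINE_UNICODE_COMPARE` and the generated function names are hypothetical, not part of any CPython API, and the comparison is by raw code unit, which matches code point order only for UCS-2/UCS-4 data without surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expands to one comparison function for a given
   pair of character widths. All names here are illustrative only. */
#define DEFINE_UNICODE_COMPARE(NAME, TYPE1, TYPE2)                      \
    int NAME(const TYPE1 *a, size_t alen, const TYPE2 *b, size_t blen)  \
    {                                                                   \
        size_t n = alen < blen ? alen : blen;                           \
        assert(a != NULL && b != NULL);                                 \
        for (size_t i = 0; i < n; i++) {                                \
            /* Widen both units so mixed widths compare correctly. */   \
            uint32_t ca = (uint32_t)a[i];                               \
            uint32_t cb = (uint32_t)b[i];                               \
            if (ca != cb)                                               \
                return ca < cb ? -1 : 1;                                \
        }                                                               \
        /* Equal common prefix: the shorter string sorts first. */      \
        return (alen > blen) - (alen < blen);                           \
    }

/* One instantiation per pair listed in the thread. */
DEFINE_UNICODE_COMPARE(compare_1_1, uint8_t,  uint8_t)
DEFINE_UNICODE_COMPARE(compare_1_2, uint8_t,  uint16_t)
DEFINE_UNICODE_COMPARE(compare_1_4, uint8_t,  uint32_t)
DEFINE_UNICODE_COMPARE(compare_2_2, uint16_t, uint16_t)
DEFINE_UNICODE_COMPARE(compare_2_4, uint16_t, uint32_t)
DEFINE_UNICODE_COMPARE(compare_4_4, uint32_t, uint32_t)
```

Because this comparison is symmetric, a caller holding a (2,1) pair could swap the arguments and negate the result of `compare_1_2`; an asymmetric algorithm would need the remaining pairs instantiated as well.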

Re: [Python-3000] string C API

2006-09-16 Thread Marcin 'Qrczak' Kowalczyk
"Martin v. Löwis" <[EMAIL PROTECTED]> writes: > Just try implementing comparison some time. You can end up implementing > the same algorithm six times at least, once for each pair (1,1), (1,2), > (1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e. > you can't reduce (2,1) to (1,2)),

Re: [Python-3000] string C API

2006-09-16 Thread Josiah Carlson
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > Nick Coghlan schrieb: > > If an 8-bit encoding other than latin-1 is used for the internal buffer, > > then every comparison operation would have to decode the string to > > Unicode in order to compare code points. > > > > It seems much simpler to

Re: [Python-3000] string C API

2006-09-16 Thread Martin v. Löwis
Nick Coghlan schrieb: > If an 8-bit encoding other than latin-1 is used for the internal buffer, > then every comparison operation would have to decode the string to > Unicode in order to compare code points. > > It seems much simpler to me to ensure that what is stored internally is > *always* th

Re: [Python-3000] string C API

2006-09-16 Thread Nick Coghlan
Martin v. Löwis wrote: > Nick Coghlan schrieb: >> The choice of latin-1 is deliberate and non-arbitrary. The reason for the >> choice is that the ordinals 0-255 in latin-1 map to the Unicode code points >> 0-255: > > That's true, but that this makes a good choice for a special case > doesn't fol
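
The property quoted above (Latin-1 ordinals 0-255 coincide with Unicode code points 0-255) means that widening an 8-bit buffer needs no lookup table. A minimal sketch, assuming a hypothetical helper name `widen_latin1_to_ucs2` and a caller-provided destination buffer of at least `len` elements:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy Latin-1 bytes into 16-bit code units.
   Each byte value is already the Unicode code point, so widening is
   plain zero-extension. dst must have room for len elements. */
void widen_latin1_to_ucs2(const uint8_t *src, size_t len, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}
```

Any other 8-bit encoding would need a per-byte table lookup (or a full decoder) at the same point, which is the cost the thread is weighing.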

Re: [Python-3000] string C API

2006-09-16 Thread Martin v. Löwis
Marcin 'Qrczak' Kowalczyk schrieb: >> You could play tricks with ob_size to save this field: >> >> - ob_size < 0: 8-bit data; length is abs(ob_size) >> - ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2 >> - ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2 > > I wonde

Re: [Python-3000] string C API

2006-09-16 Thread Martin v. Löwis
Nick Coghlan schrieb: > The choice of latin-1 is deliberate and non-arbitrary. The reason for the > choice is that the ordinals 0-255 in latin-1 map to the Unicode code points > 0-255: That's true, but that this makes a good choice for a special case doesn't follow. Instead, frequency of occurre

Re: [Python-3000] string C API

2006-09-16 Thread Martin v. Löwis
Josiah Carlson schrieb: >> That places a burden on all creators of strings to ensure >> that they are in the minimal format, which could be >> inconvenient for some operations, e.g. taking a substring >> could require making an extra pass to re-code the data. > > If Martin says it's not a big deal

Re: [Python-3000] string C API

2006-09-16 Thread Marcin 'Qrczak' Kowalczyk
"Martin v. Löwis" <[EMAIL PROTECTED]> writes: > You could play tricks with ob_size to save this field: > > - ob_size < 0: 8-bit data; length is abs(ob_size) > - ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2 > - ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2 I wo
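
The ob_size scheme quoted above can be sketched as a pair of helpers. The names `str_layout`, `decode_ob_size` and `encode_ob_size` are hypothetical and do not exist in CPython. The quoted rules do not say what ob_size == 0 means, so treating it as an empty 8-bit string is an assumption made here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: character width in bytes plus length. */
typedef struct {
    int width;        /* 1, 2 or 4 bytes per character */
    ptrdiff_t length; /* number of characters */
} str_layout;

/* Recover width and length from a packed ob_size value. */
str_layout decode_ob_size(ptrdiff_t ob_size)
{
    str_layout r;
    if (ob_size < 0) {                 /* 8-bit data */
        r.width = 1;
        r.length = -ob_size;
    } else if (ob_size == 0) {         /* assumption: empty 8-bit string */
        r.width = 1;
        r.length = 0;
    } else if ((ob_size & 1) == 0) {   /* even: 16-bit data */
        r.width = 2;
        r.length = ob_size / 2;
    } else {                           /* odd: 32-bit data */
        r.width = 4;
        r.length = ob_size / 2;
    }
    return r;
}

/* Pack width and length into one value. An empty 16-bit string packs
   to 0 and therefore decodes as an empty 8-bit string. */
ptrdiff_t encode_ob_size(int width, ptrdiff_t length)
{
    assert(width == 1 || width == 2 || width == 4);
    assert(length >= 0);
    if (width == 1)
        return -length;
    if (width == 2)
        return 2 * length;
    return 2 * length + 1;
}
```

The trick saves one field per string, at the price of halving the maximum length for 16-bit and 32-bit data and of making every read of the length go through a decoding step.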

Re: [Python-3000] string C API

2006-09-16 Thread Marcin 'Qrczak' Kowalczyk
Greg Ewing <[EMAIL PROTECTED]> writes: > That places a burden on all creators of strings to ensure > that they are in the minimal format, which could be > inconvenient for some operations, e.g. taking a substring > could require making an extra pass to re-code the data. Yes, but taking a substrin

Re: [Python-3000] string C API

2006-09-16 Thread Josiah Carlson
Greg Ewing <[EMAIL PROTECTED]> wrote: > > Josiah Carlson wrote: > > Because all text objects are internally > > represented in its minimal 'encoding', equal text objects will always be > > in the same encoding. > > That places a burden on all creators of strings to ensure > that they are in the

Re: [Python-3000] string C API

2006-09-15 Thread Martin v. Löwis
Nick Coghlan schrieb: > That way the internal representation of a string would only need to grow > one extra field (the one saying how many bytes there are per character), > and the internal state would remain immutable. You could play tricks with ob_size to save this field: - ob_size < 0: 8-bit

Re: [Python-3000] string C API

2006-09-15 Thread Ronald Oussoren
On Sep 15, 2006, at 7:04 PM, Jim Jewett wrote: On 9/15/06, Nick Coghlan <[EMAIL PROTECTED]> wrote: Jim Jewett wrote: ... would be necessary to at least *scan* the string when it was first created in order to ensure it can be decoded without errors What happens today with strings? I th

Re: [Python-3000] string C API

2006-09-15 Thread Nick Coghlan
Antoine Pitrou wrote: > Le vendredi 15 septembre 2006 à 10:48 -0700, Josiah Carlson a écrit : >> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4: > > You could replace "latin-1" with "one-byte system encoding chosen at > interpreter startup depending on locale". > There are

Re: [Python-3000] string C API

2006-09-15 Thread Nick Coghlan
Jim Jewett wrote: > On 9/15/06, Nick Coghlan <[EMAIL PROTECTED]> wrote: >> If you're reading text and you *know* it is ASCII data, then you can >> just set >> the encoding to latin-1 > > Only if latin-1 is a valid encoding for the internal implementation. I think the possible internal encodings

Re: [Python-3000] string C API

2006-09-15 Thread Greg Ewing
Josiah Carlson wrote: > Because all text objects are internally > represented in its minimal 'encoding', equal text objects will always be > in the same encoding. That places a burden on all creators of strings to ensure that they are in the minimal format, which could be inconvenient for some ope
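
The extra pass mentioned above can be sketched as a scan for the widest character in the result. The function name `minimal_width` is hypothetical, and taking the input as an array of 32-bit code points is an assumption made for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the smallest character width (1, 2 or 4
   bytes) that can hold every code point in cp[0..len). */
int minimal_width(const uint32_t *cp, size_t len)
{
    uint32_t max = 0;
    assert(cp != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (cp[i] > max)
            max = cp[i];
    }
    if (max <= 0xFF)
        return 1;
    if (max <= 0xFFFF)
        return 2;
    return 4;
}
```

For the code points U+0041, U+20AC, U+0042 the whole string needs width 2, but the one-character slice holding only U+0041 needs width 1, so a substring operation that keeps the minimal-format invariant has to rescan its result and possibly re-code it.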

Re: [Python-3000] string C API

2006-09-15 Thread Josiah Carlson
"Jim Jewett" <[EMAIL PROTECTED]> wrote: > On 9/15/06, Josiah Carlson <[EMAIL PROTECTED]> wrote: > > "Jim Jewett" <[EMAIL PROTECTED]> wrote: > > > Interning may get awkward if multiple encodings are allowed within a > > > program, regardless of whether they're allowed for single strings. It > > >

Re: [Python-3000] string C API

2006-09-15 Thread Jim Jewett
On 9/15/06, Josiah Carlson <[EMAIL PROTECTED]> wrote: > > "Jim Jewett" <[EMAIL PROTECTED]> wrote: > > Interning may get awkward if multiple encodings are allowed within a > > program, regardless of whether they're allowed for single strings. It > > might make sense to intern only strings that are

Re: [Python-3000] string C API

2006-09-15 Thread Josiah Carlson
"Paul Prescod" <[EMAIL PROTECTED]> wrote: [snip] > The result seems obvious to me...8-bit-fixed encodings are a terrible idea > and need to just go away. Let's not build them into Python's core on the > basis of a minor and fleeting performance improvement. Variable-width encodings make many oper

Re: [Python-3000] string C API

2006-09-15 Thread Paul Prescod
On 9/15/06, Antoine Pitrou <[EMAIL PROTECTED]> wrote: Le vendredi 15 septembre 2006 à 10:48 -0700, Josiah Carlson a écrit : > This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4: You could replace "latin-1" with "one-byte system encoding chosen at interpreter startup depending on

Re: [Python-3000] string C API

2006-09-15 Thread Marcin 'Qrczak' Kowalczyk
Antoine Pitrou <[EMAIL PROTECTED]> writes: >> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4: > > You could replace "latin-1" with "one-byte system encoding chosen at > interpreter startup depending on locale". Latin-1 has the advantage of being trivially decodable to a se

Re: [Python-3000] string C API

2006-09-15 Thread Antoine Pitrou
Le vendredi 15 septembre 2006 à 10:48 -0700, Josiah Carlson a écrit : > This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4: You could replace "latin-1" with "one-byte system encoding chosen at interpreter startup depending on locale". There are lots of 8-bit encodings other tha

Re: [Python-3000] string C API

2006-09-15 Thread Josiah Carlson
"Jason Orendorff" <[EMAIL PROTECTED]> wrote: > > On 9/15/06, Jim Jewett <[EMAIL PROTECTED]> wrote: > > There should be only one reference to a string until it is constructed, > > and after that, its data should be immutable. Recoding that results > > in different bytes should not be in-place. Eith

Re: [Python-3000] string C API

2006-09-15 Thread Josiah Carlson
"Jim Jewett" <[EMAIL PROTECTED]> wrote: > Interning may get awkward if multiple encodings are allowed within a > program, regardless of whether they're allowed for single strings. It > might make sense to intern only strings that are in the same encoding > as the source code. (Or whose values ar

Re: [Python-3000] string C API

2006-09-15 Thread Jim Jewett
On 9/15/06, Nick Coghlan <[EMAIL PROTECTED]> wrote: > Jim Jewett wrote: > >> ... would be necessary to at least *scan* the string when it > >> was first created in order to ensure it can be decoded without errors > > What happens today with strings? I think the answer is: > > "Nothing. > >

Re: [Python-3000] string C API

2006-09-15 Thread Paul Prescod
On 9/15/06, Jason Orendorff <[EMAIL PROTECTED]> wrote: I'm sure this will happen to the same degree that it's become a standard recipe in Java and C# (both of which lack polymorphic whatzits). Which is to say, not at all. I think Jason's point is key. This is probably premature optimization and shoul

Re: [Python-3000] string C API

2006-09-15 Thread Jason Orendorff
On 9/15/06, Jim Jewett <[EMAIL PROTECTED]> wrote: > There should be only one reference to a string until it is constructed, > and after that, its data should be immutable. Recoding that results > in different bytes should not be in-place. Either it returns a new > string (no problem) or it doesn't c

Re: [Python-3000] string C API

2006-09-15 Thread Nick Coghlan
Jim Jewett wrote: >> > ISTM that raising the exception lazily (which seems to be necessary) >> > would be very confusing. > >> Yeah, it appears it would be necessary to at least *scan* the string >> when it >> was first created in order to ensure it can be decoded without errors >> later on. >

Re: [Python-3000] string C API

2006-09-15 Thread Jim Jewett
On 9/15/06, Nick Coghlan <[EMAIL PROTECTED]> wrote: > Martin v. Löwis wrote: > > Nick Coghlan schrieb: > >> Only the first such call on a given string, though - the idea is to use > >> lazy decoding, not to avoid decoding altogether. Most manipulations > >> (len, indexing, slicing, concatenation, e

Re: [Python-3000] string C API

2006-09-15 Thread Nick Coghlan
Martin v. Löwis wrote: > Nick Coghlan schrieb: >> Only the first such call on a given string, though - the idea is to use >> lazy decoding, not to avoid decoding altogether. Most manipulations >> (len, indexing, slicing, concatenation, etc) would require decoding to >> at least UCS-2 (or perhaps UC

Re: [Python-3000] string C API

2006-09-14 Thread Martin v. Löwis
Nick Coghlan schrieb: > Only the first such call on a given string, though - the idea is to use > lazy decoding, not to avoid decoding altogether. Most manipulations > (len, indexing, slicing, concatenation, etc) would require decoding to > at least UCS-2 (or perhaps UCS-4). Ok. Then my objection

Re: [Python-3000] string C API

2006-09-14 Thread Bob Ippolito
On 9/14/06, Josiah Carlson <[EMAIL PROTECTED]> wrote: > > "Bob Ippolito" <[EMAIL PROTECTED]> wrote: > > The argument for UTF-8 is probably interop efficiency. Lots of C > > libraries, file formats, and wire protocols use UTF-8 for interchange. > > Verifying the validity of UTF-8 during string creat

Re: [Python-3000] string C API

2006-09-14 Thread Josiah Carlson
"Bob Ippolito" <[EMAIL PROTECTED]> wrote: > The argument for UTF-8 is probably interop efficiency. Lots of C > libraries, file formats, and wire protocols use UTF-8 for interchange. > Verifying the validity of UTF-8 during string creation isn't that big > of a deal. Indeed, UTF-8 validation/creat

Re: [Python-3000] string C API

2006-09-14 Thread Bob Ippolito
On 9/14/06, Josiah Carlson <[EMAIL PROTECTED]> wrote: > > "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> wrote: > > Nick Coghlan <[EMAIL PROTECTED]> writes: > > > > > Only the first such call on a given string, though - the idea > > > is to use lazy decoding, not to avoid decoding altogether. > >

Re: [Python-3000] string C API

2006-09-14 Thread Josiah Carlson
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> wrote: > Nick Coghlan <[EMAIL PROTECTED]> writes: > > > Only the first such call on a given string, though - the idea > > is to use lazy decoding, not to avoid decoding altogether. > > Most manipulations (len, indexing, slicing, concatenation, etc)

Re: [Python-3000] string C API

2006-09-14 Thread Antoine
> Only the first such call on a given string, though - the idea is to use > lazy > decoding, not to avoid decoding altogether. Most manipulations (len, > indexing, > slicing, concatenation, etc) would require decoding to at least UCS-2 (or > perhaps UCS-4). My two cents: For len() you can comput
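
The message above is cut off, so what it goes on to propose for len() is not known here. One well-known way to count characters in a valid UTF-8 buffer without decoding it is to count the bytes that are not continuation bytes. A minimal sketch, not necessarily what the author had in mind; the name `utf8_length` is hypothetical and the input is assumed to be well-formed UTF-8:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a well-formed UTF-8 buffer.
   Continuation bytes have the bit pattern 10xxxxxx; every other byte
   starts a new code point. No validation is performed. */
size_t utf8_length(const unsigned char *s, size_t nbytes)
{
    size_t count = 0;
    assert(s != NULL || nbytes == 0);
    for (size_t i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

This is still a linear pass over the bytes, so it avoids allocating a decoded copy but does not make len() constant-time; indexing and slicing would need a similar scan.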

Re: [Python-3000] string C API

2006-09-14 Thread Marcin 'Qrczak' Kowalczyk
Nick Coghlan <[EMAIL PROTECTED]> writes: > Only the first such call on a given string, though - the idea > is to use lazy decoding, not to avoid decoding altogether. > Most manipulations (len, indexing, slicing, concatenation, etc) > would require decoding to at least UCS-2 (or perhaps UCS-4). Si

Re: [Python-3000] string C API

2006-09-14 Thread Nick Coghlan
Martin v. Löwis wrote: > Jim Jewett schrieb: >> Simply delegate such methods to a hidden per-encoding subclass. >> >> The UTF-8 methods will indeed be complex, unless the solution is >> simply "someone called indexing/slicing/len, so I have to recode after >> all." >> >> The Latin-1 encoding will h

Re: [Python-3000] string C API

2006-09-13 Thread Martin v. Löwis
Jim Jewett schrieb: > Simply delegate such methods to a hidden per-encoding subclass. > > The UTF-8 methods will indeed be complex, unless the solution is > simply "someone called indexing/slicing/len, so I have to recode after > all." > > The Latin-1 encoding will have no such problem. I'm not

Re: [Python-3000] string C API

2006-09-13 Thread Jim Jewett
On 9/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > Jim Jewett schrieb: > > Simply not encoding/decoding until required would save quite a bit of > > time and space -- but then the object would need some way of > > indicating which encoding it is in. > Try implementing that some time. You'l

Re: [Python-3000] string C API

2006-09-13 Thread Martin v. Löwis
Jim Jewett schrieb: > Simply not encoding/decoding until required would save quite a bit of > time and space -- but then the object would need some way of > indicating which encoding it is in. Try implementing that some time. You'll find it will be incredibly complex and unmaintainable. Start with

Re: [Python-3000] string C API

2006-09-13 Thread Jim Jewett
On 9/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > Should encoding be an attribute of the string? > No. A Python string is a sequence of Unicode characters. > Even if it was created by converting from some other encoding, > that original encoding gets lost when doing the conversion > (ju

Re: [Python-3000] string C API

2006-09-13 Thread Martin v. Löwis
Jim Jewett schrieb: >> For example, PyString_From{String[AndSize]|Format} would either: >> - have to grow an encoding argument >> - assume a default encoding (either ASCII or UTF-8) >> - change its signature to operate on Py_UNICODE* (although >> we don't have literals for these) or >> - be remov

Re: [Python-3000] string C API

2006-09-13 Thread Jim Jewett
On 9/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > Fredrik Lundh schrieb: > > just noticed that PEP 3100 says that PyString_AsEncodedString and > > PyString_AsDecodedString is to be removed, but it doesn't mention > > any other PyString (or PyUnicode) functions. > > how large changes can w

Re: [Python-3000] string C API

2006-09-12 Thread Martin v. Löwis
Fredrik Lundh schrieb: > just noticed that PEP 3100 says that PyString_AsEncodedString and > PyString_AsDecodedString is to be removed, but it doesn't mention > any other PyString (or PyUnicode) functions. > > how large changes can we make here, really ? All API that refers to the internal repres

Re: [Python-3000] string C API

2006-09-01 Thread Neal Norwitz
On 9/1/06, Fredrik Lundh <[EMAIL PROTECTED]> wrote: > just noticed that PEP 3100 says that PyString_AsEncodedString and > PyString_AsDecodedString is to be removed, but it doesn't mention > any other PyString (or PyUnicode) functions. > > how large changes can we make here, really ? I don't know i

[Python-3000] string C API

2006-09-01 Thread Fredrik Lundh
just noticed that PEP 3100 says that PyString_AsEncodedString and PyString_AsDecodedString is to be removed, but it doesn't mention any other PyString (or PyUnicode) functions. how large changes can we make here, really ? (I'm not going to sketch on a concrete proposal here; I'm more interested i