"Jim Jewett" <[EMAIL PROTECTED]> wrote:
> On 10/3/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > Jim Jewett schrieb:
> > > By knowing that there is only one possible representation for a given
> > > string, he skips the equivalency cache. On the other hand, he also
> > > loses the equivalen
On 10/3/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Jim Jewett schrieb:
> > By knowing that there is only one possible representation for a given
> > string, he skips the equivalency cache. On the other hand, he also
> > loses the equivalency cache.
> What is an equivalency cache, and why
Jim Jewett schrieb:
> By knowing that there is only one possible representation for a given
> string, he skips the equivalency cache. On the other hand, he also
> loses the equivalency cache.
What is an equivalency cache, and why would one like to have one?
> When python 2.x chooses the unicode
On 10/3/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Jim Jewett schrieb:
> > In python 3, a string object might look like
> > #define PyObject_str_HEAD \
> >     PyObject_VAR_HEAD \
> >     long ob_shash; \
> >     PyObject *cache;
> > with a typical concrete implementation looking like
Jim Jewett schrieb:
> In python 3, a string object might look like
>
> #define PyObject_str_HEAD \
>     PyObject_VAR_HEAD \
>     long ob_shash; \
>     PyObject *cache;
>
> with a typical concrete implementation looking like
>
> typedef struct {
>     PyObject_str_HEAD
>     PyObject *encodin
On 10/3/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Jim Jewett schrieb:
> > The problem isn't the hash; it is the equality. Which encoding do you
> > keep interned?
When I wrote this, I had been assuming that UCS4(string) and
UCS2(string) would be completely unrelated objects. With more
Jim Jewett schrieb:
>>> Interning may get awkward if multiple encodings are allowed within a
>>> program, regardless of whether they're allowed for single strings. It
>>> might make sense to intern only strings that are in the same encoding
>>> as the source code. (Or whose values are limited to
Martin v. Löwis wrote:
> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4).
#define UnicodeStringComparisonFunction(TYPE1, TYPE2) \
/* code to implement it here */
Unico
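A rough sketch of the width-pair point, in Python rather than C: if every buffer is fixed-width and each unit holds exactly one code point, one generic comparison covers all six pairs, which is what a per-(TYPE1, TYPE2) macro would otherwise stamp out. The names `to_units` and `compare_units` are invented for illustration and are not part of any CPython API.

```python
from array import array

# Hypothetical mapping from bytes-per-character to an array typecode.
# 'B' holds at least 1 byte, 'H' at least 2, 'L' at least 4.
TYPECODES = {1: "B", 2: "H", 4: "L"}


def to_units(text, width):
    """Store each code point of text in one fixed-width unit."""
    return array(TYPECODES[width], [ord(ch) for ch in text])


def compare_units(a, b):
    """Return -1, 0 or 1, comparing code point by code point."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # The common prefix is equal, so the shorter buffer sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))
```

The loop never looks at the unit size, so the (1,2) and (2,1) cases fall out of the same code. In C the widths are static types, which is why a macro or six hand-written variants come up at all.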
"Martin v. Löwis" <[EMAIL PROTECTED]> writes:
> Just try implementing comparison some time. You can end up implementing
> the same algorithm six times at least, once for each pair (1,1), (1,2),
> (1,4), (2,2), (2,4), (4,4). If the algorithm isn't symmetric (i.e.
> you can't reduce (2,1) to (1,2)),
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> Nick Coghlan schrieb:
> > If an 8-bit encoding other than latin-1 is used for the internal buffer,
> > then every comparison operation would have to decode the string to
> > Unicode in order to compare code points.
> >
> > It seems much simpler to
Nick Coghlan schrieb:
> If an 8-bit encoding other than latin-1 is used for the internal buffer,
> then every comparison operation would have to decode the string to
> Unicode in order to compare code points.
>
> It seems much simpler to me to ensure that what is stored internally is
> *always* th
Martin v. Löwis wrote:
> Nick Coghlan schrieb:
>> The choice of latin-1 is deliberate and non-arbitrary. The reason for the
>> choice is that the ordinals 0-255 in latin-1 map to the Unicode code points
>> 0-255:
>
> That's true, but that this makes a good choice for a special case
> doesn't fol
Marcin 'Qrczak' Kowalczyk schrieb:
>> You could play tricks with ob_size to save this field:
>>
>> - ob_size < 0: 8-bit data; length is abs(ob_size)
>> - ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
>> - ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2
>
> I wonde
Nick Coghlan schrieb:
> The choice of latin-1 is deliberate and non-arbitrary. The reason for the
> choice is that the ordinals 0-255 in latin-1 map to the Unicode code points
> 0-255:
That's true, but that this makes a good choice for a special case
doesn't follow. Instead, frequency of occurre
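The quoted property of latin-1 can be checked directly. This is only a small illustration using the `bytes` and `str` types of a current Python 3, which postdate this thread; `fits_latin1` is a made-up helper name.

```python
# Every byte value 0-255 decodes under latin-1 to the code point
# with the same number.
raw = bytes(range(256))
decoded = raw.decode("latin-1")
same_ordinals = all(ord(ch) == byte for ch, byte in zip(decoded, raw))


def fits_latin1(text):
    """True if every code point of text fits in a single latin-1 byte."""
    return all(ord(ch) <= 0xFF for ch in text)
```

This shows why widening a latin-1 buffer to UCS-2 or UCS-4 needs no lookup table. It says nothing about how often real text fits in that range, which is the objection raised in the reply.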
Josiah Carlson schrieb:
>> That places a burden on all creators of strings to ensure
>> that they are in the minimal format, which could be
>> inconvenient for some operations, e.g. taking a substring
>> could require making an extra pass to re-code the data.
>
> If Martin says it's not a big deal
"Martin v. Löwis" <[EMAIL PROTECTED]> writes:
> You could play tricks with ob_size to save this field:
>
> - ob_size < 0: 8-bit data; length is abs(ob_size)
> - ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
> - ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2
I wo
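The ob_size scheme quoted above can be modelled in a few lines. This is a sketch under stated assumptions, not CPython code: `pack_ob_size` and `unpack_ob_size` are hypothetical names, and the handling of the empty string is my own guess, since the quoted rules only cover ob_size < 0 and ob_size > 0.

```python
def pack_ob_size(width, length):
    """Fold bytes-per-character (1, 2 or 4) and length into one integer."""
    if length == 0:
        return 0  # assumption: the quoted scheme does not say what 0 means
    if width == 1:
        return -length
    if width == 2:
        return 2 * length       # positive and even
    if width == 4:
        return 2 * length + 1   # positive and odd
    raise ValueError("width must be 1, 2 or 4")


def unpack_ob_size(ob_size):
    """Recover (width, length) from the packed value."""
    if ob_size <= 0:
        return 1, -ob_size      # 8-bit data; 0 is treated as empty 8-bit
    if (ob_size & 1) == 0:
        return 2, ob_size // 2  # 16-bit data
    return 4, ob_size // 2      # 32-bit data
```

The trade-off is that every reader of ob_size has to go through the unpacking step, and the maximum length for 16-bit and 32-bit data is halved.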
Greg Ewing <[EMAIL PROTECTED]> writes:
> That places a burden on all creators of strings to ensure
> that they are in the minimal format, which could be
> inconvenient for some operations, e.g. taking a substring
> could require making an extra pass to re-code the data.
Yes, but taking a substrin
Greg Ewing <[EMAIL PROTECTED]> wrote:
>
> Josiah Carlson wrote:
> > Because all text objects are internally
> > represented in their minimal 'encoding', equal text objects will always be
> > in the same encoding.
>
> That places a burden on all creators of strings to ensure
> that they are in the
Nick Coghlan schrieb:
> That way the internal representation of a string would only need to grow
> one extra field (the one saying how many bytes there are per character),
> and the internal state would remain immutable.
You could play tricks with ob_size to save this field:
- ob_size < 0: 8-bit
On Sep 15, 2006, at 7:04 PM, Jim Jewett wrote:
> On 9/15/06, Nick Coghlan <[EMAIL PROTECTED]> wrote:
> > Jim Jewett wrote:
> > >> ... would be necessary to at least *scan* the string when it
> > >> was first created in order to ensure it can be decoded without
> > >> errors
> > > What happens today with strings? I th
Antoine Pitrou wrote:
> Le vendredi 15 septembre 2006 à 10:48 -0700, Josiah Carlson a écrit :
>> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
>
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".
> There are
Jim Jewett wrote:
> On 9/15/06, Nick Coghlan <[EMAIL PROTECTED]> wrote:
>> If you're reading text and you *know* it is ASCII data, then you can
>> just set the encoding to latin-1
>
> Only if latin-1 is a valid encoding for the internal implementation.
I think the possible internal encodings
Josiah Carlson wrote:
> Because all text objects are internally
> represented in their minimal 'encoding', equal text objects will always be
> in the same encoding.
That places a burden on all creators of strings to ensure
that they are in the minimal format, which could be
inconvenient for some ope
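The extra pass mentioned here can be made concrete with a small sketch. It assumes the three widths discussed in the thread (latin-1, UCS-2, UCS-4); `minimal_width` and `substring` are illustrative names only.

```python
def minimal_width(text):
    """Smallest bytes-per-character (1, 2 or 4) that can hold text."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def substring(text, start, stop):
    """Slice, then rescan the result to find its own minimal width."""
    piece = text[start:stop]
    # This scan is the extra pass: the parent's width is only an upper
    # bound, because the wide characters may lie outside the slice.
    return piece, minimal_width(piece)
```

For example, slicing the first two characters out of a string that ends in a euro sign yields a piece that fits in one byte per character, although the parent needed two.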
"Jim Jewett" <[EMAIL PROTECTED]> wrote:
> On 9/15/06, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> > "Jim Jewett" <[EMAIL PROTECTED]> wrote:
> > > Interning may get awkward if multiple encodings are allowed within a
> > > program, regardless of whether they're allowed for single strings. It
> > >
On 9/15/06, Josiah Carlson <[EMAIL PROTECTED]> wrote:
>
> "Jim Jewett" <[EMAIL PROTECTED]> wrote:
> > Interning may get awkward if multiple encodings are allowed within a
> > program, regardless of whether they're allowed for single strings. It
> > might make sense to intern only strings that are
"Paul Prescod" <[EMAIL PROTECTED]> wrote:
[snip]
> The result seems obvious to me...8-bit-fixed encodings are a terrible idea
> and need to just go away. Let's not build them into Python's core on the
> basis of a minor and fleeting performance improvement.
Variable-width encodings make many oper
On 9/15/06, Antoine Pitrou <[EMAIL PROTECTED]> wrote:
Le vendredi 15 septembre 2006 à 10:48 -0700, Josiah Carlson a écrit :
> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
You could replace "latin-1" with "one-byte system encoding chosen at
interpreter startup depending on
Antoine Pitrou <[EMAIL PROTECTED]> writes:
>> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
>
> You could replace "latin-1" with "one-byte system encoding chosen at
> interpreter startup depending on locale".
Latin-1 has the advantage of being trivially decodable to a se
Le vendredi 15 septembre 2006 à 10:48 -0700, Josiah Carlson a écrit :
> This is one of the reasons why I was talking Latin-1, UCS-2, and UCS-4:
You could replace "latin-1" with "one-byte system encoding chosen at
interpreter startup depending on locale".
There are lots of 8-bit encodings other tha
"Jason Orendorff" <[EMAIL PROTECTED]> wrote:
>
> On 9/15/06, Jim Jewett <[EMAIL PROTECTED]> wrote:
> > There should be only one reference to a string until it is constructed,
> > and after that, its data should be immutable. Recoding that results
> > in different bytes should not be in-place. Eith
"Jim Jewett" <[EMAIL PROTECTED]> wrote:
> Interning may get awkward if multiple encodings are allowed within a
> program, regardless of whether they're allowed for single strings. It
> might make sense to intern only strings that are in the same encoding
> as the source code. (Or whose values ar
On 9/15/06, Nick Coghlan <[EMAIL PROTECTED]> wrote:
> Jim Jewett wrote:
> >> ... would be necessary to at least *scan* the string when it
> >> was first created in order to ensure it can be decoded without errors
> > What happens today with strings? I think the answer is:
> > "Nothing.
> >
On 9/15/06, Jason Orendorff <[EMAIL PROTECTED]> wrote:
> I'm sure this will happen to the same degree that it's become a
> standard recipe in Java and C# (both of which lack polymorphic
> whatzits). Which is to say, not at all.
I think Jason's point is key. This is probably premature optimization and shoul
On 9/15/06, Jim Jewett <[EMAIL PROTECTED]> wrote:
> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable. Recoding that results
> in different bytes should not be in-place. Either it returns a new
> string (no problem) or it doesn't c
Jim Jewett wrote:
>> > ISTM that raising the exception lazily (which seems to be necessary)
>> > would be very confusing.
>
>> Yeah, it appears it would be necessary to at least *scan* the string
>> when it was first created in order to ensure it can be decoded without
>> errors later on.
>
On 9/15/06, Nick Coghlan <[EMAIL PROTECTED]> wrote:
> Martin v. Löwis wrote:
> > Nick Coghlan schrieb:
> >> Only the first such call on a given string, though - the idea is to use
> >> lazy decoding, not to avoid decoding altogether. Most manipulations
> >> (len, indexing, slicing, concatenation, e
Martin v. Löwis wrote:
> Nick Coghlan schrieb:
>> Only the first such call on a given string, though - the idea is to use
>> lazy decoding, not to avoid decoding altogether. Most manipulations
>> (len, indexing, slicing, concatenation, etc) would require decoding to
>> at least UCS-2 (or perhaps UC
Nick Coghlan schrieb:
> Only the first such call on a given string, though - the idea is to use
> lazy decoding, not to avoid decoding altogether. Most manipulations
> (len, indexing, slicing, concatenation, etc) would require decoding to
> at least UCS-2 (or perhaps UCS-4).
Ok. Then my objection
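One way to picture the lazy decoding being discussed: keep the raw bytes and their encoding, and decode only when an operation such as len or indexing first needs code points. This is a toy model in current Python 3; the class `LazyText` and its `decode_count` attribute are invented for illustration.

```python
class LazyText:
    """Holds encoded bytes and decodes them on first use only."""

    def __init__(self, raw, encoding):
        self._raw = raw
        self._encoding = encoding
        self._decoded = None
        self.decode_count = 0  # illustration only: counts real decodes

    def _text(self):
        if self._decoded is None:
            # Invalid input surfaces here, not at construction time.
            self._decoded = self._raw.decode(self._encoding)
            self.decode_count += 1
        return self._decoded

    def __len__(self):
        return len(self._text())

    def __getitem__(self, index):
        return self._text()[index]
```

Only the first call pays for decoding. The sketch also shows the cost raised elsewhere in the thread: bad input raises an exception at the first len or index call, far from where the object was created.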
On 9/14/06, Josiah Carlson <[EMAIL PROTECTED]> wrote:
>
> "Bob Ippolito" <[EMAIL PROTECTED]> wrote:
> > The argument for UTF-8 is probably interop efficiency. Lots of C
> > libraries, file formats, and wire protocols use UTF-8 for interchange.
> > Verifying the validity of UTF-8 during string creat
"Bob Ippolito" <[EMAIL PROTECTED]> wrote:
> The argument for UTF-8 is probably interop efficiency. Lots of C
> libraries, file formats, and wire protocols use UTF-8 for interchange.
> Verifying the validity of UTF-8 during string creation isn't that big
> of a deal.
Indeed, UTF-8 validation/creat
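Validating UTF-8 at creation time might look like the following in current Python 3. It is a sketch, not a proposal from the thread: `validated_utf8` is a hypothetical helper, and it relies only on the strict error handling that `bytes.decode` uses by default.

```python
def validated_utf8(raw):
    """Return raw unchanged if it is valid UTF-8, else raise ValueError."""
    try:
        raw.decode("utf-8")  # strict by default; result is discarded
    except UnicodeDecodeError as exc:
        raise ValueError(f"not valid UTF-8 at byte {exc.start}") from exc
    return bytes(raw)
```

The bytes are kept as-is for interchange, so the only up-front cost is one scan, and errors are reported where the string is created.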
On 9/14/06, Josiah Carlson <[EMAIL PROTECTED]> wrote:
>
> "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> wrote:
> > Nick Coghlan <[EMAIL PROTECTED]> writes:
> >
> > > Only the first such call on a given string, though - the idea
> > > is to use lazy decoding, not to avoid decoding altogether.
> >
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> wrote:
> Nick Coghlan <[EMAIL PROTECTED]> writes:
>
> > Only the first such call on a given string, though - the idea
> > is to use lazy decoding, not to avoid decoding altogether.
> > Most manipulations (len, indexing, slicing, concatenation, etc)
> Only the first such call on a given string, though - the idea is to use
> lazy decoding, not to avoid decoding altogether. Most manipulations
> (len, indexing, slicing, concatenation, etc) would require decoding to
> at least UCS-2 (or perhaps UCS-4).
My two cents:
For len() you can comput
Nick Coghlan <[EMAIL PROTECTED]> writes:
> Only the first such call on a given string, though - the idea
> is to use lazy decoding, not to avoid decoding altogether.
> Most manipulations (len, indexing, slicing, concatenation, etc)
> would require decoding to at least UCS-2 (or perhaps UCS-4).
Si
Martin v. Löwis wrote:
> Jim Jewett schrieb:
>> Simply delegate such methods to a hidden per-encoding subclass.
>>
>> The UTF-8 methods will indeed be complex, unless the solution is
>> simply "someone called indexing/slicing/len, so I have to recode after
>> all."
>>
>> The Latin-1 encoding will h
Jim Jewett schrieb:
> Simply delegate such methods to a hidden per-encoding subclass.
>
> The UTF-8 methods will indeed be complex, unless the solution is
> simply "someone called indexing/slicing/len, so I have to recode after
> all."
>
> The Latin-1 encoding will have no such problem.
I'm not
On 9/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Jim Jewett schrieb:
> > Simply not encoding/decoding until required would save quite a bit of
> > time and space -- but then the object would need some way of
> > indicating which encoding it is in.
> Try implementing that some time. You'l
Jim Jewett schrieb:
> Simply not encoding/decoding until required would save quite a bit of
> time and space -- but then the object would need some way of
> indicating which encoding it is in.
Try implementing that some time. You'll find it will be incredibly
complex and unmaintainable. Start with
On 9/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > Should encoding be an attribute of the string?
> No. A Python string is a sequence of Unicode characters.
> Even if it was created by converting from some other encoding,
> that original encoding gets lost when doing the conversion
> (ju
Jim Jewett schrieb:
>> For example, PyString_From{String[AndSize]|Format} would either:
>> - have to grow an encoding argument
>> - assume a default encoding (either ASCII or UTF-8)
>> - change its signature to operate on Py_UNICODE* (although
>> we don't have literals for these) or
>> - be remov
On 9/13/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Fredrik Lundh schrieb:
> > just noticed that PEP 3100 says that PyString_AsEncodedString and
> > PyString_AsDecodedString are to be removed, but it doesn't mention
> > any other PyString (or PyUnicode) functions.
> > how large changes can w
Fredrik Lundh schrieb:
> just noticed that PEP 3100 says that PyString_AsEncodedString and
> PyString_AsDecodedString are to be removed, but it doesn't mention
> any other PyString (or PyUnicode) functions.
>
> how large changes can we make here, really ?
All API that refers to the internal repres
On 9/1/06, Fredrik Lundh <[EMAIL PROTECTED]> wrote:
> just noticed that PEP 3100 says that PyString_AsEncodedString and
> PyString_AsDecodedString are to be removed, but it doesn't mention
> any other PyString (or PyUnicode) functions.
>
> how large changes can we make here, really ?
I don't know i
just noticed that PEP 3100 says that PyString_AsEncodedString and
PyString_AsDecodedString are to be removed, but it doesn't mention
any other PyString (or PyUnicode) functions.
how large changes can we make here, really ?
(I'm not going to sketch on a concrete proposal here; I'm more interested
i