Re: [Python-Dev] PEP 393 Summer of Code Project
Torsten Becker, 24.08.2011 04:41:
Also, common, now simple, checks for "unicode->str == NULL" would look
more ambiguous with a union ("unicode->str.latin1 == NULL").
You could just add yet another field "any", i.e.
union {
    unsigned char* latin1;
    Py_UCS2* ucs2;
    Py_UCS4* ucs4;
    void* any;
} str;
That way, the above test becomes
if (!unicode->str.any)
or
if (unicode->str.any == NULL)
Or maybe even call it "initialised" to match the intended purpose:
if (!unicode->str.initialised)
That being said, I don't mind "unicode->str.latin1 == NULL" either, given
that it will (as mentioned by others) be hidden behind a macro most of the
time anyway.
Stefan
Re: [Python-Dev] FileSystemError or FilesystemError?
Nick Coghlan writes:

> Since I tend to use the one word 'filesystem' form myself (ditto for
> 'filename'), I'm +1 for FilesystemError, but I'm only -0 for
> FileSystemError (so I expect that will be the option chosen, given
> other responses).

I slightly prefer FilesystemError because it parses unambiguously. Cf. FileSystemError vs FileUserError.
Re: [Python-Dev] PEP 393 Summer of Code Project
On 8/23/2011 5:46 PM, Terry Reedy wrote: On 8/23/2011 6:20 AM, "Martin v. Löwis" wrote: Am 23.08.2011 11:46, schrieb Xavier Morel: Mostly ascii is pretty common for western-european languages (French, for instance, is probably 90 to 95% ascii). It's also a risk in english, when the writer "correctly" spells foreign words (résumé and the like). I know - I still question whether it is "extremely common" (so much as to justify a special case). I.e. on what application with what dataset would you gain what speedup, at the expense of what amount of extra lines, and potential slow-down for other datasets? [snip] In the PEP 393 approach, if the string has a two-byte representation, each character needs to widened to two bytes, and likewise for four bytes. So three separate copies of the unrolled loop would be needed, one for each target size. I fully support the declared purpose of the PEP, which I understand to be to have a full,correct Unicode implementation on all new Python releases without paying unnecessary space (and consequent time) penalties. I think the erroneous length, iteration, indexing, and slicing for strings with non-BMP chars in narrow builds needs to be fixed for future versions. I think we should at least consider alternatives to the PEP393 solution of double or quadrupling space if needed for at least one char. In utf16.py, attached to http://bugs.python.org/issue12729 I propose for consideration a prototype of different solution to the 'mostly BMP chars, few non-BMP chars' case. Rather than expand every character from 2 bytes to 4, attach an array cpdex of character (ie code point, not code unit) indexes. Then for indexing and slicing, the correction is simple, simpler than I first expected: code-unit-index = char-index + bisect.bisect_left(cpdex, char_index) where code-unit-index is the adjusted index into the full underlying double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids most of the space penalty and the consequent time penalty of moving more bytes around and increasing cache misses. I believe the same idea would work for utf8 and the mostly-ascii case. The main difference is that non-ascii chars have various byte sizes rather than the 1 extra double-byte of non-BMP chars in UCS2 builds. So the offset correction would not simply be the bisect-left return but would require another lookup byte-index = char-index + offsets[bisect-left(cpdex, char-index)] If possible, I would have the with-index-array versions be separate subtypes, as in utf16.py. I believe either index-array implementation might benefit from a subtype for single multi-unit chars, as a single non-ASCII or non-BMP char does not need an auxiliary [0] array and a senseless lookup therein but does need its length fixed at 1 instead of the number of base array units. So am I correctly reading between the lines when, after reading this thread so far, and the complete issue discussion so far, that I see a PEP 393 revision or replacement that has the following characteristics: 1) Narrow builds are dropped. The conceptual idea of PEP 393 eliminates the need for narrow builds, as the internal string data structures adjust to the actuality of the data. If you want a narrow build, just don't use code points > 65535. 2) There are more, or different, internal kinds of strings, which affect the processing patterns. Here is an enumeration of the ones I can think of, as complete as possible, with recognition that benchmarking and clever algorithms may eliminate the need for some of them. 
a) all ASCII

b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This kind may not be able to support a "mostly" variation, and may be no more efficient than case b). But it might also be popular in parts of Europe :) And appropriate benchmarks may discover whether or not it has worth.

c) mostly ASCII (utf8) with clever indexing/caching to be efficient

d) UTF-8 with clever indexing/caching to be efficient

e) 16-bit codepoints

f) UTF-16 with clever indexing/caching to be efficient

g) 32-bit codepoints

h) UTF-32

When instantiating a str, a new parameter or subtype would restrict the implementation to using only a), b), d), f), and h) when fully conformant Unicode behavior is desired. No lone surrogates, no out of range code points, no illegal codepoints. A default str would prefer a), b), c), e), and g) for efficiency and flexibility. When manipulations outside of Unicode are necessary [Windows seems to use e) for example, suffering from the same sorts of backward compatibility problems as Python, in some ways], the default str type would permit them, using e) and g) kinds of representations.

Although the surrogate escape codec only uses prefix surrogates (or is it only suffix ones?) which would never match up, note that a conversion from 16-bit codepoints to other formats may produce matches between the results of the surrogate escape
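For illustration, the cpdex correction quoted above from the utf16.py proposal can be sketched in a few lines of Python. This is a minimal sketch of the idea only, not the prototype itself, and it assumes a build where iterating a str yields whole code points:

    import bisect

    def build_cpdex(s):
        # character (code point) indexes of the characters needing a surrogate pair
        return [i for i, ch in enumerate(s) if ord(ch) > 0xFFFF]

    def code_unit_index(cpdex, char_index):
        # every non-BMP character before char_index adds one extra UTF-16 code unit
        return char_index + bisect.bisect_left(cpdex, char_index)

    s = "a\U00010123b\U00010456c"
    cpdex = build_cpdex(s)                    # [1, 3]
    units = len(s.encode("utf-16-le")) // 2   # size of the underlying 2-byte array
    assert units == len(s) + len(cpdex)
    assert code_unit_index(cpdex, 2) == 3     # 'b' comes after one surrogate pair
    assert code_unit_index(cpdex, 4) == 6     # 'c' comes after two surrogate pairs

The utf8 variant described above would replace the implicit "+1 per preceding entry" with a lookup into a parallel offsets array, since non-ASCII characters contribute a variable number of extra bytes.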
Re: [Python-Dev] PEP 393 Summer of Code Project
Le 24/08/2011 04:41, Torsten Becker a écrit :
On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou wrote:
Macros are useful to shield the abstraction from the implementation. If
you access the members directly, and the unicode object is represented
differently in some future version of Python (say e.g. with tagged
pointers), your code doesn't compile anymore.
I agree with Antoine, from the experience of porting C code from 3.2
to the PEP 393 unicode API, the additional encapsulation by macros
made it much easier to change the implementation of what is a field,
what is a field's actual name, and what needs to be calculated through
a function.
So, I would like to keep primary access as a macro but I see the point
that it would make the struct clearer to access and I would not mind
changing the struct to use a union. But then most access currently is
through macros so I am not sure how much benefit the union would bring
as it mostly complicates the struct definition.
A union helps debugging in gdb: you don't have to cast manually to
unsigned char*/Py_UCS2*/Py_UCS4*.
Also, common, now simple, checks for "unicode->str == NULL" would look
more ambiguous with a union ("unicode->str.latin1 == NULL").
We can rename "str" to something else, to "data" for example.
Victor
Re: [Python-Dev] PEP 393 Summer of Code Project
Le 24/08/2011 06:59, Scott Dial a écrit :
> On 8/23/2011 6:38 PM, Victor Stinner wrote:
>> Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit :
>>> - You could try to run stringbench, which can be found at
>>> http://svn.python.org/projects/sandbox/trunk/stringbench (*)
>>> and there's iobench (the text mode benchmarks) in the Tools/iobench
>>> directory.
>>
>> Some raw numbers.
>>
>> stringbench:
>> "147.07 203.07 72.4 TOTAL" for the PEP 393
>> "146.81 140.39 104.6 TOTAL" for default
>> => PEP is 45% slower
>
> I ran the same benchmark and couldn't make a distinction in performance
> between them:

Hum, are you sure that you used the PEP 393 branch? Make sure that you are
using the pep-393 branch! I also started my benchmark on the wrong
branch :-)

Victor
Re: [Python-Dev] PEP 393 Summer of Code Project
Le 24/08/2011 04:41, Torsten Becker a écrit :
On Tue, Aug 23, 2011 at 18:27, Victor Stinner
wrote:
I posted a patch to re-add it:
http://bugs.python.org/issue12819#msg142867
Thank you for the patch! Note that this patch adds the fast path only
to the helper function which determines the length of the string and
the maximum character. The decoding part is still without a fast path
for ASCII runs.
Ah? If utf8_max_char_size_and_has_errors() returns no error and
maxchar=127: memcpy() is used. You mean that memcpy() is too slow? :-)
maxchar = utf8_max_char_size_and_has_errors(s, size, &unicode_size,
                                            &has_errors);
if (has_errors) {
    ...
}
else {
    unicode = (PyUnicodeObject *)PyUnicode_New(unicode_size, maxchar);
    if (!unicode) return NULL;
    /* When the string is ASCII only, just use memcpy and return. */
    if (maxchar < 128) {
        assert(unicode_size == size);
        Py_MEMCPY(PyUnicode_1BYTE_DATA(unicode), s, unicode_size);
        return (PyObject *)unicode;
    }
    ...
}
But yes, my patch only optimizes ASCII-only strings, not "mostly-ASCII"
strings (e.g. 100 ASCII + 1 latin1 character). It can be optimized
later. I didn't benchmark my patch.
Victor
Re: [Python-Dev] PEP 393 Summer of Code Project
> So am I correctly reading between the lines when, after reading this
> thread so far, and the complete issue discussion so far, that I see a
> PEP 393 revision or replacement that has the following characteristics:
>
> 1) Narrow builds are dropped.

PEP 393 already drops narrow builds.

> 2) There are more, or different, internal kinds of strings, which affect
> the processing patterns.

This is the basic idea of PEP 393.

> a) all ASCII
> b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This
> kind may not be able to support a "mostly" variation, and may be no more
> efficient than case b). But it might also be popular in parts of Europe

These two cases are already in PEP 393.

> c) mostly ASCII (utf8) with clever indexing/caching to be efficient
> d) UTF-8 with clever indexing/caching to be efficient

I see neither a need nor a means to consider these.

> e) 16-bit codepoints

These are in PEP 393.

> f) UTF-16 with clever indexing/caching to be efficient

Again, -1.

> g) 32-bit codepoints

This is in PEP 393.

> h) UTF-32

What's that, as opposed to g)?

I'm not open to revise PEP 393 in the direction of adding more
representations.

Regards,
Martin
Re: [Python-Dev] PEP 393 Summer of Code Project
Terry Reedy writes:

> The current UCS2 Unicode string implementation, by design, quickly gives
> WRONG answers for len(), iteration, indexing, and slicing if a string
> contains any non-BMP (surrogate pair) Unicode characters. That may have
> been excusable when there essentially were no such extended chars, and
> the few there were were almost never used.

Well, no, it gives the right answer according to the design. unicode objects do not contain character strings. By design, they contain code point strings. Guido has made that absolutely clear on a number of occasions. And the reasons have very little to do with lack of non-BMP characters to trip up the implementation. Changing those semantics should have been done before the release of Python 3.

It is not clear to me that it is a good idea to try to decide on "the" correct implementation of Unicode strings in Python even today. There are a number of approaches that I can think of.

1. The "too bad if you can't take a joke" approach: do nothing and recommend UTF-32 to those who want len() to DTRT.

2. The "slope is slippery" approach: Implement UTF-16 objects as built-ins, and then try to fend off requests for correct treatment of unnormalized composed characters, normalization, compatibility substitutions, bidi, etc etc.

3. The "are we not hackers?" approach: Implement a transform that maps characters that are not represented by a single code point into Unicode private space, and then see if anybody really needs more than 6400 non-BMP characters. (Note that this would generalize to composed characters that don't have a one-code-point NFC form and similar non-standardized cases that nonstandard users might want handled.)

4. The "42" approach: sadly, I can't think deeply enough to explain it.

There are probably others. It's true that Python is going to need good libraries to provide correct handling of Unicode strings (as opposed to unicode objects). But it's not clear to me given the wide variety of implementations I can imagine that there will be one best implementation, let alone which ones are good and Pythonic, and which not so.
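For concreteness, approach 3 could look roughly like the sketch below. This is purely illustrative, not something proposed in the thread; it simply hands out BMP Private Use Area code points U+E000..U+F8FF (which is where the 6400 figure comes from) on a first-come-first-served basis, and it ignores collisions with text that already uses the PUA:

    class PuaMapper:
        # Map non-BMP code points into the BMP Private Use Area (U+E000..U+F8FF).
        def __init__(self):
            self.forward = {}   # real character -> PUA stand-in
            self.backward = {}  # PUA stand-in -> real character
            self.next_pua = 0xE000

        def encode(self, s):
            out = []
            for ch in s:
                if ord(ch) > 0xFFFF:
                    pua = self.forward.get(ch)
                    if pua is None:
                        if self.next_pua > 0xF8FF:
                            raise ValueError("more than 6400 distinct non-BMP characters")
                        pua = chr(self.next_pua)
                        self.next_pua += 1
                        self.forward[ch] = pua
                        self.backward[pua] = ch
                    out.append(pua)
                else:
                    out.append(ch)
            return "".join(out)

        def decode(self, s):
            return "".join(self.backward.get(ch, ch) for ch in s)

    mapper = PuaMapper()
    bmp_only = mapper.encode("x\U0001D49C")   # the non-BMP character becomes U+E000
    assert mapper.decode(bmp_only) == "x\U0001D49C"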
Re: [Python-Dev] PEP 393 Summer of Code Project
Le 24/08/2011 04:56, Torsten Becker a écrit :
On Tue, Aug 23, 2011 at 18:56, Victor Stinner
wrote:
kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still
necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape().
If it can be removed, it would be nice to have kind in [0; 2] instead of kind
in [1; 3], to be able to have a list (of 3 items) => callback or label.
It is also used in PyUnicode_DecodeUTF8Stateful() and there might be
some cases which I missed converting checks for 0 when I introduced
the macro. The question was more if this should be written as 0 or as
a named constant. I preferred the named constant for readability.
An alternative would be to have kind values be the same as the number
of bytes for the string representation so it would be 0 (wstr), 1
(1-byte), 2 (2-byte), or 4 (4-byte).
Please don't do that: it's more common to need contiguous arrays (for a
jump table/callback list) than having to know the character size. You
can use an array giving the character size: CHARACTER_SIZE[kind] which
is the array {0, 1, 2, 4} (or maybe sizeof(wchar_t) instead of 0 ?).
I think the value for wstr/uninitialized/reserved should not be
removed. The wstr representation is still used in the error case in
the utf8 decoder because these strings can be resized.
In Python, you can resize an object if it has only one reference. Why is
it not possible in your branch?
Oh, I missed the UTF-8 decoder because you wrote "kind = 0": please, use
PyUnicode_WCHAR_KIND instead!
I don't like "reserved" value, especially if its value is 0, the first
value. See Microsoft file formats: they waste a lot of space because
most fields are reserved, and 10 years later, these fields are still
unused. Can't we add the value 4 when we will need a new kind?
Also having one
designated value for "uninitialized" limits comparisons in the
affected functions to the kind value, otherwise they would need to
check the str field for NULL to determine in which buffer to write a
character.
I have to read the code more carefully, I don't know this
"uninitialized" state.
For kind=0: "wstr" means that str is NULL but wstr is set? I didn't
understand that str can be NULL for an initialized string. I should read
the PEP again :-)
I suppose that compilers prefer a switch with all cases defined, 0 as the
first item, and contiguous values. We may need an enum.
During the Summer of Code, Martin and I did an experiment with GCC and
it did not seem to produce a jump table as an optimization for three
cases but generated comparison instructions anyway.
You mean with a switch with a case for each possible value? I don't
think that GCC knows that all cases are defined if you don't use an enum.
I am not sure how much we should optimize for potential compiler
optimizations here.
Oh, it was just a suggestion. Sure, it's not the best moment to care of
micro-optimizations.
Victor
Re: [Python-Dev] FileSystemError or FilesystemError?
On 24Aug2011 12:31, Nick Coghlan wrote: | On Wed, Aug 24, 2011 at 5:19 AM, Steven D'Aprano wrote: | > Antoine Pitrou wrote: | >> When reviewing the PEP 3151 implementation (*), Ezio commented that | >> "FileSystemError" looks a bit strange and that "FilesystemError" would | >> be a better spelling. What is your opinion? | > | > It's a file system (two words), not filesystem (not in any dictionary or | > spell checker I've ever used). | | I rarely find spell checkers to be useful sources of data on correct | spelling of technical jargon (and the computing usage of the term | 'filesystem' definitely qualifies as jargon). | | > (Nor do we write filingsystem, governmentsystem, politicalsystem or | > schoolsystem. This is English, not German.) | | Personally, I think 'filesystem' is a portmanteau in the process of | coming into existence (as evidenced by usage like 'FHS' standing for | 'Filesystem Hierarchy Standard'). However, the two word form is still | useful at times, particularly for disambiguation of acronyms (as | evidenced by usage like 'NFS' and 'GFS' for 'Network File System' and | 'Google File System'). Funny, I thought NFS stood for Not a File System :-) | Since I tend to use the one word 'filesystem' form myself (ditto for | 'filename'), I'm +1 for FilesystemError, but I'm only -0 for | FileSystemError (so I expect that will be the option chosen, given | other responses). I also use "filesystem" as a one word piece of jargon, but I am persuaded by the language arguments. So I'm +1 for FileSystemError. Cheers, -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ Bolts get me through times of no courage better than courage gets me through times of no bolts! - Eric Hirst ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On 8/24/2011 1:18 AM, "Martin v. Löwis" wrote:
>> So am I correctly reading between the lines when, after reading this
>> thread so far, and the complete issue discussion so far, that I see a
>> PEP 393 revision or replacement that has the following characteristics:
>>
>> 1) Narrow builds are dropped.
>
> PEP 393 already drops narrow builds.

I'd forgotten that.

>> 2) There are more, or different, internal kinds of strings, which affect
>> the processing patterns.
>
> This is the basic idea of PEP 393.

Agreed.

>> a) all ASCII
>> b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This
>> kind may not be able to support a "mostly" variation, and may be no more
>> efficient than case b). But it might also be popular in parts of Europe
>
> These two cases are already in PEP 393.

Sure. Wanted to enumerate all, rather than just add-ons.

>> c) mostly ASCII (utf8) with clever indexing/caching to be efficient
>> d) UTF-8 with clever indexing/caching to be efficient
>
> I see neither a need nor a means to consider these.

The discussion about "mostly ASCII" strings seems convincing that there could be a significant space savings if such were implemented.

>> e) 16-bit codepoints
>
> These are in PEP 393.
>
>> f) UTF-16 with clever indexing/caching to be efficient
>
> Again, -1.

This is probably the one I would pick as least likely to be useful if the rest were implemented.

>> g) 32-bit codepoints
>
> This is in PEP 393.
>
>> h) UTF-32
>
> What's that, as opposed to g)?

g) would permit codes greater than U+10FFFF and would permit the illegal codepoints and lone surrogates. h) would be strict Unicode conformance. Sorry that the 4 paragraphs of explanation that you didn't quote didn't make that clear.

> I'm not open to revise PEP 393 in the direction of adding more
> representations.

It's your PEP.
Re: [Python-Dev] PEP 393 Summer of Code Project
On 8/24/2011 4:11 AM, Victor Stinner wrote:
> Le 24/08/2011 06:59, Scott Dial a écrit :
>> On 8/23/2011 6:38 PM, Victor Stinner wrote:
>>> Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit :
>>>> - You could try to run stringbench, which can be found at
>>>> http://svn.python.org/projects/sandbox/trunk/stringbench (*)
>>>> and there's iobench (the text mode benchmarks) in the Tools/iobench
>>>> directory.
>>>
>>> Some raw numbers.
>>>
>>> stringbench:
>>> "147.07 203.07 72.4 TOTAL" for the PEP 393
>>> "146.81 140.39 104.6 TOTAL" for default
>>> => PEP is 45% slower
>>
>> I ran the same benchmark and couldn't make a distinction in performance
>> between them:
>
> Hum, are you sure that you used the PEP 393 branch? Make sure that you
> are using the pep-393 branch! I also started my benchmark on the wrong
> branch :-)

You are right. I used the "Get Source" link on bitbucket to save pulling the whole clone, but the "Get Source" link seems to be whatever branch has the latest revision (maybe?) even if you switch branches on the webpage. To correct my previous post:

cpython.txt                183.26  177.97  103.0  TOTAL
cpython-wide-unicode.txt   181.27  195.58   92.7  TOTAL
pep-393.txt                181.40  270.34   67.1  TOTAL

And,

cpython.txt                real  0m32.493s
cpython-wide-unicode.txt   real  0m33.489s
pep-393.txt                real  0m36.206s

-- 
Scott Dial
[email protected]
Re: [Python-Dev] PEP 393 Summer of Code Project
Am 24.08.2011 10:17, schrieb Victor Stinner:
> Le 24/08/2011 04:41, Torsten Becker a écrit :
>> On Tue, Aug 23, 2011 at 18:27, Victor Stinner wrote:
>>> I posted a patch to re-add it:
>>> http://bugs.python.org/issue12819#msg142867
>>
>> Thank you for the patch! Note that this patch adds the fast path only
>> to the helper function which determines the length of the string and
>> the maximum character. The decoding part is still without a fast path
>> for ASCII runs.
>
> Ah? If utf8_max_char_size_and_has_errors() returns no error and
> maxchar=127: memcpy() is used. You mean that memcpy() is too slow? :-)

No: the pure-ASCII case is already optimized with memcpy. It's the mostly-ASCII case that is not optimized anymore in this PEP 393 implementation (the one with "ASCII runs" instead of "pure ASCII").

Regards,
Martin
Re: [Python-Dev] PEP 393 Summer of Code Project
On 8/24/2011 4:22 AM, Stephen J. Turnbull wrote: Terry Reedy writes: > The current UCS2 Unicode string implementation, by design, quickly gives > WRONG answers for len(), iteration, indexing, and slicing if a string > contains any non-BMP (surrogate pair) Unicode characters. That may have > been excusable when there essentially were no such extended chars, and > the few there were were almost never used. Well, no, it gives the right answer according to the design. unicode objects do not contain character strings. Excuse me for believing the fine 3.2 manual that says "Strings contain Unicode characters." (And to a naive reader, that implies that string iteration and indexing should produce Unicode characters.) By design, they contain code point strings. For the purpose of my sentence, the same thing in that code points correspond to characters, where 'character' includes ascii control 'characters' and unicode analogs. The problem is that on narrow builds strings are NOT code point sequences. They are 2-byte code *unit* sequences. Single non-BMP code points are seen as 2 code units and hence given a length of 2, not 1. Strings iterate, index, and slice by 2-byte code units, not by code points. Python floats try to follow the IEEE standard as interpreted for Python (Python has its software exceptions rather than signalling versus non-signalling hardware signals). Python decimals slavishly follow the IEEE decimal standard. Python narrow build unicode breaks the standard for non-BMP code points and cosequently, breaks the re module even when it works for wide builds. As sys.maxunicode more or less says, only the BMP subset is fully supported. Any narrow build string with even 1 non-BMP char violates the standard. Guido has made that absolutely clear on a number of occasions. It is not clear what you mean, but recently on python-ideas he has reiterated that he intends bytes and strings to be conceptually different. Bytes are computer-oriented binary arrays; strings are supposedly human-oriented character/codepoint arrays. Except they are not for non-BMP characters/codepoints. Narrow build unicode is effectively an array of two-byte binary units. > And the reasons have very little to do with lack of non-BMP characters to trip up the implementation. Changing those semantics should have been done before the release of Python 3. The documentation was changed at least a bit for 3.0, and anyway, as indicated above, it is easy (especially for new users) to read the docs in a way that makes the current behavior buggy. I agree that the implementation should have been changed already. Currently, the meaning of Python code differs on narrow versus wide build, and in a way that few users would expect or want. PEP 393 abolishes narrow builds as we now know them and changes semantics. I was answering a complaint about that change. If you do not like the PEP, fine. My separate proposal in my other post is for an alternative implementation but with, I presume, pretty the same visible changes. It is not clear to me that it is a good idea to try to decide on "the" correct implementation of Unicode strings in Python even today. If the implementation is invisible to the Python user, as I believe it should be without specially introspection, and mostly invisible in the C-API except for those who intentionally poke into the details, then the implementation can be changed as the consensus on best implementation changes. There are a number of approaches that I can think of. 1. 
The "too bad if you can't take a joke" approach: do nothing and recommend UTF-32 to those who want len() to DTRT. 2. The "slope is slippery" approach: Implement UTF-16 objects as built-ins, and then try to fend off requests for correct treatment of unnormalized composed characters, normalization, compatibility substitutions, bidi, etc etc. 3. The "are we not hackers?" approach: Implement a transform that maps characters that are not represented by a single code point into Unicode private space, and then see if anybody really needs more than 6400 non-BMP characters. (Note that this would generalize to composed characters that don't have a one-code-point NFC form and similar non-standardized cases that nonstandard users might want handled.) 4. The "42" approach: sadly, I can't think deeply enough to explain it. There are probably others. It's true that Python is going to need good libraries to provide correct handling of Unicode strings (as opposed to unicode objects). Given that 3.0 unicode (string) objects are defined as Unicode character strings, I do not see the opposition. But it's not clear to me given the wide variety of implementations I can imagine that there will be one best implementation, let alone which ones are good and Pythonic, and which not so. -- Terry Jan Reedy _
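Terry's distinction between code points and code units is easy to see on a 3.2 narrow build (for example the Windows installers). The following sketch shows the expected behaviour on both build types; it is illustrative only, using sys.maxunicode to tell the builds apart:

    import sys

    s = "\U00010400"   # DESERET CAPITAL LETTER LONG I, a single non-BMP code point

    if sys.maxunicode == 0xFFFF:
        # Narrow build: the string holds two UTF-16 code units (a surrogate pair).
        assert len(s) == 2
        assert ord(s[0]) == 0xD801 and ord(s[1]) == 0xDC00
    else:
        # Wide build (and PEP 393): one code point, length 1.
        assert len(s) == 1 and s[0] == s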
Re: [Python-Dev] PEP 393 Summer of Code Project
>> I think the value for wstr/uninitialized/reserved should not be
>> removed. The wstr representation is still used in the error case in
>> the utf8 decoder because these strings can be resized.
>
> In Python, you can resize an object if it has only one reference. Why is
> it not possible in your branch?
If you use the new API to create a string (knowing how many characters
you have, and what the maximum character is), the Unicode object is
allocated as a single memory block. It can then not be resized.
If you allocate in the old style (i.e. giving NULL as the data pointer,
and a length), it still creates a second memory block for the
Py_UNICODE[], and allows resizing. When you then call PyUnicode_Ready,
the object gets frozen.
> I don't like "reserved" value, especially if its value is 0, the first
> value. See Microsoft file formats: they waste a lot of space because
> most fields are reserved, and 10 years later, these fields are still
> unused. Can't we add the value 4 when we will need a new kind?
I don't get the analogy, or the relationship with the value 0.
"Reserving" the value 0 is entirely different from reserving a field.
In a field, it wastes space; the value 0 however fills the same space
as the values 1,2,3. It's just used to denote an object where the
str pointer is not filled out yet, i.e. which can still be resized.
>>> I suppose that compilers prefer a switch with all cases defined, 0 a
>>> first item
>>> and contiguous values. We may need an enum.
>>
>> During the Summer of Code, Martin and I did a experiment with GCC and
>> it did not seem to produce a jump table as an optimization for three
>> cases but generated comparison instructions anyway.
>
> You mean with a switch with a case for each possible value?
No, a computed jump on the assembler level. Consider this code
enum kind { null, ucs1, ucs2, ucs4 };

void foo(void *d, enum kind k, int i, int v)
{
    switch (k) {
    case ucs1: ((unsigned char *)d)[i] = v; break;
    case ucs2: ((unsigned short *)d)[i] = v; break;
    case ucs4: ((unsigned int *)d)[i] = v; break;
    }
}
gcc 4.6.1 compiles this to
foo:
.LFB0:
        .cfi_startproc
        cmpl    $2, %esi
        je      .L4
        cmpl    $3, %esi
        je      .L5
        cmpl    $1, %esi
        je      .L7
        .p2align 4,,5
        rep
        ret
        .p2align 4,,10
        .p2align 3
.L7:
        movslq  %edx, %rdx
        movb    %cl, (%rdi,%rdx)
        ret
        .p2align 4,,10
        .p2align 3
.L5:
        movslq  %edx, %rdx
        movl    %ecx, (%rdi,%rdx,4)
        ret
        .p2align 4,,10
        .p2align 3
.L4:
        movslq  %edx, %rdx
        movw    %cx, (%rdi,%rdx,2)
        ret
        .cfi_endproc
As you can see, it generates a chain of compares, rather than an
indirect jump through a jump table.
Regards,
Martin
Re: [Python-Dev] FileSystemError or FilesystemError?
> When reviewing the PEP 3151 implementation (*), Ezio commented that
> "FileSystemError" looks a bit strange and that "FilesystemError" would
> be a better spelling. What is your opinion?
>
> (*) http://bugs.python.org/issue12555

+1 for FileSystemError

Eli
[Python-Dev] sendmsg/recvmsg on Mac OS X
The buildbots are complaining about some of the tests for the new socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that provide CMSG_LEN.

http://www.python.org/dev/buildbot/all/builders/AMD64%20Snow%20Leopard%202%203.x/builds/831/steps/test/logs/stdio

Before I start trying to figure this out without a Mac to test on, are any of the devs that actually use Mac OS X seeing the failure in their local builds?

Cheers,
Nick.

-- 
Nick Coghlan | [email protected] | Brisbane, Australia
Re: [Python-Dev] PEP 393 Summer of Code Project
On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
> In utf16.py, attached to http://bugs.python.org/issue12729
> I propose for consideration a prototype of different solution to the 'mostly
> BMP chars, few non-BMP chars' case. Rather than expand every character from
> 2 bytes to 4, attach an array cpdex of character (ie code point, not code
> unit) indexes. Then for indexing and slicing, the correction is simple,
> simpler than I first expected:
> code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
> where code-unit-index is the adjusted index into the full underlying
> double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
> most of the space penalty and the consequent time penalty of moving more
> bytes around and increasing cache misses.

Interesting idea, but putting on my C programmer hat, I say -1.

Non-uniform cell size = not a C array = standard C array manipulation idioms don't work = pain (no matter how simple the index correction happens to be).

The nice thing about PEP 393 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every character in the string.

Cheers,
Nick.

-- 
Nick Coghlan | [email protected] | Brisbane, Australia
Re: [Python-Dev] sendmsg/recvmsg on Mac OS X
> The buildbots are complaining about some of the tests for the new
> socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
> provide CMSG_LEN.

Looks like kernel bugs:
http://developer.apple.com/library/mac/#qa/qa1541/_index.html

"""
Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor passing [...]
Avoid passing two or more descriptors back-to-back.
"""

We should probably add @requires_mac_ver(10, 5) for testFDPassSeparate and testFDPassSeparateMinSpace.

As for InterruptedSendTimeoutTest and testInterruptedSendmsgTimeout, it also looks like a kernel bug: the syscall should fail with EINTR once the socket buffer is full. I guess one should skip those on OS-X.
Re: [Python-Dev] PEP 393 Summer of Code Project
Nick Coghlan, 24.08.2011 15:06:
> On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
>> In utf16.py, attached to http://bugs.python.org/issue12729
>> I propose for consideration a prototype of different solution to the 'mostly
>> BMP chars, few non-BMP chars' case. Rather than expand every character from
>> 2 bytes to 4, attach an array cpdex of character (ie code point, not code
>> unit) indexes. Then for indexing and slicing, the correction is simple,
>> simpler than I first expected:
>> code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
>> where code-unit-index is the adjusted index into the full underlying
>> double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
>> most of the space penalty and the consequent time penalty of moving more
>> bytes around and increasing cache misses.
>
> Interesting idea, but putting on my C programmer hat, I say -1.
>
> Non-uniform cell size = not a C array = standard C array manipulation
> idioms don't work = pain (no matter how simple the index correction
> happens to be).
>
> The nice thing about PEP 393 is that it gives us the smallest storage
> array that is both an ordinary C array and has sufficiently large
> individual elements to handle every character in the string.

+1

Stefan
Re: [Python-Dev] PEP 393 Summer of Code Project
Terry Reedy writes: > Excuse me for believing the fine 3.2 manual that says > "Strings contain Unicode characters." The manual is wrong, then, subject to a pronouncement to the contrary, of course. I was on your side of the fence when this was discussed, pre-release. I was wrong then. My bet is that we are still wrong, now. > For the purpose of my sentence, the same thing in that code points > correspond to characters, Not in Unicode, they do not. By definition, a small number of code points (eg, U+) *never* did and *never* will correspond to characters. Since about Unicode 3.0, the same is true of surrogate code points. Some restrictions have been placed on what can be done with composed characters, so even with the PEP (which gives us code point arrays) we do not really get arrays of Unicode characters that fully conform to the model. > strings are NOT code point sequences. They are 2-byte code *unit* > sequences. I stand corrected on Unicode terminology. "Code unit" is what I meant, and what I understand Guido to have defined unicode objects as arrays of. > Any narrow build string with even 1 non-BMP char violates the > standard. Yup. That's by design. > > Guido has made that absolutely clear on a number > > of occasions. > > It is not clear what you mean, but recently on python-ideas he has > reiterated that he intends bytes and strings to be conceptually > different. Sure. Nevertheless, practicality beat purity long ago, and that decision has never been rescinded AFAIK. > Bytes are computer-oriented binary arrays; strings are > supposedly human-oriented character/codepoint arrays. And indeed they are, in UCS-4 builds. But they are *not* in Unicode! Unicode violates the array model. Specifically, in handling composing characters, and in bidi, where arbitrary slicing of direction control characters will result in garbled display. The thing is, that 90% of applications are not really going to care about full conformance to the Unicode standard. Of the remaining 10%, 90% are not going to need both huge strings *and* ABI interoperability with C modules compiled for UCS-2, so UCS-4 is satisfactory. Of the remaining 1% of all applications, those that deal with huge strings *and* need full Unicode conformance, well, they need efficiency too almost by definition. They probably are going to want something more efficient than either the UTF-16 or the UTF-32 representation can provide, and therefore will need trickier, possibly app-specific, algorithms that probably do not belong in an initial implementation. > > And the reasons have very little to do with lack of > > non-BMP characters to trip up the implementation. Changing those > > semantics should have been done before the release of Python 3. > > The documentation was changed at least a bit for 3.0, and anyway, as > indicated above, it is easy (especially for new users) to read the docs > in a way that makes the current behavior buggy. I agree that the > implementation should have been changed already. I don't. I suspect Guido does not, even today. > Currently, the meaning of Python code differs on narrow versus wide > build, and in a way that few users would expect or want. Let them become developers, then, and show us how to do it better. > PEP 393 abolishes narrow builds as we now know them and changes > semantics. I was answering a complaint about that change. If you do > not like the PEP, fine. No, I do like the PEP. However, it is only a step, a rather conservative one in some ways, toward conformance to the Unicode character model. 
In particular, it does nothing to resolve the fact that len() will give different answers for character count depending on normalization, and that slicing and indexing will allow you to cut characters in half (even in NFC, since not all composed characters have fully composed forms). > > It is not clear to me that it is a good idea to try to decide on "the" > > correct implementation of Unicode strings in Python even today. > > If the implementation is invisible to the Python user, as I believe it > should be without specially introspection, and mostly invisible in the > C-API except for those who intentionally poke into the details, then the > implementation can be changed as the consensus on best implementation > changes. A naive implementation of UTF-16 will be quite visible in terms of performance, I suspect, and performance-oriented applications will "go behind the API's back" to get it. We're already seeing that in the people who insist that bytes are characters too, and string APIs should work on them just as they do on (Unicode) strings. > > It's true that Python is going to need good libraries to provide > > correct handling of Unicode strings (as opposed to unicode objects). > > Given that 3.0 unicode (string) objects are defined as Unicode character > strings, I do not see the opposition. I think they're not
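Stephen's point that len() and slicing operate below the level of user-perceived characters holds regardless of representation. A small illustration with normalization (my own example, not from the thread):

    import unicodedata

    s = "e\u0301"                                 # 'e' + COMBINING ACUTE ACCENT
    nfc = unicodedata.normalize("NFC", s)         # composes to the single code point U+00E9
    nfd = unicodedata.normalize("NFD", "\u00e9")  # decomposes back to two code points

    assert len(s) == 2 and len(nfc) == 1 and len(nfd) == 2
    assert nfd[:1] == "e"   # slicing by code point can split one perceived character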
Re: [Python-Dev] PEP 393 Summer of Code Project
On Thu, 25 Aug 2011 01:34:17 +0900
"Stephen J. Turnbull" wrote:
>
> Martin has long claimed that the fact that I/O is done in terms of
> UTF-16 means that the internal representation is UTF-16

Which I/O?
Re: [Python-Dev] sendmsg/recvmsg on Mac OS X
On Wed, 24 Aug 2011 15:31:50 +0200
Charles-François Natali wrote:
>> The buildbots are complaining about some of the tests for the new
>> socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
>> provide CMSG_LEN.
>
> Looks like kernel bugs:
> http://developer.apple.com/library/mac/#qa/qa1541/_index.html
>
> """
> Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor passing
> [...]
> Avoid passing two or more descriptors back-to-back.
> """

But Snow Leopard, where these failures occur, is OS X 10.6.

Antoine.
Re: [Python-Dev] FileSystemError or FilesystemError?
+1 for FileSystemError. I see myself misspelling it as FileSystemError if we go with the alternate spelling. I probably won't be the only one.

Thank you,
Vlad

On Wed, Aug 24, 2011 at 4:09 AM, Eli Bendersky wrote:
>> When reviewing the PEP 3151 implementation (*), Ezio commented that
>> "FileSystemError" looks a bit strange and that "FilesystemError" would
>> be a better spelling. What is your opinion?
>>
>> (*) http://bugs.python.org/issue12555
>
> +1 for FileSystemError
>
> Eli
Re: [Python-Dev] PEP 393 Summer of Code Project
Antoine Pitrou writes:
> On Thu, 25 Aug 2011 01:34:17 +0900
> "Stephen J. Turnbull" wrote:
>>
>> Martin has long claimed that the fact that I/O is done in terms of
>> UTF-16 means that the internal representation is UTF-16
>
> Which I/O?

Eg, display of characters in the interpreter.
Re: [Python-Dev] PEP 393 Summer of Code Project
Le jeudi 25 août 2011 à 02:15 +0900, Stephen J. Turnbull a écrit :
> Antoine Pitrou writes:
>> On Thu, 25 Aug 2011 01:34:17 +0900
>> "Stephen J. Turnbull" wrote:
>>>
>>> Martin has long claimed that the fact that I/O is done in terms of
>>> UTF-16 means that the internal representation is UTF-16
>>
>> Which I/O?
>
> Eg, display of characters in the interpreter.

I don't know why you say it's "done in terms of UTF-16", then. Unicode strings are simply encoded to whatever character set is detected as the terminal's character set.

Regards

Antoine.
Re: [Python-Dev] PEP 393 Summer of Code Project
Le 24/08/2011 02:46, Terry Reedy a écrit :
> On 8/23/2011 9:21 AM, Victor Stinner wrote:
>> Le 23/08/2011 15:06, "Martin v. Löwis" a écrit :
>>> Well, things have to be done in order:
>>> 1. the PEP needs to be approved
>>> 2. the performance bottlenecks need to be identified
>>> 3. optimizations should be applied.
>>
>> I would not vote for the PEP if it slows down Python, especially if
>> it's much slower. But Torsten says that it speeds up Python, which is
>> surprising. I have to do my own benchmarks :-)
>
> The current UCS2 Unicode string implementation, by design, quickly gives
> WRONG answers for len(), iteration, indexing, and slicing if a string
> contains any non-BMP (surrogate pair) Unicode characters. That may have
> been excusable when there essentially were no such extended chars, and
> the few there were were almost never used. But now there are many more,
> with more being added to each Unicode edition. They include cursive Math
> letters that are used in English documents today. The problem will
> slowly get worse and Python, at least on Windows, will become a language
> to avoid for dependable Unicode document processing. 3.x needs a proper
> Unicode implementation that works for all strings on all builds.

I don't think that using UTF-16 with surrogate pairs is really a big problem. A lot of work has been done to hide this. For example, repr(chr(0x10FFFF)) now displays '\U0010ffff' instead of two characters. Ezio recently fixed the str.is*() methods in Python 3.2+.

For len(str): it's a known problem, but if you really care about the number of *characters* and not the number of UTF-16 units, it's easy to implement your own character_length() function. len(str) gives the UTF-16 units instead of the number of characters for a simple reason: it's faster: O(1), whereas character_length() is O(n).

> utf16.py, attached to http://bugs.python.org/issue12729 prototypes a
> different solution than the PEP for the above problems for the 'mostly
> BMP' case. I will discuss it in a different post.

Yeah, you can work around UTF-16 limits using O(n) algorithms. PEP 393 provides support of the full Unicode charset (U+0000-U+10FFFF) on all platforms with a small memory footprint and only O(1) functions.

Note: Java and the Qt library also use UTF-16 strings and have exactly the same "limitations" for str[n] and len(str).

Victor
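The character_length() function Victor mentions is indeed short. A rough sketch for a narrow build (my own, counting code points by skipping the low half of each surrogate pair):

    def character_length(s):
        # O(n): count code points rather than UTF-16 code units.  On a narrow
        # build a non-BMP character appears as a high surrogate followed by a
        # low surrogate, so skip the low surrogates when counting.
        return sum(1 for ch in s if not 0xDC00 <= ord(ch) <= 0xDFFF)

On a wide build (or with PEP 393), well-formed text contains no surrogates, so this simply agrees with len().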
Re: [Python-Dev] PEP 393 Summer of Code Project
> > PEP 393 abolishes narrow builds as we now know them and changes > > semantics. I was answering a complaint about that change. If you do > > not like the PEP, fine. > > No, I do like the PEP. However, it is only a step, a rather > conservative one in some ways, toward conformance to the Unicode > character model. I'd like to point out that the improved compatibility is only a side effect, not the primary objective of the PEP. The primary objective is the reduction in memory usage. (any changes in runtime are also side effects, and it's not really clear yet whether you get speedups or slowdowns on average, or no effect). > > Given that 3.0 unicode (string) objects are defined as Unicode character > > strings, I do not see the opposition. > > I think they're not, I think they're defined as Unicode code unit > arrays, and that the documentation is in error. That's just a description of the implementation, and not part of the language, though. My understanding is that the "abstract Python language definition" considers this aspect implementation-defined: PyPy, Jython, IronPython etc. would be free to do things differently (and I understand that there are plans to do PEP-393 style Unicode objects in PyPy). > Martin has long claimed that the fact that I/O is done in terms of > UTF-16 means that the internal representation is UTF-16, so I could be > wrong. But when issues of slicing, len() values and so on have come > up in the past, Guido has always said "no, there will be no change in > semantics of builtins here". Not with these words, though. As I recall, it's rather like (still with different words) "len() will stay O(1) forever, regardless of any perceived incorrectness of this choice". An attempt to change the builtins to introduce higher complexity for the sake of correctness is what he rejects. I think PEP 393 balances this well, keeping the O(1) operations in that complexity, while improving the cross- platform "correctness" of these functions. Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
>> Eg, display of characters in the interpreter.
>
> I don't know why you say it's "done in terms of UTF-16", then. Unicode
> strings are simply encoded to whatever character set is detected as the
> terminal's character set.

I think what he means (and what I meant when I said something similar): I/O will consider surrogate pairs in the representation when converting to the output encoding. This is actually relevant only for UTF-8 (I think), which converts surrogate pairs "correctly". This can be taken as a proof that Python 3.2 is "UTF-16 aware" (in some places, but not in others).

With Python's I/O architecture, it is of course not *actually* the I/O which considers UTF-16, but the codec.

Regards,
Martin
Re: [Python-Dev] PEP 393 Summer of Code Project
Le 24/08/2011 11:22, Glenn Linderman a écrit :
>>> c) mostly ASCII (utf8) with clever indexing/caching to be efficient
>>> d) UTF-8 with clever indexing/caching to be efficient
>>
>> I see neither a need nor a means to consider these.
>
> The discussion about "mostly ASCII" strings seems convincing that there
> could be a significant space savings if such were implemented.

Antoine's optimization in the UTF-8 decoder has been removed. It doesn't change the memory footprint, it is just slower to create the Unicode object. When you decode a UTF-8 string:

- "abc" string uses "latin1" (8 bits) units
- "aé" string uses "latin1" (8 bits) units <= cool!
- "a€" string uses UCS2 (16 bits) units
- "a\U0010FFFF" string uses UCS4 (32 bits) units

Victor
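The width selection Victor describes can be mimicked in pure Python to check which unit size a decoded string would end up needing. This is an illustration of the rule only, not the decoder's actual code:

    def pep393_kind(s):
        # Pick the narrowest unit size able to hold the largest code point.
        maxchar = max(map(ord, s)) if s else 0
        if maxchar < 0x100:
            return "latin1 (1 byte per character)"
        if maxchar < 0x10000:
            return "UCS2 (2 bytes per character)"
        return "UCS4 (4 bytes per character)"

    for s in ("abc", "a\xe9", "a\u20ac", "a\U0010FFFF"):
        print(ascii(s), "->", pep393_kind(s))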
[Python-Dev] PEP 393 review
Guido has agreed to eventually pronounce on PEP 393. Before that can happen, I'd like to collect feedback on it. There have been a number of voices supporting the PEP in principle, so I'm now interested in comments in the following areas:

- principled objections. I'll list them in the PEP.
- issues to be considered (unclarities, bugs, limitations, ...)
- conditions you would like to pose on the implementation before acceptance. I'll see which of these can be resolved, and list the ones that remain open.

Regards,
Martin
Re: [Python-Dev] PEP 393 review
On Wed, 24 Aug 2011 20:15:24 +0200
"Martin v. Löwis" wrote:
> - issues to be considered (unclarities, bugs, limitations, ...)

With this PEP, the unicode object overhead grows to 10 pointer-sized words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. Does it have any adverse effects?

Are there any plans to make instantiation of small strings fast enough? Or is it already as fast as it should be?

When interfacing with the Win32 "wide" APIs, what is the recommended way to get the required LPCWSTR?

Will the format codes returning a Py_UNICODE pointer with PyArg_ParseTuple be deprecated?

Do you think the wstr representation could be removed in some future version of Python?

Is PyUnicode_Ready() necessary for all unicode objects, or only those allocated through the legacy API?

“The Py_Unicode representation is not instantaneously available”: you mean the Py_UNICODE representation?

> - conditions you would like to pose on the implementation before
> acceptance. I'll see which of these can be resolved, and list
> the ones that remain open.

That it doesn't significantly slow down benchmarks such as stringbench and iobench.

Regards

Antoine.
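As a rough way to look at the per-string overhead question empirically on whatever build is at hand (just an eyeball check, not a statement about the PEP 393 branch itself):

    import sys

    # sys.getsizeof() reports the object header plus the character buffer, so
    # the size of a short string minus the space taken by its characters gives
    # a feel for the fixed per-string overhead of the running interpreter.
    for s in ("", "a", "abcdefgh", "abcdefgh" * 4):
        print(len(s), sys.getsizeof(s))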
Re: [Python-Dev] sendmsg/recvmsg on Mac OS X
In article <[email protected]>, Antoine Pitrou wrote: > On Wed, 24 Aug 2011 15:31:50 +0200 > Charles-François Natali wrote: > > > The buildbots are complaining about some of tests for the new > > > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that > > > provide CMSG_LEN. > > > > Looks like kernel bugs: > > http://developer.apple.com/library/mac/#qa/qa1541/_index.html > > > > """ > > Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor > > passing > > [...] > > Avoid passing two or more descriptors back-to-back. > > """ > > But Snow Leopard, where these failures occur, is OS X 10.6. But chances are the build is using the default 10.4 ABI. Adding MACOSX_DEPLOYMENT_TARGET=10.6 as an env variable to ./configure may fix it. There is an open issue to change configure to use better defaults for this. (I'm right in the middle of reconfiguring my development systems so I can't test it myself immediately but I'll report back shortly.) -- Ned Deily, [email protected] ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On 8/24/2011 1:50 PM, "Martin v. Löwis" wrote:
> I'd like to point out that the improved compatibility is only a side
> effect, not the primary objective of the PEP.

Then why does the Rationale start with "on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported."? A Windows user can only solve this problem by switching to *nix.

> The primary objective is the reduction in memory usage.

On average (perhaps). As I understand the PEP, for some strings, Windows users will see a doubling of memory usage. Statistically, that doubling is probably more likely in longer texts.

Ascii-only Python code and other limited-to-ascii text will benefit. Typical English business documents will see no change as they often have proper non-ascii quotes and occasional accented characters, trademark symbols, and other things.

I think you have the objectives backwards. Adding memory is a lot easier than switching OSes.

-- 
Terry Jan Reedy
Re: [Python-Dev] sendmsg/recvmsg on Mac OS X
On Wed, 24 Aug 2011 11:37:20 -0700 Ned Deily wrote: > In article <[email protected]>, > Antoine Pitrou wrote: > > On Wed, 24 Aug 2011 15:31:50 +0200 > > Charles-François Natali wrote: > > > > The buildbots are complaining about some of tests for the new > > > > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that > > > > provide CMSG_LEN. > > > > > > Looks like kernel bugs: > > > http://developer.apple.com/library/mac/#qa/qa1541/_index.html > > > > > > """ > > > Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor > > > passing > > > [...] > > > Avoid passing two or more descriptors back-to-back. > > > """ > > > > But Snow Leopard, where these failures occur, is OS X 10.6. > > But chances are the build is using the default 10.4 ABI. Adding > MACOSX_DEPLOYMENT_TARGET=10.6 as an env variable to ./configure may fix > it. Does the ABI affect kernel bugs? Regards Antoine. ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On 8/24/2011 9:00 AM, Stefan Behnel wrote: Nick Coghlan, 24.08.2011 15:06: On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote: In utf16.py, attached to http://bugs.python.org/issue12729 I propose for consideration a prototype of different solution to the 'mostly BMP chars, few non-BMP chars' case. Rather than expand every character from 2 bytes to 4, attach an array cpdex of character (ie code point, not code unit) indexes. Then for indexing and slicing, the correction is simple, simpler than I first expected: code-unit-index = char-index + bisect.bisect_left(cpdex, char_index) where code-unit-index is the adjusted index into the full underlying double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids most of the space penalty and the consequent time penalty of moving more bytes around and increasing cache misses. Interesting idea, but putting on my C programmer hat, I say -1. Non-uniform cell size = not a C array = standard C array manipulation idioms don't work = pain (no matter how simple the index correction happens to be). The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every character in the string. +1 Yes, this sounds like a nice benefit, but the problem is it is false. The correct statement would be: The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every Unicode codepoint in the string. As Tom eloquently describes in the referenced issue (is Tom ever non-eloquent?), not all characters can be represented in a single codepoint. It seems there are three concepts in Unicode, code units, codepoints, and characters, none of which are equivalent (and the first of which varies according to the encoding). It also seems (to me) that Unicode has failed in its original premise, of being an easy way to handle "big char" for "all languages" with fixed size elements, but it is not clear that its original premise is achievable regardless of the size of "big char", when mixed directionality is desired, and it seems that support of some single languages require mixed directionality, not to mention mixed language support. Given the required variability of character size in all presently Unicode defined encodings, I tend to agree with Tom that UTF-8, together with some technique of translating character index to code unit offset, may provide the best overall space utilization, and adequate CPU efficiency. On the other hand, there are large subsets of applications that simply do not require support for bidirectional text or composed characters, and for those that do not, it remains to be seen if the price to be paid for supporting those features is too high a price for such applications. So far, we don't have implementations to benchmark to figure that out! What does this mean for Python? Well, if Python is willing to limit its support for applications to the subset for which the "big char" solution sufficient, then PEP 393 provides a way to do that, that looks to be pretty effective for reducing memory consumption for those applications that use short strings most of which can be classified by content into the 1 byte or 2 byte representations. 
Applications that support long strings are more likely to be bitten by the occasional "outlier" character that is longer than the average character, doubling or quadrupling the space needed to represent such strings, and eliminating a significant portion of the space savings the PEP is providing for other applications. Benchmarks may or may not fully reflect the actual requirements of all applications, so conclusions based on benchmarking can easily be blind-sided by the realities of other applications, unless the benchmarks are carefully constructed. It is possible that the ideas in PEP 393, with its support for multiple underlying representations, could be the basis for some more complex representations that would better support characters rather than only supporting code points, but Martin has stated he is not open to additional representations, so the PEP itself cannot be that basis (although with care which may or may not be taken in the implementation of the PEP, the implementation may still provide that basis). ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] sendmsg/recvmsg on Mac OS X
> But Snow Leopard, where these failures occur, is OS X 10.6. *sighs* It still looks like a kernel/libc bug to me: AFAICT, both the code and the tests are correct. And apparently, there are still issues pertaining to FD passing on 10.5 (and maybe later, I couldn't find a public access to their bug tracker): http://lists.apple.com/archives/Darwin-dev/2008/Feb/msg00033.html Anyway, if someone with a recent OS X release could run test_socket, it would probably help. Follow ups to http://bugs.python.org/issue6560 ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman wrote: > On 8/24/2011 9:00 AM, Stefan Behnel wrote: > > Nick Coghlan, 24.08.2011 15:06: > > On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote: > > In utf16.py, attached to http://bugs.python.org/issue12729 > I propose for consideration a prototype of different solution to the 'mostly > BMP chars, few non-BMP chars' case. Rather than expand every character from > 2 bytes to 4, attach an array cpdex of character (ie code point, not code > unit) indexes. Then for indexing and slicing, the correction is simple, > simpler than I first expected: > code-unit-index = char-index + bisect.bisect_left(cpdex, char_index) > where code-unit-index is the adjusted index into the full underlying > double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids > most of the space penalty and the consequent time penalty of moving more > bytes around and increasing cache misses. > > Interesting idea, but putting on my C programmer hat, I say -1. > > Non-uniform cell size = not a C array = standard C array manipulation > idioms don't work = pain (no matter how simple the index correction > happens to be). > > The nice thing about PEP 383 is that it gives us the smallest storage > array that is both an ordinary C array and has sufficiently large > individual elements to handle every character in the string. > > +1 > > Yes, this sounds like a nice benefit, but the problem is it is false. The > correct statement would be: > > The nice thing about PEP 383 is that it gives us the smallest storage > array that is both an ordinary C array and has sufficiently large > individual elements to handle every Unicode codepoint in the string. (PEP 393, I presume. :-) > As Tom eloquently describes in the referenced issue (is Tom ever > non-eloquent?), not all characters can be represented in a single codepoint. But this is also besides the point (except insofar where we have to remind ourselves not to confuse the two in docs). > It seems there are three concepts in Unicode, code units, codepoints, and > characters, none of which are equivalent (and the first of which varies > according to the encoding). It also seems (to me) that Unicode has failed > in its original premise, of being an easy way to handle "big char" for "all > languages" with fixed size elements, but it is not clear that its original > premise is achievable regardless of the size of "big char", when mixed > directionality is desired, and it seems that support of some single > languages require mixed directionality, not to mention mixed language > support. I see nothing wrong with having the language's fundamental data types (i.e., the unicode object, and even the re module) to be defined in terms of codepoints, not characters, and I see nothing wrong with len() returning the number of codepoints (as long as it is advertised as such). After all UTF-8 also defines an encoding for a sequence of code points. Characters that require two or more codepoints are not represented special in UTF-8 -- they are represented as two or more encoded codepoints. The added requirement that UTF-8 must only be used to represent valid characters is just that -- it doesn't affect how strings are encoded, just what is considered valid at a higher level. 
> Given the required variability of character size in all presently Unicode > defined encodings, I tend to agree with Tom that UTF-8, together with some > technique of translating character index to code unit offset, may provide > the best overall space utilization, and adequate CPU efficiency. There is no doubt that UTF-8 is the most space efficient. I just don't think it is worth giving up O(1) indexing of codepoints -- it would change programmers' expectations too much. OTOH I am sold on getting rid of the added complexities of "narrow builds" where not even all codepoints can be represented without using surrogate pairs (i.e. two code units per codepoint) and indexing uses code units instead of codepoints. I think this is an area where PEP 393 has a huge advantage: users can get rid of their exceptions for narrow builds. > On the > other hand, there are large subsets of applications that simply do not > require support for bidirectional text or composed characters, and for those > that do not, it remains to be seen if the price to be paid for supporting > those features is too high a price for such applications. So far, we don't > have implementations to benchmark to figure that out! I think you are saying that many apps can ignore the distinction between codepoints and characters. Given the complexity of bidi rendering and normalization (which will always remain an issue) I agree; this is much less likely to be a burden than the narrow-build issues with code units vs. codepoints. What should the stdlib do? It should try to skirt the issue where it can (using the garbage-in-garbage-out principle) and advertise what it supports where there is a difference. I don'
Re: [Python-Dev] PEP 393 Summer of Code Project
On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote: Terry Reedy writes: > Excuse me for believing the fine 3.2 manual that says > "Strings contain Unicode characters." The manual is wrong, then, subject to a pronouncement to the contrary, Please suggest a re-wording then, as it is a bug for doc and behavior to disagree. > For the purpose of my sentence, the same thing in that code points > correspond to characters, Not in Unicode, they do not. By definition, a small number of code points (eg, U+) *never* did and *never* will correspond to characters. On computers, characters are represented by code points. What about the other way around? http://www.unicode.org/glossary/#C says code point: 1) i in range(0x11000) 2) "A value, or position, for a character" (To muddy the waters more, 'character' has multiple definitions also.) You are using 1), I am using 2) ;-(. > Any narrow build string with even 1 non-BMP char violates the > standard. Yup. That's by design. [...] Sure. Nevertheless, practicality beat purity long ago, and that decision has never been rescinded AFAIK. I think you have it backwards. I see the current situation as the purity of the C code beating the practicality for the user of getting right answers. The thing is, that 90% of applications are not really going to care about full conformance to the Unicode standard. I remember when Intel argued that 99% of applications were not going to be affected when the math coprocessor in its then new chips occasionally gave 'non-standard' answers with certain divisors. > Currently, the meaning of Python code differs on narrow versus wide > build, and in a way that few users would expect or want. Let them become developers, then, and show us how to do it better. I posted a proposal with a link to a prototype implementation in Python. It pretty well solves the problem of narrow builds acting different from wide builds with respect to the basic operations of len(), iterations, indexing, and slicing. No, I do like the PEP. However, it is only a step, a rather conservative one in some ways, toward conformance to the Unicode character model. In particular, it does nothing to resolve the fact that len() will give different answers for character count depending on normalization, and that slicing and indexing will allow you to cut characters in half (even in NFC, since not all composed characters have fully composed forms). I believe my scheme could be extended to solve that also. It would require more pre-processing and more knowledge than I currently have of normalization. I have the impression that the grapheme problem goes further than just normalization. -- Terry Jan Reedy ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] sendmsg/recvmsg on Mac OS X
In article , Charles-Francois Natali wrote: > > But Snow Leopard, where these failures occur, is OS X 10.6. > > *sighs* > It still looks like a kernel/libc bug to me: AFAICT, both the code and > the tests are correct. > And apparently, there are still issues pertaining to FD passing on > 10.5 (and maybe later, I couldn't find a public access to their bug > tracker): > http://lists.apple.com/archives/Darwin-dev/2008/Feb/msg00033.html > > Anyway, if someone with a recent OS X release could run test_socket, > it would probably help. Follow ups to http://bugs.python.org/issue6560 I was able to do a quick test on 10.7 Lion and the 8 test failures still occur regardless of deployment target. Sorry, I don't have time to further investigate. -- Ned Deily, [email protected] ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] sendmsg/recvmsg on Mac OS X
In article <[email protected]>, Antoine Pitrou wrote: > On Wed, 24 Aug 2011 11:37:20 -0700 > Ned Deily wrote: > > In article <[email protected]>, > > Antoine Pitrou wrote: > > > On Wed, 24 Aug 2011 15:31:50 +0200 > > > Charles-François Natali wrote: > > > > > The buildbots are complaining about some of tests for the new > > > > > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that > > > > > provide CMSG_LEN. > > > > > > > > Looks like kernel bugs: > > > > http://developer.apple.com/library/mac/#qa/qa1541/_index.html > > > > > > > > """ > > > > Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor > > > > passing > > > > [...] > > > > Avoid passing two or more descriptors back-to-back. > > > > """ > > > > > > But Snow Leopard, where these failures occur, is OS X 10.6. > > > > But chances are the build is using the default 10.4 ABI. Adding > > MACOSX_DEPLOYMENT_TARGET=10.6 as an env variable to ./configure may fix > > it. > > Does the ABI affect kernel bugs? If it's more of a "libc" sort of bug (i.e. somewhere below the app layer), it could. But, unfortunately, that doesn't seem to be the case here. -- Ned Deily, [email protected] ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On 8/24/2011 1:45 PM, Victor Stinner wrote:
On 24/08/2011 02:46, Terry Reedy wrote:
I don't think that using UTF-16 with surrogate pairs is really a big
problem. A lot of work has been done to hide this. For example,
repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters.
Ezio fixed recently str.is*() methods in Python 3.2+.
I greatly appreciate that he did. The * (lower,upper,title) methods
apparently are not fixed yet as the corresponding new tests are
currently skipped for narrow builds.
For len(str): it's a known problem, but if you really care about the number
of *characters* and not the number of UTF-16 units, it's easy to
implement your own character_length() function. len(str) gives the number of
UTF-16 units instead of the number of characters for a simple reason:
it's faster: O(1), whereas character_length() is O(n).
It is O(1) after a one-time O(n) preprocessing, which is the same time
order for creating the string in the first place.
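For reference, here is a minimal sketch of what such a character_length()
could look like on a narrow (UTF-16) build, counting a surrogate pair as a
single character (the name comes from Victor's suggestion; the body is only
illustrative):
def character_length(s):
    # A low surrogate that directly follows a high surrogate completes
    # a pair and is not counted separately.
    count = 0
    prev_high = False
    for c in s:
        if prev_high and '\udc00' <= c <= '\udfff':
            prev_high = False
            continue
        count += 1
        prev_high = '\ud800' <= c <= '\udbff'
    return count
On a narrow build, character_length('abc\U0001043c') would give 4 where
len() gives 5; on a wide build both give 4.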
Anyway, I think the most important deficiency is with iteration:
>>> from unicodedata import name
>>> name('\U0001043c')
'DESERET SMALL LETTER DEE'
>>> for c in 'abc\U0001043c':
    print(name(c))
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
Traceback (most recent call last):
File "", line 2, in
print(name(c))
ValueError: no such name
This would work on wide builds but does not here (win7) because narrow
build iteration produces a naked non-character surrogate code unit that
has no specific entry in the Unicode Character Database.
I believe that most new people who read "Strings contain Unicode
characters." would expect string iteration to always produce the Unicode
characters that they put in the string. The extra time per char needed
to produce the surrogate pair that represents the character entered is
O(1).
utf16.py, attached to http://bugs.python.org/issue12729
prototypes a different solution than the PEP for the above problems for
the 'mostly BMP' case. I will discuss it in a different post.
Yeah, you can work around UTF-16 limits using O(n) algorithms.
I presented O(log(number of non-BMP chars)) algorithms for indexing and
slicing. For the mostly BMP case, that is hugely better than O(n).
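A minimal sketch of that index correction, with cpdex holding the sorted
code point indexes of the string's non-BMP characters as in the utf16.py
prototype (the function name here is only illustrative):
import bisect

def unit_index(char_index, cpdex):
    # Every non-BMP character located before char_index occupies one
    # extra UTF-16 code unit; bisect_left counts exactly those.
    return char_index + bisect.bisect_left(cpdex, char_index)

For 'a\U0001043cb' the code units are [a, high, low, b] and cpdex == [1], so
unit_index(0, [1]) -> 0, unit_index(1, [1]) -> 1, and unit_index(2, [1]) -> 3.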
PEP-393 provides support of the full Unicode charset (U+0000-U+10FFFF)
on all platforms with a small memory footprint and only O(1) functions.
For Windows users, I believe it will nearly double the memory footprint
if there are any non-BMP chars. On my new machine, I should not mind
that in exchange for correct behavior.
--
Terry Jan Reedy
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
Terry Reedy wrote: PEP-393 provides support of the full Unicode charset (U+0000-U+10FFFF) on all platforms with a small memory footprint and only O(1) functions. For Windows users, I believe it will nearly double the memory footprint if there are any non-BMP chars. On my new machine, I should not mind that in exchange for correct behavior. +1 Heck, I wouldn't mind it on my /old/ machine in exchange for correct behavior! ~Ethan~ ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On Wednesday 24 August 2011 20:52:51, Glenn Linderman wrote:
> Given the required variability of character size in all presently
> Unicode defined encodings, I tend to agree with Tom that UTF-8, together
> with some technique of translating character index to code unit offset,
> may provide the best overall space utilization, and adequate CPU
> efficiency.
UTF-8 can use more space than latin1 or UCS2:
>>> text="abc"; len(text.encode("latin1")), len(text.encode("utf8"))
(3, 3)
>>> text="ééé"; len(text.encode("latin1")), len(text.encode("utf8"))
(3, 6)
>>> text="€€€"; len(text.encode("utf-16-le")), len(text.encode("utf8"))
(6, 9)
>>> text="北京"; len(text.encode("utf-16-le")), len(text.encode("utf8"))
(4, 6)
UTF-8 uses less space than PEP 393 only if you have few non-ASCII characters
(or few non-BMP characters).
About speed, I guess that O(n) (UTF-8 indexing) is slower than O(1)
(PEP 393 indexing).
> ... Applications that support long
> strings are more likely to be bitten by the occasional "outlier" character
> that is longer than the average character, doubling or quadrupling the
> space needed to represent such strings, and eliminating a significant
> portion of the space savings the PEP is providing for other
> applications.
In these worst cases, PEP 393 is not worse than the current
implementation: it uses just as much memory as Python in wide mode (the mode
used on Linux and Mac OS X because wchar_t is 32 bits), but twice as much as
Python in narrow mode (Windows).
I agree that UTF-8 is better in these corner cases, but I also bet that most
Python programs will use less memory and will be faster with PEP 393. You
can already try the pep-393 branch on your own programs.
> Benchmarks may or may not fully reflect the actual
> requirements of all applications, so conclusions based on benchmarking
> can easily be blind-sided by the realities of other applications, unless
> the benchmarks are carefully constructed.
I used stringbench and "./python -m test test_unicode". I plan to try iobench.
Which other benchmark tool should be used? Should we write a new one?
> It is possible that the ideas in PEP 393, with its support for multiple
> underlying representations, could be the basis for some more complex
> representations that would better support characters rather than only
> supporting code points, ...
I don't think that the *default* Unicode type is the best place for this. The
base Unicode type has to be *very* efficient.
If you have unusual needs, write your own type. Maybe based on the base type?
Victor
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
> For Windows users, I believe it will nearly double the memory footprint > if there are any non-BMP chars. On my new machine, I should not mind > that in exchange for correct behavior. In addition, strings with non-BMP chars are much more rare than strings with all Latin-1, for which memory usage halves on Windows. Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 review
> With this PEP, the unicode object overhead grows to 10 pointer-sized
> words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
> Does it have any adverse effects?
For pure ASCII, it might be possible to use a shorter struct:
typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_hash_t hash;
int state;
Py_ssize_t wstr_length;
wchar_t *wstr;
/* no more utf8_length, utf8, str */
/* followed by ascii data */
} _PyASCIIObject;
(-2 pointer -1 ssize_t: 56 bytes)
=> "a" is 58 bytes (with utf8 for free, without wchar_t)
For objects allocated with the new API, we can use a shorter struct:
typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_hash_t hash;
int state;
Py_ssize_t wstr_length;
wchar_t *wstr;
Py_ssize_t utf8_length;
char *utf8;
/* no more str pointer */
/* followed by latin1/ucs2/ucs4 data */
} _PyNewUnicodeObject;
(-1 pointer: 72 bytes)
=> "é" is 74 bytes (without utf8 / wchar_t)
For the legacy API:
typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_hash_t hash;
int state;
Py_ssize_t wstr_length;
wchar_t *wstr;
Py_ssize_t utf8_length;
char *utf8;
void *str;
} _PyLegacyUnicodeObject;
(same size: 80 bytes)
=> "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)
The current struct:
typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_UNICODE *str;
Py_hash_t hash;
int state;
PyObject *defenc;
} PyUnicodeObject;
=> "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is
wchar_t)
... but the code (maybe only the macros?) and debugging will be more complex.
> Will the format codes returning a Py_UNICODE pointer with
> PyArg_ParseTuple be deprecated?
Because Python 2.x is still dominant and it's already hard enough to port C
modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).
> Do you think the wstr representation could be removed in some future
> version of Python?
Conversion to wchar_t* is common, especially on Windows. But I don't know if
we *have to* cache the result. Is it cached by the way? Or is wstr only used
when a string is created from Py_UNICODE?
Victor
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On 8/24/2011 12:34 PM, Guido van Rossum wrote: On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman wrote: On 8/24/2011 9:00 AM, Stefan Behnel wrote: Nick Coghlan, 24.08.2011 15:06: On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote: In utf16.py, attached to http://bugs.python.org/issue12729 I propose for consideration a prototype of different solution to the 'mostly BMP chars, few non-BMP chars' case. Rather than expand every character from 2 bytes to 4, attach an array cpdex of character (ie code point, not code unit) indexes. Then for indexing and slicing, the correction is simple, simpler than I first expected: code-unit-index = char-index + bisect.bisect_left(cpdex, char_index) where code-unit-index is the adjusted index into the full underlying double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids most of the space penalty and the consequent time penalty of moving more bytes around and increasing cache misses. Interesting idea, but putting on my C programmer hat, I say -1. Non-uniform cell size = not a C array = standard C array manipulation idioms don't work = pain (no matter how simple the index correction happens to be). The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every character in the string. +1 Yes, this sounds like a nice benefit, but the problem is it is false. The correct statement would be: The nice thing about PEP 383 is that it gives us the smallest storage array that is both an ordinary C array and has sufficiently large individual elements to handle every Unicode codepoint in the string. (PEP 393, I presume. :-) This statement might yet be made true :) As Tom eloquently describes in the referenced issue (is Tom ever non-eloquent?), not all characters can be represented in a single codepoint. But this is also besides the point (except insofar where we have to remind ourselves not to confuse the two in docs). In the docs, yes, and in programmer's minds (influenced by docs). It seems there are three concepts in Unicode, code units, codepoints, and characters, none of which are equivalent (and the first of which varies according to the encoding). It also seems (to me) that Unicode has failed in its original premise, of being an easy way to handle "big char" for "all languages" with fixed size elements, but it is not clear that its original premise is achievable regardless of the size of "big char", when mixed directionality is desired, and it seems that support of some single languages require mixed directionality, not to mention mixed language support. I see nothing wrong with having the language's fundamental data types (i.e., the unicode object, and even the re module) to be defined in terms of codepoints, not characters, and I see nothing wrong with len() returning the number of codepoints (as long as it is advertised as such). Me neither. After all UTF-8 also defines an encoding for a sequence of code points. Characters that require two or more codepoints are not represented special in UTF-8 -- they are represented as two or more encoded codepoints. The added requirement that UTF-8 must only be used to represent valid characters is just that -- it doesn't affect how strings are encoded, just what is considered valid at a higher level. Yes, this is true. 
In one sense, though, since UTF-8-supporting code already has to deal with variable length codepoint encoding, support for variable length character encoding seems like a minor extension, not upsetting any concept of fixed-width optimizations, because such cannot be used. Given the required variability of character size in all presently Unicode defined encodings, I tend to agree with Tom that UTF-8, together with some technique of translating character index to code unit offset, may provide the best overall space utilization, and adequate CPU efficiency. There is no doubt that UTF-8 is the most space efficient. I just don't think it is worth giving up O(1) indexing of codepoints -- it would change programmers' expectations too much. Programmers that have to deal with bidi or composed characters shouldn't have such expectations, of course. But there are many programmers who do not, or at least who think they do not, and they can retain their O(1) expectations, I suppose, until it bites them. OTOH I am sold on getting rid of the added complexities of "narrow builds" where not even all codepoints can be represented without using surrogate pairs (i.e. two code units per codepoint) and indexing uses code units instead of codepoints. I think this is an area where PEP 393 has a huge advantage: users can get rid of their exceptions for narrow builds. Yep, the only justification for narrow builds is in interfacing to underlying broken OS that happen to use that encoding... it might be slightly more efficient when doing API calls to such an O
Re: [Python-Dev] PEP 393 Summer of Code Project
On 25 August 2011 07:10, Victor Stinner wrote: > > I used stringbench and "./python -m test test_unicode". I plan to try > iobench. > > Which other benchmark tool should be used? Should we write a new one? I think that the PyPy benchmarks (or at least selected tests such as slowspitfire) would probably exercise things quite well. http://speed.pypy.org/about/ Tim Delaney ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On Wed, Aug 24, 2011 at 3:29 PM, Glenn Linderman wrote: > It would seem helpful if the stdlib could have some support for efficient > handling of Unicode characters in some representation. It would help > address the class of applications that does care. I claim that we have insufficient understanding of their needs to put anything in the stdlib. Wait and see is a good strategy here. > Adding extra support for > Unicode character handling sooner rather than later could be an performance > boost to applications that do care about full character support, and I can > only see the numbers of such applications increasing over time. Such could > be built as a subtype of str, perhaps, but if done in Python, there would > likely be a significant performance hit when going from str to > "unicodeCharacterStr". Sounds like overengineering to me. The right time to add something to the stdlib is when a large number of apps *currently* need something, not when you expect that they might need it in the future. (There just are too many possible futures to plan for them all. YAGNI rules.) -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
Antoine Pitrou writes: > Le jeudi 25 août 2011 à 02:15 +0900, Stephen J. Turnbull a écrit : > > Antoine Pitrou writes: > > > On Thu, 25 Aug 2011 01:34:17 +0900 > > > "Stephen J. Turnbull" wrote: > > > > > > > > Martin has long claimed that the fact that I/O is done in terms of > > > > UTF-16 means that the internal representation is UTF-16 > > > > > > Which I/O? > > > > Eg, display of characters in the interpreter. > > I don't know why you say it's "done in terms of UTF-16", then. Unicode > strings are simply encoded to whatever character set is detected as the > terminal's character set. But it's not "simple" at the level we're talking about! Specifically, *in-memory* surrogates are properly respected when doing the encoding, and therefore such I/O is not UCS-2 or "raw code units". This treatment is different from sizing and indexing of unicodes, where surrogates are not treated differently from other code points. ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
Terry Reedy writes: > Please suggest a re-wording then, as it is a bug for doc and behavior to > disagree. Strings contain Unicode code units, which for most purposes can be treated as Unicode characters. However, even as "simple" an operation as "s1[0] == s2[0]" cannot be relied upon to give Unicode-conforming results. The second sentence remains true under PEP 393. > > > For the purpose of my sentence, the same thing in that code points > > > correspond to characters, > > > > Not in Unicode, they do not. By definition, a small number of code > > points (eg, U+) *never* did and *never* will correspond to > > characters. > > On computers, characters are represented by code points. What about the > other way around? http://www.unicode.org/glossary/#C says > code point: > 1) i in range(0x11000) > 2) "A value, or position, for a character" > (To muddy the waters more, 'character' has multiple definitions also.) > You are using 1), I am using 2) ;-(. No, you're not. You are claiming an isomorphism, which Unicode goes to great trouble to avoid. > I think you have it backwards. I see the current situation as the purity > of the C code beating the practicality for the user of getting right > answers. Sophistry. "Always getting the right answer" is purity. > > The thing is, that 90% of applications are not really going to care > > about full conformance to the Unicode standard. > > I remember when Intel argued that 99% of applications were not going to > be affected when the math coprocessor in its then new chips occasionally > gave 'non-standard' answers with certain divisors. In the case of Intel, the people who demanded standard answers did so for efficiency reasons -- they needed the FPU to DTRT because implementing FP in software was always going to be too slow. CPython, IMO, can afford to trade off because the implementation will necessarily be in software, and can be added later as a Python or C module. > I believe my scheme could be extended to solve [conformance for > composing characters] also. It would require more pre-processing > and more knowledge than I currently have of normalization. I have > the impression that the grapheme problem goes further than just > normalization. Yes and yes. But now you're talking about database lookups for every character (to determine if it's a composing character). Efficiency of a generic implementation isn't going to happen. Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's pronouncement, "indexing is going to be O(1)". And Nick's point about non-uniform arrays is telling. I have 20 years of experience with an implementation of text as a non-uniform array which presents an array API, and *everything* needs to be special-cased for efficiency, and *any* small change can have show-stopping performance implications. Python can probably do better than Emacs has done due to much better leadership in this area, but I still think it's better to make full conformance optional. ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
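To make the s1[0] == s2[0] point above concrete, a small interpreter example with the NFC and NFD forms of the same accented character (canonically equivalent strings that nevertheless differ code point by code point):
>>> import unicodedata
>>> s1 = unicodedata.normalize('NFC', 'e\u0301')   # U+00E9
>>> s2 = unicodedata.normalize('NFD', 'e\u0301')   # U+0065 U+0301
>>> len(s1), len(s2)
(1, 2)
>>> s1 == s2, s1[0] == s2[0]
(False, False)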
Re: [Python-Dev] PEP 393 Summer of Code Project
Guido van Rossum writes: > I see nothing wrong with having the language's fundamental data types > (i.e., the unicode object, and even the re module) to be defined in > terms of codepoints, not characters, and I see nothing wrong with > len() returning the number of codepoints (as long as it is advertised > as such). In fact, the Unicode Standard, Version 6, goes farther (to code units): 2.7 Unicode Strings A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units, a Unicode 16-bit string is an ordered sequence of 16-bit code units, and a Unicode 32-bit string is an ordered sequence of 32-bit code units. Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. (p. 32). ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull wrote: > Terry Reedy writes: > > > Please suggest a re-wording then, as it is a bug for doc and behavior to > > disagree. > > Strings contain Unicode code units, which for most purposes can be > treated as Unicode characters. However, even as "simple" an > operation as "s1[0] == s2[0]" cannot be relied upon to give > Unicode-conforming results. > > The second sentence remains true under PEP 393. Really? If strings contain code units, that expression compares code units. What is non-conforming about comparing two code points? They are just integers. Seriously, what does Unicode-conforming mean here? It would be better to specify chapter and verse (e.g. is it a specific thing defined by the dreaded TR18?) > > > > For the purpose of my sentence, the same thing in that code points > > > > correspond to characters, > > > > > > Not in Unicode, they do not. By definition, a small number of code > > > points (eg, U+) *never* did and *never* will correspond to > > > characters. > > > > On computers, characters are represented by code points. What about the > > other way around? http://www.unicode.org/glossary/#C says > > code point: > > 1) i in range(0x11000) > > 2) "A value, or position, for a character" > > (To muddy the waters more, 'character' has multiple definitions also.) > > You are using 1), I am using 2) ;-(. > > No, you're not. You are claiming an isomorphism, which Unicode goes > to great trouble to avoid. I don't know that we will be able to educate our users to the point where they will use code unit, code point, character, glyph, character set, encoding, and other technical terms correctly. TBH even though less than two hours ago I composed a reply in this thread, I've already forgotten which is a code point and which is a code unit. > > I think you have it backwards. I see the current situation as the purity > > of the C code beating the practicality for the user of getting right > > answers. > > Sophistry. "Always getting the right answer" is purity. Eh? In most other areas Python is pretty careful not to promise to "always get the right answer" since what is right is entirely in the user's mind. We often go to great lengths of defining how things work so as to set the right expectations. For example, variables in Python work differently than in most other languages. Now I am happy to admit that for many Unicode issues the level at which we have currently defined things (code units, I think -- the thingies that encodings are made of) is confusing, and it would be better to switch to the others (code points, I think). But characters are right out. > > > The thing is, that 90% of applications are not really going to care > > > about full conformance to the Unicode standard. > > > > I remember when Intel argued that 99% of applications were not going to > > be affected when the math coprocessor in its then new chips occasionally > > gave 'non-standard' answers with certain divisors. > > In the case of Intel, the people who demanded standard answers did so > for efficiency reasons -- they needed the FPU to DTRT because > implementing FP in software was always going to be too slow. CPython, > IMO, can afford to trade off because the implementation will > necessarily be in software, and can be added later as a Python or C module. It is not so easy to change expectations about O(1) vs. O(N) behavior of indexing however. 
IMO we shouldn't try and hence we're stuck with operations defined in terms of code thingies instead of (mostly mythical) characters. > > I believe my scheme could be extended to solve [conformance for > > composing characters] also. It would require more pre-processing > > and more knowledge than I currently have of normalization. I have > > the impression that the grapheme problem goes further than just > > normalization. > > Yes and yes. But now you're talking about database lookups for every > character (to determine if it's a composing character). Efficiency of > a generic implementation isn't going to happen. Let's take small steps. Do the evolutionary thing. Let's get things right so users won't have to worry about code points vs. code units any more. A conforming library for all things at the character level can be developed later, once we understand things better at that level (again, most developers don't even understand most of the subtleties, so I claim we're not ready). > Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's > pronouncement, "indexing is going to be O(1)". I still think that. It would be too big of a cultural upheaval to change it. > And Nick's point about > non-uniform arrays is telling. I have 20 years of experience with an > implementation of text as a non-uniform array which presents an array > API, and *everything* needs to be special-cased for efficiency, and > *any* small change can have show-stopping performanc
Re: [Python-Dev] PEP 393 Summer of Code Project
On Wed, Aug 24, 2011 at 5:36 PM, Stephen J. Turnbull wrote: > Guido van Rossum writes: > > > I see nothing wrong with having the language's fundamental data types > > (i.e., the unicode object, and even the re module) to be defined in > > terms of codepoints, not characters, and I see nothing wrong with > > len() returning the number of codepoints (as long as it is advertised > > as such). > > In fact, the Unicode Standard, Version 6, goes farther (to code units): > > 2.7 Unicode Strings > > A Unicode string data type is simply an ordered sequence of code > units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit > code units, a Unicode 16-bit string is an ordered sequence of > 16-bit code units, and a Unicode 32-bit string is an ordered > sequence of 32-bit code units. > > Depending on the programming environment, a Unicode string may or > may not be required to be in the corresponding Unicode encoding > form. For example, strings in Java, C#, or ECMAScript are Unicode > 16-bit strings, but are not necessarily well-formed UTF-16 > sequences. > > (p. 32). I am assuming that that definition only applies to use of the term "unicode string" within the standard and has no bearing on how programming languages are allowed to use the term, as that would be preposterous. (They can define what they mean by terms like well-formed and conforming etc., and I won't try to go against that. But limiting what can be called a unicode string feels like unproductive coddling.) -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum wrote: > Now I am happy to admit that for many Unicode issues the level at > which we have currently defined things (code units, I think -- the > thingies that encodings are made of) is confusing, and it would be > better to switch to the others (code points, I think). But characters > are right out. Indeed, code points are the abstract concept and code units are the specific byte sequences that are used for serialisation (FWIW, I'm going to try to keep this straight in the future by remembering that the Unicode character set is defined as abstract points on planes, just like geometry). With narrow builds, code units can currently come into play internally, but with PEP 393 everything internal will be working directly with code points. Normalisation, combining characters and bidi issues may still affect the correctness of unicode comparison and slicing (and other text manipulation), but there are limits to how much of the underlying complexity we can effectively hide without being misleading. Cheers, Nick. -- Nick Coghlan | [email protected] | Brisbane, Australia ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On Wed, Aug 24, 2011 at 7:47 PM, Nick Coghlan wrote: > On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum wrote: >> Now I am happy to admit that for many Unicode issues the level at >> which we have currently defined things (code units, I think -- the >> thingies that encodings are made of) is confusing, and it would be >> better to switch to the others (code points, I think). But characters >> are right out. > > Indeed, code points are the abstract concept and code units are the > specific byte sequences that are used for serialisation (FWIW, I'm > going to try to keep this straight in the future by remembering that > the Unicode character set is defined as abstract points on planes, > just like geometry). Hm, code points still look pretty concrete to me (integers in the range 0 .. 2**21) and code units don't feel like byte sequences to me (at least not UTF-16 code units -- in Python at least you can think of them as integers in the range 0 .. 2**16). > With narrow builds, code units can currently come into play > internally, but with PEP 393 everything internal will be working > directly with code points. Normalisation, combining characters and > bidi issues may still affect the correctness of unicode comparison and > slicing (and other text manipulation), but there are limits to how > much of the underlying complexity we can effectively hide without > being misleading. Let's just define a Unicode string to be a sequence of code points and let libraries deal with the rest. Ok, methods like lower() should consider characters, but indexing/slicing should refer to code points. Same for '=='; we can have a library that compares by applying (or assuming?) certain normalizations. Tom C tells me that case-less comparison cannot use a.lower() == b.lower(); fine, we can add that operation to the library too. But this exceeds the scope of PEP 393, right? -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
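A concrete instance of the case-less comparison problem Tom C raises: lower() is not enough because Unicode full case folding maps some characters to multi-character sequences; str.casefold(), which applies that folding, was only added later (in 3.3). A small illustration:
>>> 'Straße'.lower() == 'STRASSE'.lower()
False
>>> 'Straße'.casefold() == 'STRASSE'.casefold()   # full case folding: 'ß' -> 'ss'
True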
Re: [Python-Dev] PEP 393 Summer of Code Project
On Thu, Aug 25, 2011 at 1:11 PM, Guido van Rossum wrote: >> With narrow builds, code units can currently come into play >> internally, but with PEP 393 everything internal will be working >> directly with code points. Normalisation, combining characters and >> bidi issues may still affect the correctness of unicode comparison and >> slicing (and other text manipulation), but there are limits to how >> much of the underlying complexity we can effectively hide without >> being misleading. > > Let's just define a Unicode string to be a sequence of code points and > let libraries deal with the rest. Ok, methods like lower() should > consider characters, but indexing/slicing should refer to code points. > Same for '=='; we can have a library that compares by applying (or > assuming?) certain normalizations. Tom C tells me that case-less > comparison cannot use a.lower() == b.lower(); fine, we can add that > operation to the library too. But this exceeds the scope of PEP 393, > right? Yep, I was agreeing with you on this point - I think you're right that if we provide a solid code point based core Unicode type (perhaps with some character based methods), then library support can fill the gap between handling code points and handling characters. In particular, a unicode character based string type would be significantly easier to write in Python than it would be in C (after skimming Tom's bug report at http://bugs.python.org/issue12729, I better understand the motivation and desire for that kind of interface and it sounds like Terry's prototype is along those lines). Once those mappings are thrashed out outside the core, then there may be something to incorporate directly around the 3.4 timeframe (or potentially even in 3.3, since it should already be possible to develop such a wrapper based on UCS4 builds of 3.2) However, there may an important distinction to be made on the Python-the-language vs CPython-the-implementation front: is another implementation (e.g. PyPy) *allowed* to implement character based indexing instead of code point based for 2.x unicode/3.x str type? Or is the code point indexing part of the language spec, and any character based indexing needs to be provided via a separate type or module? Regards, Nick. -- Nick Coghlan | [email protected] | Brisbane, Australia ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
Guido van Rossum writes: > On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull > wrote: > > Strings contain Unicode code units, which for most purposes can be > > treated as Unicode characters. However, even as "simple" an > > operation as "s1[0] == s2[0]" cannot be relied upon to give > > Unicode-conforming results. > > > > The second sentence remains true under PEP 393. > > Really? If strings contain code units, that expression compares code > units. That's true out of context, but in context it's "which for most purposes can be treated as Unicode characters", and this is what Terry is concerned with, as well. > What is non-conforming about comparing two code points? Unicode conformance means treating characters correctly. In particular, s1 and s2 might be NFC and NFD forms of the same string with a combining character at s2[1], or s1[1] and s[2] might be a non-combining character and a combining character respectively. > Seriously, what does Unicode-conforming mean here? Chapter 3, all verses. Here, specifically C6, p. 60. One would have to define the process executing "s1[0] == s2[0]" to be sure that even in the cases cited in the previous paragraph non-conformance is occurring, but one example of a process where that is non-conforming (without additional code to check for trailing combining characters) is in comparison of Vietnamese filenames generated on a Mac vs. those generated on a Linux host. > > No, you're not. You are claiming an isomorphism, which Unicode goes > > to great trouble to avoid. > > I don't know that we will be able to educate our users to the point > where they will use code unit, code point, character, glyph, character > set, encoding, and other technical terms correctly. Sure. I got it wrong myself earlier. I think that the right thing to do is to provide a conformant implementation of Unicode text in the stdlib (a long run goal, see below), and call that "Unicode", while we call strings "strings". > Now I am happy to admit that for many Unicode issues the level at > which we have currently defined things (code units, I think -- the > thingies that encodings are made of) is confusing, and it would be > better to switch to the others (code points, I think). Yes, and AFAICT (I'm better at reading standards than I am at reading Python implementation) PEP 393 allows that. > But characters are right out. +1 > It is not so easy to change expectations about O(1) vs. O(N) behavior > of indexing however. IMO we shouldn't try and hence we're stuck with > operations defined in terms of code thingies instead of (mostly > mythical) characters. Well, O(N) is not really the question. It's really O(log N), as Terry says. Is that out, too? I can verify that it's possible to do it in practice in the long term. In my experience with Emacs, even with 250 MB files, O(log N) mostly gives acceptable performance in an interactive editor, as well as many scripted textual applications. The problems that I see are (1) It's very easy to write algorithms that would be O(N) for a true array, but then become O(N log N) or worse (and the coefficient on the O(log N) algorithm is way higher to start). I guess this would kill the idea, but. (2) Maintenance is fragile; it's easy to break the necessary caches with feature additions and bug fixes. (However, I don't think this would be as big a problem for Python, due to its more disciplined process, as it has been for XEmacs.) You might think space for the caches would be a problem, but that has turned out not to be the case for Emacsen. 
> Let's take small steps. Do the evolutionary thing. Let's get things > right so users won't have to worry about code points vs. code units > any more. A conforming library for all things at the character level > can be developed later, once we understand things better at that level > (again, most developers don't even understand most of the subtleties, > so I claim we're not ready). I don't think anybody does. That's one reason there's a new version of Unicode every few years. > This I agree with (though if you were referring to me with > "leadership" I consider myself woefully underinformed about Unicode > subtleties). MvL and MAL are not, however, and there are plenty of others who make contributions -- in an orderly fashion. > I also suspect that Unicode "conformance" (however defined) is more > part of a political battle than an actual necessity. I'd much > rather have us fix Tom Christiansen's specific bugs than chase the > elusive "standard conforming". Well, I would advocate specifying which parts of the standard we target and which not (for any given version). The goal of full "Chapter 3" conformance should be left up to a library on PyPI for the nonce IMO. I agree that fixing specific bugs should be given precedence over "conformance chasing," but implementation should conform to the appropriate part
Re: [Python-Dev] PEP 393 review
Victor Stinner, 25.08.2011 00:29:
With this PEP, the unicode object overhead grows to 10 pointer-sized
words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
Does it have any adverse effects?
For pure ASCII, it might be possible to use a shorter struct:
typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_hash_t hash;
int state;
Py_ssize_t wstr_length;
wchar_t *wstr;
/* no more utf8_length, utf8, str */
/* followed by ascii data */
} _PyASCIIObject;
(-2 pointer -1 ssize_t: 56 bytes)
=> "a" is 58 bytes (with utf8 for free, without wchar_t)
For objects allocated with the new API, we can use a shorter struct:
typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_hash_t hash;
int state;
Py_ssize_t wstr_length;
wchar_t *wstr;
Py_ssize_t utf8_length;
char *utf8;
/* no more str pointer */
/* followed by latin1/ucs2/ucs4 data */
} _PyNewUnicodeObject;
(-1 pointer: 72 bytes)
=> "é" is 74 bytes (without utf8 / wchar_t)
For the legacy API:
typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_hash_t hash;
int state;
Py_ssize_t wstr_length;
wchar_t *wstr;
Py_ssize_t utf8_length;
char *utf8;
void *str;
} _PyLegacyUnicodeObject;
(same size: 80 bytes)
=> "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)
The current struct:
typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_UNICODE *str;
Py_hash_t hash;
int state;
PyObject *defenc;
} PyUnicodeObject;
=> "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is
wchar_t)
... but the code (maybe only the macros?) and debugging will be more complex.
That's an interesting idea. However, it's not required to do this as part
of the PEP 393 implementation. This can be added later on if the need
evidently arises in general practice.
Also, there is always the possibility of simply interning very short
strings in order to avoid multiplying them in memory. Long strings don't
suffer from this, as the data size quickly dominates. User code that works
with a lot of short strings would likely do the same.
BTW, I would expect that many short strings either go away as quickly as
they appeared (e.g. in a parser) or were brought in as literals and are
therefore interned anyway. That's just one reason why I suggest waiting for
proof of inefficiency in the real world (and, obviously, testing your own
code against this as early as possible).
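As a minimal sketch of the interning suggestion above, using the existing C
API (this is a caller-side opt-in, not a proposal to intern automatically):

PyObject *s = PyUnicode_FromString("id");   /* short, frequently repeated key */
if (s != NULL) {
    /* May replace s with the already-interned copy, so equal short strings
       share one allocation instead of multiplying in memory. */
    PyUnicode_InternInPlace(&s);
}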
Will the format codes returning a Py_UNICODE pointer with
PyArg_ParseTuple be deprecated?
Because Python 2.x is still dominant and it's already hard enough to port C
modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).
Well, it will be quite inefficient in future CPython versions, so I think
if it's not officially deprecated at some point, it will deprecate itself
for efficiency reasons. Better make it clear that it's worth investing in
better performance here.
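The porting direction implied here, sketched minimally (the function name is
invented): instead of asking PyArg_ParseTuple for a Py_UNICODE* via the "u"
format code, take the string object itself with "U" and go through the
PEP 393 accessors, so no wide-character buffer has to be materialized.

static PyObject *
example_strlen(PyObject *self, PyObject *args)
{
    PyObject *text;
    /* "U" delivers the str object itself (a borrowed reference) instead of
       a Py_UNICODE buffer. */
    if (!PyArg_ParseTuple(args, "U", &text))
        return NULL;
    return PyLong_FromSsize_t(PyUnicode_GET_LENGTH(text));
}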
Do you think the wstr representation could be removed in some future
version of Python?
Conversion to wchar_t* is common, especially on Windows.
That's an issue. However, I cannot say how common this really is in
practice. Surely depends on the specific code, right? How common is it in
core CPython?
But I don't know if
we *have to* cache the result. Is it cached by the way? Or is wstr only used
when a string is created from Py_UNICODE?
If it's so common on Windows, maybe it should only be cached there?
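A hypothetical sketch of the "cache it only on Windows" idea; the helper name
is invented, and whether a cached wstr field would really be kept on Windows
is exactly the open question above. Off Windows, callers would get a fresh
buffer that they must release with PyMem_Free().

#include <Python.h>

static wchar_t *
get_wide_form(PyObject *unicode, Py_ssize_t *size, int *must_free)
{
#ifdef MS_WINDOWS
    /* Py_UNICODE is wchar_t on Windows, so a cached Py_UNICODE buffer could
       double as the wchar_t cache (assumption for illustration). */
    *must_free = 0;
    return (wchar_t *)PyUnicode_AsUnicodeAndSize(unicode, size);
#else
    /* Elsewhere, convert on demand; the caller frees with PyMem_Free(). */
    *must_free = 1;
    return PyUnicode_AsWideCharString(unicode, size);
#endif
}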
Stefan
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 review
"Martin v. Löwis", 24.08.2011 20:15: Guido has agreed to eventually pronounce on PEP 393. Before that can happen, I'd like to collect feedback on it. There have been a number of voice supporting the PEP in principle Absolutely. - conditions you would like to pose on the implementation before acceptance. I'll see which of these can be resolved, and list the ones that remain open. Just repeating here that I'd like to see the buffer void* changed into a union of pointers that state the exact layout type. IMHO, that would clarify the implementation and make it clearer that it's correct to access the data buffer as a flat array. (Obviously, code that does that is subject to future changes, that's why there are macros.) Stefan ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
On 8/24/2011 7:29 PM, Guido van Rossum wrote:
(Hey, I feel a QOTW coming. "Standards? We don't need no stinkin'
standards." http://en.wikipedia.org/wiki/Stinking_badges :-)
Which deserves an appropriate, follow-on, misquote: Guido says the
Unicode standard stinks. ˚͜˚ <- and a Unicode smiley to go with it!
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 393 Summer of Code Project
Nick Coghlan writes:
> GvR writes:
> > Let's just define a Unicode string to be a sequence of code points and
> > let libraries deal with the rest. Ok, methods like lower() should
> > consider characters, but indexing/slicing should refer to code points.
> > Same for '=='; we can have a library that compares by applying (or
> > assuming?) certain normalizations. Tom C tells me that case-less
> > comparison cannot use a.lower() == b.lower(); fine, we can add that
> > operation to the library too. But this exceeds the scope of PEP 393,
> > right?
>
> Yep, I was agreeing with you on this point - I think you're right that
> if we provide a solid code point based core Unicode type (perhaps with
> some character based methods), then library support can fill the gap
> between handling code points and handling characters.
+1
I don't really see an alternative to this approach. The
underlying array has to be exposed because there are too many
applications that can take advantage of it, and analysis of decomposed
characters requires it.
Making that array be an array of code points is a really good idea,
and Python already has that in the UCS-4 build. PEP 393 is "just" a
space optimization that allows getting rid of the narrow build, with
all its wartiness.
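As a minimal sketch of what "an array of code points" looks like through the
accessors the PEP describes (the function itself is invented for
illustration), each read yields a full code point regardless of whether the
underlying layout uses 1, 2 or 4 bytes per character:

#include <Python.h>

static Py_ssize_t
count_non_bmp(PyObject *s)
{
    int kind = PyUnicode_KIND(s);
    void *data = PyUnicode_DATA(s);
    Py_ssize_t i, n = PyUnicode_GET_LENGTH(s);
    Py_ssize_t count = 0;

    for (i = 0; i < n; i++) {
        /* PyUnicode_READ returns a Py_UCS4 code point for any layout. */
        if (PyUnicode_READ(kind, data, i) > 0xFFFF)
            count++;
    }
    return count;
}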
> something to incorporate directly around the 3.4 timeframe (or
> potentially even in 3.3, since it should already be possible to
> develop such a wrapper based on UCS4 builds of 3.2)
I agree that it's possible, but I estimate that it's not feasible for
3.3 because we don't yet know the requirements. This one really needs
to ferment and mature in PyPI for a while because we just don't know
how far the scope of user needs is going to extend. Bidi is a
mudball[1], confusable character indexes sound like a cool idea for
the web and email, but is anybody really going to use them? Etc.
> However, there may be an important distinction to be made on the
> Python-the-language vs CPython-the-implementation front: is another
> implementation (e.g. PyPy) *allowed* to implement character based
> indexing instead of code point based for 2.x unicode/3.x str type? Or
> is the code point indexing part of the language spec, and any
> character based indexing needs to be provided via a separate type or
> module?
+1 for language spec. Remember, there are cases in Unicode where
you'd like to access base characters and the like. So you need to be
able to get at individual code points in an NFD string. You shouldn't
need to use different code for that in different implementations of
Python.
Footnotes:
[1] Sure, we can implement the UAX#9 bidi algorithm, but it's not
good enough by itself: something as simple as
"File name (default {0}): ".format(name)
can produce disconcerting results if the whole resulting string is
treated by the UBA. Specifically, using the usual convention of
uppercase letters being an RTL script, name = "ABCD" will result in
the prompt:
File name (default :(DCBA _
(where _ denotes the position of the insertion cursor). The Hebrew
speakers on emacs-devel agreed that an example using a real Hebrew
string didn't look right to them, either.
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
