Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Stefan Behnel

Torsten Becker, 24.08.2011 04:41:

Also, common, now simple, checks for "unicode->str == NULL" would look
more ambiguous with a union ("unicode->str.latin1 == NULL").


You could just add yet another field "any", i.e.

union {
   unsigned char* latin1;
   Py_UCS2* ucs2;
   Py_UCS4* ucs4;
   void* any;
} str;

That way, the above test becomes

if (!unicode->str.any)

or

if (unicode->str.any == NULL)

Or maybe even call it "initialised" to match the intended purpose:

if (!unicode->str.initialised)

That being said, I don't mind "unicode->str.latin1 == NULL" either, given 
that it will (as mentioned by others) be hidden behind a macro most of the 
time anyway.
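
As a standalone illustration (hypothetical names and a stripped-down struct,
not the actual PEP 393 layout), the extra "any" member keeps the NULL test
unambiguous while a macro hides the union from callers:

/* Sketch only: made-up names, not the PEP 393 structure. */
#include <assert.h>
#include <stdint.h>

typedef uint16_t sketch_ucs2;
typedef uint32_t sketch_ucs4;

typedef struct {
    union {
        unsigned char *latin1;
        sketch_ucs2 *ucs2;
        sketch_ucs4 *ucs4;
        void *any;          /* used only for "is the buffer set?" tests */
    } str;
} sketch_unicode;

#define SKETCH_STR_IS_SET(u) ((u)->str.any != NULL)

int main(void)
{
    sketch_unicode u;
    u.str.any = NULL;
    assert(!SKETCH_STR_IS_SET(&u));
    u.str.latin1 = (unsigned char *)"abc";
    assert(SKETCH_STR_IS_SET(&u));
    return 0;
}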


Stefan



Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-24 Thread Stephen J. Turnbull
Nick Coghlan writes:

 > Since I tend to use the one word 'filesystem' form myself (ditto for
 > 'filename'), I'm +1 for FilesystemError, but I'm only -0 for
 > FileSystemError (so I expect that will be the option chosen, given
 > other responses).

I slightly prefer FilesystemError because it parses unambiguously.
Cf. FileSystemError vs FileUserError.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Glenn Linderman

On 8/23/2011 5:46 PM, Terry Reedy wrote:

On 8/23/2011 6:20 AM, "Martin v. Löwis" wrote:

Am 23.08.2011 11:46, schrieb Xavier Morel:
Mostly ascii is pretty common for western-european languages (French, for
instance, is probably 90 to 95% ascii). It's also a risk in english, when
the writer "correctly" spells foreign words (résumé and the like).


I know - I still question whether it is "extremely common" (so much as
to justify a special case). I.e. on what application with what dataset
would you gain what speedup, at the expense of what amount of extra
lines, and potential slow-down for other datasets?

[snip]

In the PEP 393 approach, if the string has a two-byte representation,
each character needs to be widened to two bytes, and likewise for four
bytes. So three separate copies of the unrolled loop would be needed,
one for each target size.


I fully support the declared purpose of the PEP, which I understand to
be to have a full, correct Unicode implementation on all new Python
releases without paying unnecessary space (and consequent time)
penalties. I think the erroneous length, iteration, indexing, and
slicing for strings with non-BMP chars in narrow builds needs to be
fixed for future versions. I think we should at least consider
alternatives to the PEP 393 solution of doubling or quadrupling space if
even one char needs it.


In utf16.py, attached to http://bugs.python.org/issue12729
I propose for consideration a prototype of a different solution to the
'mostly BMP chars, few non-BMP chars' case. Rather than expand every
character from 2 bytes to 4, attach an array cpdex of character (i.e.
code point, not code unit) indexes. Then for indexing and slicing, the
correction is simple, simpler than I first expected:

  code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
where code-unit-index is the adjusted index into the full underlying 
double-byte array. This adds a time penalty of log2(len(cpdex)), but 
avoids most of the space penalty and the consequent time penalty of 
moving more bytes around and increasing cache misses.
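
For concreteness, here is a rough standalone C rendering of that correction
(the helper names are made up; this is not the code from utf16.py):

#include <assert.h>
#include <stddef.h>

/* Equivalent of Python's bisect.bisect_left() for a sorted size_t array. */
static size_t bisect_left(const size_t *a, size_t n, size_t x)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < x)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* code-unit-index = char-index + bisect_left(cpdex, char-index), where
   cpdex holds the code point indexes of the non-BMP characters. */
static size_t code_unit_index(size_t char_index,
                              const size_t *cpdex, size_t ncpdex)
{
    return char_index + bisect_left(cpdex, ncpdex, char_index);
}

int main(void)
{
    /* "a\U00010000b\U00010001c": the non-BMP characters sit at code point
       indexes 1 and 3, and each occupies two UTF-16 code units. */
    const size_t cpdex[] = {1, 3};
    assert(code_unit_index(0, cpdex, 2) == 0);  /* 'a' */
    assert(code_unit_index(2, cpdex, 2) == 3);  /* 'b' */
    assert(code_unit_index(4, cpdex, 2) == 6);  /* 'c' */
    return 0;
}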


I believe the same idea would work for utf8 and the mostly-ascii case.
The main difference is that non-ascii chars have various byte sizes
rather than the 1 extra double-byte of non-BMP chars in UCS2 builds.
So the offset correction would not simply be the bisect-left return
but would require another lookup:

  byte-index = char-index + offsets[bisect-left(cpdex, char-index)]

If possible, I would have the with-index-array versions be separate 
subtypes, as in utf16.py. I believe either index-array implementation 
might benefit from a subtype for single multi-unit chars, as a single 
non-ASCII or non-BMP char does not need an auxiliary [0] array and a 
senseless lookup therein but does need its length fixed at 1 instead 
of the number of base array units.


So am I correctly reading between the lines when, after reading this
thread so far and the complete issue discussion so far, I see a
PEP 393 revision or replacement that has the following characteristics:


1) Narrow builds are dropped.  The conceptual idea of PEP 393 eliminates 
the need for narrow builds, as the internal string data structures 
adjust to the actuality of the data.  If you want a narrow build, just 
don't use code points > 65535.


2) There are more, or different, internal kinds of strings, which affect 
the processing patterns.  Here is an enumeration of the ones I can think 
of, as complete as possible, with recognition that benchmarking and 
clever algorithms may eliminate the need for some of them.


a) all ASCII
b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This 
kind may not be able to support a "mostly" variation, and may be no more 
efficient than case b).  But it might also be popular in parts of Europe 
:)  And appropriate benchmarks may discover whether or not it has worth.

c) mostly ASCII (utf8) with clever indexing/caching to be efficient
d) UTF-8 with clever indexing/caching to be efficient
e) 16-bit codepoints
f) UTF-16 with clever indexing/caching to be efficient
g) 32-bit codepoints
h) UTF-32

When instantiating a str, a new parameter or subtype would restrict the 
implementation to using only a), b), d), f), and h) when fully 
conformant Unicode behavior is desired.  No lone surrogates, no out of 
range code points, no illegal codepoints. A default str would prefer a), 
b), c), e), and g) for efficiency and flexibility.


When manipulations outside of Unicode are necessary [Windows seems to 
use e) for example, suffering from the same sorts of backward 
compatibility problems as Python, in some ways], the default str type 
would permit them, using e) and g) kinds of representations.  Although 
the surrogate escape codec only uses prefix surrogates (or is it only 
suffix ones?) which would never match up, note that a conversion from 
16-bit codepoints to other formats may produce matches between the 
results of the surrogate escape 

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Victor Stinner

Le 24/08/2011 04:41, Torsten Becker a écrit :

On Tue, Aug 23, 2011 at 10:08, Antoine Pitrou  wrote:

Macros are useful to shield the abstraction from the implementation. If
you access the members directly, and the unicode object is represented
differently in some future version of Python (say e.g. with tagged
pointers), your code doesn't compile anymore.


I agree with Antoine, from the experience of porting C code from 3.2
to the PEP 393 unicode API, the additional encapsulation by macros
made it much easier to change the implementation of what is a field,
what is a field's actual name, and what needs to be calculated through
a function.

So, I would like to keep primary access as a macro but I see the point
that it would make the struct clearer to access and I would not mind
changing the struct to use a union.  But then most access currently is
through macros so I am not sure how much benefit the union would bring
as it mostly complicates the struct definition.


A union helps debugging in gdb: you don't have to cast manually to
unsigned char*/Py_UCS2*/Py_UCS4*.



Also, common, now simple, checks for "unicode->str == NULL" would look
more ambiguous with a union ("unicode->str.latin1 == NULL").


We can rename "str" to something else, to "data" for example.

Victor


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Victor Stinner

Le 24/08/2011 06:59, Scott Dial a écrit :

On 8/23/2011 6:38 PM, Victor Stinner wrote:

Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit :

- You could try to run stringbench, which can be found at
   http://svn.python.org/projects/sandbox/trunk/stringbench (*)
   and there's iobench (the text mode benchmarks) in the Tools/iobench
   directory.


Some raw numbers.

stringbench:
"147.07 203.07 72.4 TOTAL" for the PEP 393
"146.81 140.39 104.6 TOTAL" for default
=>  PEP is 45% slower


I ran the same benchmark and couldn't make a distinction in performance
between them:


Hum, are you sure that you used the PEP 393? Make sure that you are
using the pep-393 branch! I also started my benchmark on the wrong
branch :-)


Victor


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Victor Stinner

Le 24/08/2011 04:41, Torsten Becker a écrit :

On Tue, Aug 23, 2011 at 18:27, Victor Stinner
  wrote:

I posted a patch to re-add it:
http://bugs.python.org/issue12819#msg142867


Thank you for the patch!  Note that this patch adds the fast path only
to the helper function which determines the length of the string and
the maximum character.  The decoding part is still without a fast path
for ASCII runs.


Ah? If utf8_max_char_size_and_has_errors() returns no error and
maxchar=127: memcpy() is used. You mean that memcpy() is too slow? :-)


maxchar = utf8_max_char_size_and_has_errors(s, size, &unicode_size,
                                            &has_errors);
if (has_errors) {
    ...
}
else {
    unicode = (PyUnicodeObject *)PyUnicode_New(unicode_size, maxchar);
    if (!unicode) return NULL;
    /* When the string is ASCII only, just use memcpy and return. */
    if (maxchar < 128) {
        assert(unicode_size == size);
        Py_MEMCPY(PyUnicode_1BYTE_DATA(unicode), s, unicode_size);
        return (PyObject *)unicode;
    }
    ...
}

But yes, my patch only optimizes ASCII-only strings, not "mostly-ASCII"
strings (e.g. 100 ASCII + 1 latin1 character). It can be optimized
later. I didn't benchmark my patch.


Victor


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Martin v. Löwis
> So am I correctly reading between the lines when, after reading this
> thread so far, and the complete issue discussion so far, that I see a
> PEP 393 revision or replacement that has the following characteristics:
> 
> 1) Narrow builds are dropped.

PEP 393 already drops narrow builds.

> 2) There are more, or different, internal kinds of strings, which affect
> the processing patterns.

This is the basic idea of PEP 393.

> a) all ASCII
> b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This
> kind may not be able to support a "mostly" variation, and may be no more
> efficient than case b).  But it might also be popular in parts of Europe

These two cases are already in PEP 393.

> c) mostly ASCII (utf8) with clever indexing/caching to be efficient
> d) UTF-8 with clever indexing/caching to be efficient

I see neither a need nor a means to consider these.

> e) 16-bit codepoints

These are in PEP 393.

> f) UTF-16 with clever indexing/caching to be efficient

Again, -1.

> g) 32-bit codepoints

This is in PEP 393.

> h) UTF-32

What's that, as opposed to g)?

I'm not open to revise PEP 393 in the direction of adding more
representations.

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Stephen J. Turnbull
Terry Reedy writes:

 > The current UCS2 Unicode string implementation, by design, quickly gives 
 > WRONG answers for len(), iteration, indexing, and slicing if a string 
 > contains any non-BMP (surrogate pair) Unicode characters. That may have 
 > been excusable when there essentially were no such extended chars, and 
 > the few there were were almost never used.

Well, no, it gives the right answer according to the design.  unicode
objects do not contain character strings.  By design, they contain
code point strings.  Guido has made that absolutely clear on a number
of occasions.  And the reasons have very little to do with lack of
non-BMP characters to trip up the implementation.  Changing those
semantics should have been done before the release of Python 3.

It is not clear to me that it is a good idea to try to decide on "the"
correct implementation of Unicode strings in Python even today.  There
are a number of approaches that I can think of.

1.  The "too bad if you can't take a joke" approach: do nothing and
recommend UTF-32 to those who want len() to DTRT.
2.  The "slope is slippery" approach: Implement UTF-16 objects as
built-ins, and then try to fend off requests for correct treatment
of unnormalized composed characters, normalization, compatibility
substitutions, bidi, etc etc.
3.  The "are we not hackers?" approach: Implement a transform that
maps characters that are not represented by a single code point
into Unicode private space, and then see if anybody really needs
more than 6400 non-BMP characters.  (Note that this would
generalize to composed characters that don't have a one-code-point
NFC form and similar non-standardized cases that nonstandard users
might want handled.)
4.  The "42" approach: sadly, I can't think deeply enough to explain it.

There are probably others.

It's true that Python is going to need good libraries to provide
correct handling of Unicode strings (as opposed to unicode objects).
But it's not clear to me given the wide variety of implementations I
can imagine that there will be one best implementation, let alone
which ones are good and Pythonic, and which not so.



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Victor Stinner

Le 24/08/2011 04:56, Torsten Becker a écrit :

On Tue, Aug 23, 2011 at 18:56, Victor Stinner
  wrote:

kind=0 is used and public, it's PyUnicode_WCHAR_KIND. Is it still
necessary? It looks to be only used in PyUnicode_DecodeUnicodeEscape().


If it can be removed, it would be nice to have kind in [0; 2] instead of kind
in [1; 2], to be able to have a list (of 3 items) =>  callback or label.


It is also used in PyUnicode_DecodeUTF8Stateful() and there might be
some cases which I missed converting checks for 0 when I introduced
the macro.  The question was more if this should be written as 0 or as
a named constant.  I preferred the named constant for readability.

An alternative would be to have kind values be the same as the number
of bytes for the string representation so it would be 0 (wstr), 1
(1-byte), 2 (2-byte), or 4 (4-byte).


Please don't do that: it's more common to need contiguous arrays (for a 
jump table/callback list) than having to know the character size. You 
can use an array giving the character size: CHARACTER_SIZE[kind] which 
is the array {0, 1, 2, 4} (or maybe sizeof(wchar_t) instead of 0 ?).
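
A tiny standalone sketch of that suggestion, with made-up names rather than
the branch's actual definitions:

#include <stddef.h>
#include <stdio.h>

/* Sketch only: contiguous kind values keep callback arrays and jump
   tables simple; the unit size is recovered through a lookup. */
enum sketch_kind { KIND_WSTR = 0, KIND_1BYTE, KIND_2BYTE, KIND_4BYTE };

static const size_t CHARACTER_SIZE[] = { sizeof(wchar_t), 1, 2, 4 };

int main(void)
{
    printf("%zu\n", CHARACTER_SIZE[KIND_2BYTE]);    /* prints 2 */
    return 0;
}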



I think the value for wstr/uninitialized/reserved should not be
removed.  The wstr representation is still used in the error case in
the utf8 decoder because these strings can be resized.


In Python, you can resize an object if it has only one reference. Why is 
it not possible in your branch?


Oh, I missed the UTF-8 decoder because you wrote "kind = 0": please, use 
PyUnicode_WCHAR_KIND instead!


I don't like "reserved" value, especially if its value is 0, the first 
value. See Microsoft file formats: they waste a lot of space because 
most fields are reserved, and 10 years later, these fields are still 
unused. Can't we add the value 4 when we will need a new kind?



Also having one
designated value for "uninitialized" limits comparisons in the
affected functions to the kind value, otherwise they would need to
check the str field for NULL to determine in which buffer to write a
character.


I have to read the code more carefully, I don't know this 
"uninitialized" state.


For kind=0: "wstr" means that str is NULL but wstr is set? I didn't 
understand that str can be NULL for an initialized string. I should read 
the PEP again :-)



I suppose that compilers prefer a switch with all cases defined, 0 a first item
and contiguous values. We may need an enum.


During the Summer of Code, Martin and I did an experiment with GCC and
it did not seem to produce a jump table as an optimization for three
cases but generated comparison instructions anyway.


You mean with a switch with a case for each possible value? I don't 
think that GCC knows that all cases are defined if you don't use an enum.



I am not sure how much we should optimize for potential compiler
optimizations here.

Oh, it was just a suggestion. Sure, it's not the best moment to care of 
micro-optimizations.


Victor


Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-24 Thread Cameron Simpson
On 24Aug2011 12:31, Nick Coghlan  wrote:
| On Wed, Aug 24, 2011 at 5:19 AM, Steven D'Aprano  wrote:
| > Antoine Pitrou wrote:
| >> When reviewing the PEP 3151 implementation (*), Ezio commented that
| >> "FileSystemError" looks a bit strange and that "FilesystemError" would
| >> be a better spelling. What is your opinion?
| >
| > It's a file system (two words), not filesystem (not in any dictionary or
| > spell checker I've ever used).
| 
| I rarely find spell checkers to be useful sources of data on correct
| spelling of technical jargon (and the computing usage of the term
| 'filesystem' definitely qualifies as jargon).
| 
| > (Nor do we write filingsystem, governmentsystem, politicalsystem or
| > schoolsystem. This is English, not German.)
| 
| Personally, I think 'filesystem' is a portmanteau in the process of
| coming into existence (as evidenced by usage like 'FHS' standing for
| 'Filesystem Hierarchy Standard'). However, the two word form is still
| useful at times, particularly for disambiguation of acronyms (as
| evidenced by usage like 'NFS' and 'GFS' for 'Network File System' and
| 'Google File System').

Funny, I thought NFS stood for Not a File System :-)

| Since I tend to use the one word 'filesystem' form myself (ditto for
| 'filename'), I'm +1 for FilesystemError, but I'm only -0 for
| FileSystemError (so I expect that will be the option chosen, given
| other responses).

I also use "filesystem" as a one word piece of jargon, but I am
persuaded by the language arguments. So I'm +1 for FileSystemError.

Cheers,
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/

Bolts get me through times of no courage better than courage gets me
through times of no bolts!
- Eric Hirst 


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Glenn Linderman

On 8/24/2011 1:18 AM, "Martin v. Löwis" wrote:

So am I correctly reading between the lines when, after reading this
thread so far, and the complete issue discussion so far, that I see a
PEP 393 revision or replacement that has the following characteristics:

1) Narrow builds are dropped.

PEP 393 already drops narrow builds.


I'd forgotten that.




2) There are more, or different, internal kinds of strings, which affect
the processing patterns.

This is the basic idea of PEP 393.


Agreed.



a) all ASCII
b) latin-1 (8-bit codepoints, the first 256 Unicode codepoints) This
kind may not be able to support a "mostly" variation, and may be no more
efficient than case b).  But it might also be popular in parts of Europe

These two cases are already in PEP 393.

Sure.  Wanted to enumerate all, rather than just add-ons.


c) mostly ASCII (utf8) with clever indexing/caching to be efficient
d) UTF-8 with clever indexing/caching to be efficient

I see neither a need nor a means to consider these.


The discussion about "mostly ASCII" strings seems convincing that there 
could be a significant space savings if such were implemented.



e) 16-bit codepoints

These are in PEP 393.


f) UTF-16 with clever indexing/caching to be efficient

Again, -1.


This is probably the one I would pick as least likely to be useful if 
the rest were implemented.



g) 32-bit codepoints

This is in PEP 393.


h) UTF-32

What's that, as opposed to g)?


g) would permit codes greater than U+10FFFF and would permit the illegal
codepoints and lone surrogates.  h) would be strict Unicode
conformance.  Sorry that the 4 paragraphs of explanation that you didn't
quote didn't make that clear.



I'm not open to revise PEP 393 in the direction of adding more
representations.


It's your PEP.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Scott Dial
On 8/24/2011 4:11 AM, Victor Stinner wrote:
> Le 24/08/2011 06:59, Scott Dial a écrit :
>> On 8/23/2011 6:38 PM, Victor Stinner wrote:
>>> Le mardi 23 août 2011 00:14:40, Antoine Pitrou a écrit :
 - You could try to run stringbench, which can be found at
http://svn.python.org/projects/sandbox/trunk/stringbench (*)
and there's iobench (the text mode benchmarks) in the Tools/iobench
directory.
>>>
>>> Some raw numbers.
>>>
>>> stringbench:
>>> "147.07 203.07 72.4 TOTAL" for the PEP 393
>>> "146.81 140.39 104.6 TOTAL" for default
>>> =>  PEP is 45% slower
>>
>> I ran the same benchmark and couldn't make a distinction in performance
>> between them:
> 
> Hum, are you sure that you used the PEP 393? Make sure that you are
> using the pep-393 branch! I also started my benchmark on the wrong
> branch :-)

You are right. I used the "Get Source" link on bitbucket to save pulling
the whole clone, but the "Get Source" link seems to be whatever branch
has the latest revision (maybe?) even if you switch branches on the
webpage. To correct my previous post:

cpython.txt
183.26  177.97  103.0   TOTAL
cpython-wide-unicode.txt
181.27  195.58  92.7TOTAL
pep-393.txt
181.40  270.34  67.1TOTAL

And,

cpython.txt
real    0m32.493s
cpython-wide-unicode.txt
real    0m33.489s
pep-393.txt
real    0m36.206s

-- 
Scott Dial
[email protected]


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Martin v. Löwis
Am 24.08.2011 10:17, schrieb Victor Stinner:
> Le 24/08/2011 04:41, Torsten Becker a écrit :
>> On Tue, Aug 23, 2011 at 18:27, Victor Stinner
>>   wrote:
>>> I posted a patch to re-add it:
>>> http://bugs.python.org/issue12819#msg142867
>>
>> Thank you for the patch!  Note that this patch adds the fast path only
>> to the helper function which determines the length of the string and
>> the maximum character.  The decoding part is still without a fast path
>> for ASCII runs.
> 
> Ah? If utf8_max_char_size_and_has_errors() returns no error and
> maxchar=127: memcpy() is used. You mean that memcpy() is too slow? :-)

No: the pure-ASCII case is already optimized with memcpy. It's the
mostly-ASCII case that is not optimized anymore in this PEP 393
implementation (the one with "ASCII runs" instead of "pure ASCII").

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Terry Reedy

On 8/24/2011 4:22 AM, Stephen J. Turnbull wrote:

Terry Reedy writes:

  >  The current UCS2 Unicode string implementation, by design, quickly gives
  >  WRONG answers for len(), iteration, indexing, and slicing if a string
  >  contains any non-BMP (surrogate pair) Unicode characters. That may have
  >  been excusable when there essentially were no such extended chars, and
  >  the few there were were almost never used.

Well, no, it gives the right answer according to the design.  unicode
objects do not contain character strings.


Excuse me for believing the fine 3.2 manual that says
"Strings contain Unicode characters." (And to a naive reader, that 
implies that string iteration and indexing should produce Unicode 
characters.)



 By design, they contain code point strings.


For the purpose of my sentence, the same thing in that code points 
correspond to characters, where 'character' includes ascii control 
'characters' and unicode analogs. The problem is that on narrow builds 
strings are NOT code point sequences. They are 2-byte code *unit* 
sequences. Single non-BMP code points are seen as 2 code units and hence 
given a length of 2, not 1. Strings iterate, index, and slice by 2-byte 
code units, not by code points.


Python floats try to follow the IEEE standard as interpreted for Python 
(Python has its software exceptions rather than signalling versus 
non-signalling hardware signals). Python decimals slavishly follow the 
IEEE decimal standard. Python narrow build unicode breaks the standard 
for non-BMP code points and consequently breaks the re module even when 
it works for wide builds. As sys.maxunicode more or less says, only the 
BMP subset is fully supported. Any narrow build string with even 1 
non-BMP char violates the standard.



Guido has made that absolutely clear on a number
of occasions.


It is not clear what you mean, but recently on python-ideas he has 
reiterated that he intends bytes and strings to be conceptually 
different. Bytes are computer-oriented binary arrays; strings are 
supposedly human-oriented character/codepoint arrays. Except they are 
not for non-BMP characters/codepoints. Narrow build unicode is 
effectively an array of two-byte binary units.


And the reasons have very little to do with lack of
non-BMP characters to trip up the implementation.  Changing those
semantics should have been done before the release of Python 3.


The documentation was changed at least a bit for 3.0, and anyway, as 
indicated above, it is easy (especially for new users) to read the docs 
in a way that makes the current behavior buggy. I agree that the 
implementation should have been changed already.


Currently, the meaning of Python code differs on narrow versus wide 
build, and in a way that few users would expect or want. PEP 393 
abolishes narrow builds as we now know them and changes semantics. I was 
answering a complaint about that change. If you do not like the PEP, fine.


My separate proposal in my other post is for an alternative
implementation but with, I presume, pretty much the same visible changes.



It is not clear to me that it is a good idea to try to decide on "the"
correct implementation of Unicode strings in Python even today.


If the implementation is invisible to the Python user, as I believe it 
should be without special introspection, and mostly invisible in the 
C-API except for those who intentionally poke into the details, then the 
implementation can be changed as the consensus on best implementation 
changes.



There are a number of approaches that I can think of.

1.  The "too bad if you can't take a joke" approach: do nothing and
 recommend UTF-32 to those who want len() to DTRT.
2.  The "slope is slippery" approach: Implement UTF-16 objects as
 built-ins, and then try to fend off requests for correct treatment
 of unnormalized composed characters, normalization, compatibility
 substitutions, bidi, etc etc.
3.  The "are we not hackers?" approach: Implement a transform that
 maps characters that are not represented by a single code point
 into Unicode private space, and then see if anybody really needs
 more than 6400 non-BMP characters.  (Note that this would
 generalize to composed characters that don't have a one-code-point
 NFC form and similar non-standardized cases that nonstandard users
 might want handled.)
4.  The "42" approach: sadly, I can't think deeply enough to explain it.

There are probably others.

It's true that Python is going to need good libraries to provide
correct handling of Unicode strings (as opposed to unicode objects).


Given that 3.0 unicode (string) objects are defined as Unicode character 
strings, I do not see the opposition.



But it's not clear to me given the wide variety of implementations I
can imagine that there will be one best implementation, let alone
which ones are good and Pythonic, and which not so.


--
Terry Jan Reedy


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Martin v. Löwis
>> I think the value for wstr/uninitialized/reserved should not be
>> removed.  The wstr representation is still used in the error case in
>> the utf8 decoder because these strings can be resized.
> 
> In Python, you can resize an object if it has only one reference. Why is
> it not possible in your branch?

If you use the new API to create a string (knowing how many characters
you have, and what the maximum character is), the Unicode object is
allocated as a single memory block. It can then not be resized.

If you allocate in the old style (i.e. giving NULL as the data pointer,
and a length), it still creates a second memory block for the
Py_UNICODE[], and allows resizing. When you then call PyUnicode_Ready,
the object gets frozen.

> I don't like "reserved" value, especially if its value is 0, the first
> value. See Microsoft file formats: they waste a lot of space because
> most fields are reserved, and 10 years later, these fields are still
> unused. Can't we add the value 4 when we will need a new kind?

I don't get the analogy, or the relationship with the value 0.
"Reserving" the value 0 is entirely different from reserving a field.
In a field, it wastes space; the value 0 however fills the same space
as the values 1,2,3. It's just used to denote an object where the
str pointer is not filled out yet, i.e. which can still be resized.
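
A toy model of that life cycle (not CPython code, just the invariant
described above: a kind of 0 means the final buffer is not filled out yet,
and only such an object may be resized):

#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Toy invariant: kind 0 <=> str == NULL <=> still resizable. */
enum toy_kind { TOY_WSTR = 0, TOY_1BYTE, TOY_2BYTE, TOY_4BYTE };

typedef struct {
    enum toy_kind kind;
    wchar_t *wstr;      /* legacy, resizable buffer           */
    void *str;          /* final buffer, fixed once it is set */
    size_t length;
} toy_unicode;

static int toy_resize(toy_unicode *u, size_t newlen)
{
    if (u->kind != TOY_WSTR)        /* frozen after "ready" */
        return -1;
    wchar_t *p = realloc(u->wstr, newlen * sizeof(wchar_t));
    if (p == NULL)
        return -1;
    u->wstr = p;
    u->length = newlen;
    return 0;
}

int main(void)
{
    toy_unicode u = { TOY_WSTR, NULL, NULL, 0 };
    assert(toy_resize(&u, 8) == 0);     /* allowed while kind == 0      */
    u.kind = TOY_1BYTE;                 /* "ready": final buffer chosen */
    assert(toy_resize(&u, 16) == -1);   /* now rejected                 */
    free(u.wstr);
    return 0;
}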

>>> I suppose that compilers prefer a switch with all cases defined, 0 a
>>> first item
>>> and contiguous values. We may need an enum.
>>
>> During the Summer of Code, Martin and I did an experiment with GCC and
>> it did not seem to produce a jump table as an optimization for three
>> cases but generated comparison instructions anyway.
> 
> You mean with a switch with a case for each possible value? 

No, a computed jump on the assembler level. Consider this code

enum kind {null, ucs1, ucs2, ucs4};

void foo(void *d, enum kind k, int i, int v)
{
    switch (k) {
    case ucs1: ((unsigned char*)d)[i] = v; break;
    case ucs2: ((unsigned short*)d)[i] = v; break;
    case ucs4: ((unsigned int*)d)[i] = v; break;
    }
}

gcc 4.6.1 compiles this to

foo:
.LFB0:
        .cfi_startproc
        cmpl    $2, %esi
        je      .L4
        cmpl    $3, %esi
        je      .L5
        cmpl    $1, %esi
        je      .L7
        .p2align 4,,5
        rep
        ret
        .p2align 4,,10
        .p2align 3
.L7:
        movslq  %edx, %rdx
        movb    %cl, (%rdi,%rdx)
        ret
        .p2align 4,,10
        .p2align 3
.L5:
        movslq  %edx, %rdx
        movl    %ecx, (%rdi,%rdx,4)
        ret
        .p2align 4,,10
        .p2align 3
.L4:
        movslq  %edx, %rdx
        movw    %cx, (%rdi,%rdx,2)
        ret
        .cfi_endproc

As you can see, it generates a chain of compares, rather than an
indirect jump through a jump table.

Regards,
Martin


Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-24 Thread Eli Bendersky
> When reviewing the PEP 3151 implementation (*), Ezio commented that
> "FileSystemError" looks a bit strange and that "FilesystemError" would
> be a better spelling. What is your opinion?
>
> (*) http://bugs.python.org/issue12555
>

+1 for FileSystemError

Eli


[Python-Dev] sendmsg/recvmsg on Mac OS X

2011-08-24 Thread Nick Coghlan
The buildbots are complaining about some of the tests for the new
socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
provide CMSG_LEN.

http://www.python.org/dev/buildbot/all/builders/AMD64%20Snow%20Leopard%202%203.x/builds/831/steps/test/logs/stdio

Before I start trying to figure this out without a Mac to test on, are
any of the devs that actually use Mac OS X seeing the failure in their
local builds?

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Nick Coghlan
On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy  wrote:
> In utf16.py, attached to http://bugs.python.org/issue12729
> I propose for consideration a prototype of different solution to the 'mostly
> BMP chars, few non-BMP chars' case. Rather than expand every character from
> 2 bytes to 4, attach an array cpdex of character (ie code point, not code
> unit) indexes. Then for indexing and slicing, the correction is simple,
> simpler than I first expected:
>  code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
> where code-unit-index is the adjusted index into the full underlying
> double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
> most of the space penalty and the consequent time penalty of moving more
> bytes around and increasing cache misses.

Interesting idea, but putting on my C programmer hat, I say -1.

Non-uniform cell size = not a C array = standard C array manipulation
idioms don't work = pain (no matter how simple the index correction
happens to be).

The nice thing about PEP 393 is that it gives us the smallest storage
array that is both an ordinary C array and has sufficiently large
individual elements to handle every character in the string.

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] sendmsg/recvmsg on Mac OS X

2011-08-24 Thread Charles-François Natali
> The buildbots are complaining about some of tests for the new
> socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
> provide CMSG_LEN.

Looks like kernel bugs:
http://developer.apple.com/library/mac/#qa/qa1541/_index.html

"""
Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor passing
[...]
Avoid passing two or more descriptors back-to-back.
"""

We should probably add
@requires_mac_ver(10, 5)

for testFDPassSeparate and testFDPassSeparateMinSpace.

As for InterruptedSendTimeoutTest and testInterruptedSendmsgTimeout,
it also looks like a kernel bug: the syscall should fail with EINTR
once the socket buffer is full. I guess one should skip those on OS-X.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Stefan Behnel

Nick Coghlan, 24.08.2011 15:06:

On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:

In utf16.py, attached to http://bugs.python.org/issue12729
I propose for consideration a prototype of different solution to the 'mostly
BMP chars, few non-BMP chars' case. Rather than expand every character from
2 bytes to 4, attach an array cpdex of character (ie code point, not code
unit) indexes. Then for indexing and slicing, the correction is simple,
simpler than I first expected:
  code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
where code-unit-index is the adjusted index into the full underlying
double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
most of the space penalty and the consequent time penalty of moving more
bytes around and increasing cache misses.


Interesting idea, but putting on my C programmer hat, I say -1.

Non-uniform cell size = not a C array = standard C array manipulation
idioms don't work = pain (no matter how simple the index correction
happens to be).

The nice thing about PEP 393 is that it gives us the smallest storage
array that is both an ordinary C array and has sufficiently large
individual elements to handle every character in the string.


+1

Stefan



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Stephen J. Turnbull
Terry Reedy writes:

 > Excuse me for believing the fine 3.2 manual that says
 > "Strings contain Unicode characters."

The manual is wrong, then, subject to a pronouncement to the contrary,
of course.  I was on your side of the fence when this was discussed,
pre-release.  I was wrong then.  My bet is that we are still wrong,
now.

 > For the purpose of my sentence, the same thing in that code points 
 > correspond to characters,

Not in Unicode, they do not.  By definition, a small number of code
points (eg, U+FFFF) *never* did and *never* will correspond to
characters.  Since about Unicode 3.0, the same is true of surrogate
code points.  Some restrictions have been placed on what can be done
with composed characters, so even with the PEP (which gives us code
point arrays) we do not really get arrays of Unicode characters that
fully conform to the model.

 > strings are NOT code point sequences. They are 2-byte code *unit* 
 > sequences.

I stand corrected on Unicode terminology.  "Code unit" is what I meant,
and what I understand Guido to have defined unicode objects as arrays of.

 > Any narrow build string with even 1 non-BMP char violates the
 > standard.

Yup.  That's by design.

 > > Guido has made that absolutely clear on a number
 > > of occasions.
 > 
 > It is not clear what you mean, but recently on python-ideas he has 
 > reiterated that he intends bytes and strings to be conceptually 
 > different.

Sure.  Nevertheless, practicality beat purity long ago, and that
decision has never been rescinded AFAIK.

 > Bytes are computer-oriented binary arrays; strings are 
 > supposedly human-oriented character/codepoint arrays.

And indeed they are, in UCS-4 builds.  But they are *not* in Unicode!
Unicode violates the array model.  Specifically, in handling composing
characters, and in bidi, where arbitrary slicing of direction control
characters will result in garbled display.

The thing is, that 90% of applications are not really going to care
about full conformance to the Unicode standard.  Of the remaining 10%,
90% are not going to need both huge strings *and* ABI interoperability
with C modules compiled for UCS-2, so UCS-4 is satisfactory.  Of the
remaining 1% of all applications, those that deal with huge strings
*and* need full Unicode conformance, well, they need efficiency too
almost by definition.  They probably are going to want something more
efficient than either the UTF-16 or the UTF-32 representation can
provide, and therefore will need trickier, possibly app-specific,
algorithms that probably do not belong in an initial implementation.

 >  > And the reasons have very little to do with lack of
 > > non-BMP characters to trip up the implementation.  Changing those
 > > semantics should have been done before the release of Python 3.
 > 
 > The documentation was changed at least a bit for 3.0, and anyway, as 
 > indicated above, it is easy (especially for new users) to read the docs 
 > in a way that makes the current behavior buggy. I agree that the 
 > implementation should have been changed already.

I don't.  I suspect Guido does not, even today.

 > Currently, the meaning of Python code differs on narrow versus wide
 > build, and in a way that few users would expect or want.

Let them become developers, then, and show us how to do it better.

 > PEP 393 abolishes narrow builds as we now know them and changes
 > semantics. I was answering a complaint about that change. If you do
 > not like the PEP, fine.

No, I do like the PEP.  However, it is only a step, a rather
conservative one in some ways, toward conformance to the Unicode
character model.  In particular, it does nothing to resolve the fact
that len() will give different answers for character count depending
on normalization, and that slicing and indexing will allow you to cut
characters in half (even in NFC, since not all composed characters
have fully composed forms).

 > > It is not clear to me that it is a good idea to try to decide on "the"
 > > correct implementation of Unicode strings in Python even today.
 > 
 > If the implementation is invisible to the Python user, as I believe it 
 > should be without specially introspection, and mostly invisible in the 
 > C-API except for those who intentionally poke into the details, then the 
 > implementation can be changed as the consensus on best implementation 
 > changes.

A naive implementation of UTF-16 will be quite visible in terms of
performance, I suspect, and performance-oriented applications will "go
behind the API's back" to get it.  We're already seeing that in the
people who insist that bytes are characters too, and string APIs
should work on them just as they do on (Unicode) strings.

 > > It's true that Python is going to need good libraries to provide
 > > correct handling of Unicode strings (as opposed to unicode objects).
 > 
 > Given that 3.0 unicode (string) objects are defined as Unicode character 
 > strings, I do not see the opposition.

I think they're not

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Antoine Pitrou
On Thu, 25 Aug 2011 01:34:17 +0900
"Stephen J. Turnbull"  wrote:
> 
> Martin has long claimed that the fact that I/O is done in terms of
> UTF-16 means that the internal representation is UTF-16

Which I/O?





Re: [Python-Dev] sendmsg/recvmsg on Mac OS X

2011-08-24 Thread Antoine Pitrou
On Wed, 24 Aug 2011 15:31:50 +0200
Charles-François Natali  wrote:
> > The buildbots are complaining about some of tests for the new
> > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
> > provide CMSG_LEN.
> 
> Looks like kernel bugs:
> http://developer.apple.com/library/mac/#qa/qa1541/_index.html
> 
> """
> Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor passing
> [...]
> Avoid passing two or more descriptors back-to-back.
> """

But Snow Leopard, where these failures occur, is OS X 10.6.

Antoine.




Re: [Python-Dev] FileSystemError or FilesystemError?

2011-08-24 Thread Vlad Riscutia
+1 for FileSystemError. I see myself misspelling it as FileSystemError if we
go with the alternate spelling. I probably won't be the only one.

Thank you,
Vlad

On Wed, Aug 24, 2011 at 4:09 AM, Eli Bendersky  wrote:

>
> When reviewing the PEP 3151 implementation (*), Ezio commented that
>> "FileSystemError" looks a bit strange and that "FilesystemError" would
>> be a better spelling. What is your opinion?
>>
>> (*) http://bugs.python.org/issue12555
>>
>
> +1 for FileSystemError
>
> Eli
>
>
>


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Stephen J. Turnbull
Antoine Pitrou writes:
 > On Thu, 25 Aug 2011 01:34:17 +0900
 > "Stephen J. Turnbull"  wrote:
 > > 
 > > Martin has long claimed that the fact that I/O is done in terms of
 > > UTF-16 means that the internal representation is UTF-16
 > 
 > Which I/O?

Eg, display of characters in the interpreter.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Antoine Pitrou
Le jeudi 25 août 2011 à 02:15 +0900, Stephen J. Turnbull a écrit :
> Antoine Pitrou writes:
>  > On Thu, 25 Aug 2011 01:34:17 +0900
>  > "Stephen J. Turnbull"  wrote:
>  > > 
>  > > Martin has long claimed that the fact that I/O is done in terms of
>  > > UTF-16 means that the internal representation is UTF-16
>  > 
>  > Which I/O?
> 
> Eg, display of characters in the interpreter.

I don't know why you say it's "done in terms of UTF-16", then. Unicode
strings are simply encoded to whatever character set is detected as the
terminal's character set.

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Victor Stinner

Le 24/08/2011 02:46, Terry Reedy a écrit :

On 8/23/2011 9:21 AM, Victor Stinner wrote:

Le 23/08/2011 15:06, "Martin v. Löwis" a écrit :

Well, things have to be done in order:
1. the PEP needs to be approved
2. the performance bottlenecks need to be identified
3. optimizations should be applied.


I would not vote for the PEP if it slows down Python, especially if it's
much slower. But Torsten says that it speeds up Python, which is
surprising. I have to do my own benchmarks :-)


The current UCS2 Unicode string implementation, by design, quickly gives
WRONG answers for len(), iteration, indexing, and slicing if a string
contains any non-BMP (surrogate pair) Unicode characters. That may have
been excusable when there essentially were no such extended chars, and
the few there were were almost never used. But now there are many more,
with more being added to each Unicode edition. They include cursive Math
letters that are used in English documents today. The problem will
slowly get worse and Python, at least on Windows, will become a language
to avoid for dependable Unicode document processing. 3.x needs a proper
Unicode implementation that works for all strings on all builds.


I don't think that using UTF-16 with surrogate pairs is really a big
problem. A lot of work has been done to hide this. For example,
repr(chr(0x10FFFF)) now displays '\U0010ffff' instead of two characters.
Ezio recently fixed the str.is*() methods in Python 3.2+.


For len(str): it's a known problem, but if you really care about the number
of *characters* and not the number of UTF-16 units, it's easy to
implement your own character_length() function. len(str) gives the
UTF-16 units instead of the number of characters for a simple reason:
it's faster: O(1), whereas character_length() is O(n).
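
As an illustration, a rough standalone O(n) helper over a UTF-16 code-unit
buffer (a hypothetical function, not a CPython API):

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count code points in a UTF-16 buffer: a high surrogate followed by a
   low surrogate is counted as a single character. */
static size_t character_length(const uint16_t *units, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (units[i] >= 0xD800 && units[i] <= 0xDBFF &&
            i + 1 < n &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF)
            i++;                /* skip the low surrogate */
        count++;
    }
    return count;
}

int main(void)
{
    /* "a" + U+10000 (encoded as D800 DC00) + "b" -> 3 characters. */
    const uint16_t s[] = { 0x0061, 0xD800, 0xDC00, 0x0062 };
    assert(character_length(s, 4) == 3);
    return 0;
}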



utf16.py, attached to http://bugs.python.org/issue12729
prototypes a different solution than the PEP for the above problems for
the 'mostly BMP' case. I will discuss it in a different post.


Yeah, you can work around UTF-16 limits using O(n) algorithms.

PEP-393 provides support for the full Unicode charset (U+0000-U+10FFFF)
on all platforms with a small memory footprint and only O(1) functions.


Note: Java and the Qt library also use UTF-16 strings and have exactly
the same "limitations" for str[n] and len(str).


Victor


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Martin v. Löwis
>  > PEP 393 abolishes narrow builds as we now know them and changes
>  > semantics. I was answering a complaint about that change. If you do
>  > not like the PEP, fine.
> 
> No, I do like the PEP.  However, it is only a step, a rather
> conservative one in some ways, toward conformance to the Unicode
> character model.

I'd like to point out that the improved compatibility is only a side
effect, not the primary objective of the PEP. The primary objective
is the reduction in memory usage. (any changes in runtime are also
side effects, and it's not really clear yet whether you get speedups
or slowdowns on average, or no effect).

>  > Given that 3.0 unicode (string) objects are defined as Unicode character 
>  > strings, I do not see the opposition.
> 
> I think they're not, I think they're defined as Unicode code unit
> arrays, and that the documentation is in error.

That's just a description of the implementation, and not part of the
language, though. My understanding is that the "abstract Python language
definition" considers this aspect implementation-defined: PyPy,
Jython, IronPython etc. would be free to do things differently
(and I understand that there are plans to do PEP-393 style Unicode
 objects in PyPy).

> Martin has long claimed that the fact that I/O is done in terms of
> UTF-16 means that the internal representation is UTF-16, so I could be
> wrong.  But when issues of slicing, len() values and so on have come
> up in the past, Guido has always said "no, there will be no change in
> semantics of builtins here".

Not with these words, though. As I recall, it's rather like (still
with different words) "len() will stay O(1) forever, regardless of
any perceived incorrectness of this choice". An attempt to change
the builtins to introduce higher complexity for the sake of correctness
is what he rejects. I think PEP 393 balances this well, keeping
the O(1) operations in that complexity, while improving the cross-
platform "correctness" of these functions.

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Martin v. Löwis
>> Eg, display of characters in the interpreter.
> 
> I don't know why you say it's "done in terms of UTF-16", then. Unicode
> strings are simply encoded to whatever character set is detected as the
> terminal's character set.

I think what he means (and what I meant when I said something similar):
I/O will consider surrogate pairs in the representation when converting
to the output encoding. This is actually relevant only for UTF-8 (I
think), which converts surrogate pairs "correctly". This can be taken
as a proof that Python 3.2 is "UTF-16 aware" (in some places, but not in
others).

With Python's I/O architecture, it is of course not *actually* the I/O
which considers UTF-16, but the codec.

Regards,
Martin


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Victor Stinner

Le 24/08/2011 11:22, Glenn Linderman a écrit :

c) mostly ASCII (utf8) with clever indexing/caching to be efficient
d) UTF-8 with clever indexing/caching to be efficient

I see neither a need nor a means to consider these.


The discussion about "mostly ASCII" strings seems convincing that there
could be a significant space savings if such were implemented.


Antoine's optimization in the UTF-8 decoder has been removed. It doesn't 
change the memory footprint, it is just slower to create the Unicode object.


When you decode a UTF-8 string:

 - "abc" string uses "latin1" (8 bits) units
 - "aé" string uses "latin1" (8 bits) units <= cool!
 - "a€" string uses UCS2 (16 bits) units
 - "a\U0010FFFF" string uses UCS4 (32 bits) units

Victor


[Python-Dev] PEP 393 review

2011-08-24 Thread Martin v. Löwis
Guido has agreed to eventually pronounce on PEP 393. Before that can
happen, I'd like to collect feedback on it. There have been a number
of voices supporting the PEP in principle, so I'm now interested in
comments in the following areas:

- objections in principle. I'll list them in the PEP.
- issues to be considered (unclarities, bugs, limitations, ...)
- conditions you would like to pose on the implementation before
  acceptance. I'll see which of these can be resolved, and list
  the ones that remain open.

Regards,
Martin


Re: [Python-Dev] PEP 393 review

2011-08-24 Thread Antoine Pitrou
On Wed, 24 Aug 2011 20:15:24 +0200
"Martin v. Löwis"  wrote:
> - issues to be considered (unclarities, bugs, limitations, ...)

With this PEP, the unicode object overhead grows to 10 pointer-sized
words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
Does it have any adverse effects?

Are there any plans to make instantiation of small strings fast enough?
Or is it already as fast as it should be?

When interfacing with the Win32 "wide" APIs, what is the recommended
way to get the required LPCWSTR?

Will the format codes returning a Py_UNICODE pointer with
PyArg_ParseTuple be deprecated?

Do you think the wstr representation could be removed in some future
version of Python?

Is PyUnicode_Ready() necessary for all unicode objects, or only those
allocated through the legacy API?

“The Py_Unicode representation is not instantaneously available”: you
mean the Py_UNICODE representation?

> - conditions you would like to pose on the implementation before
>   acceptance. I'll see which of these can be resolved, and list
>   the ones that remain open.

That it doesn't significantly slow down benchmarks such as stringbench
and iobench.

Regards

Antoine.




Re: [Python-Dev] sendmsg/recvmsg on Mac OS X

2011-08-24 Thread Ned Deily
In article <[email protected]>,
 Antoine Pitrou  wrote:
> On Wed, 24 Aug 2011 15:31:50 +0200
> Charles-François Natali  wrote:
> > > The buildbots are complaining about some of tests for the new
> > > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
> > > provide CMSG_LEN.
> > 
> > Looks like kernel bugs:
> > http://developer.apple.com/library/mac/#qa/qa1541/_index.html
> > 
> > """
> > Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor 
> > passing
> > [...]
> > Avoid passing two or more descriptors back-to-back.
> > """
> 
> But Snow Leopard, where these failures occur, is OS X 10.6.

But chances are the build is using the default 10.4 ABI.  Adding 
MACOSX_DEPLOYMENT_TARGET=10.6 as an env variable to ./configure may fix 
it.  There is an open issue to change configure to use better defaults 
for this.  (I'm right in the middle of reconfiguring my development 
systems so I can't test it myself immediately but I'll report back 
shortly.)

-- 
 Ned Deily,
 [email protected]



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Terry Reedy

On 8/24/2011 1:50 PM, "Martin v. Löwis" wrote:


I'd like to point out that the improved compatibility is only a side
effect, not the primary objective of the PEP.


Then why does the Rationale start with "on systems only supporting 
UTF-16, users complain that non-BMP characters are not properly supported."?


A Windows user can only solve this problem by switching to *nix.


The primary objective is the reduction in memory usage.


On average (perhaps). As I understand the PEP, for some strings, Windows 
users will see a doubling of memory usage. Statistically, that doubling 
is probably more likely in longer texts. Ascii-only Python code and 
other limited-to-ascii text will benefit. Typical English business 
documents will see no change as they often have proper non-ascii quotes 
and occasional accented characters, trademark symbols, and other things.


I think you have the objectives backwards. Adding memory is a lot easier 
than switching OSes.


--
Terry Jan Reedy




Re: [Python-Dev] sendmsg/recvmsg on Mac OS X

2011-08-24 Thread Antoine Pitrou
On Wed, 24 Aug 2011 11:37:20 -0700
Ned Deily  wrote:

> In article <[email protected]>,
>  Antoine Pitrou  wrote:
> > On Wed, 24 Aug 2011 15:31:50 +0200
> > Charles-François Natali  wrote:
> > > > The buildbots are complaining about some of tests for the new
> > > > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
> > > > provide CMSG_LEN.
> > > 
> > > Looks like kernel bugs:
> > > http://developer.apple.com/library/mac/#qa/qa1541/_index.html
> > > 
> > > """
> > > Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor 
> > > passing
> > > [...]
> > > Avoid passing two or more descriptors back-to-back.
> > > """
> > 
> > But Snow Leopard, where these failures occur, is OS X 10.6.
> 
> But chances are the build is using the default 10.4 ABI.  Adding 
> MACOSX_DEPLOYMENT_TARGET=10.6 as an env variable to ./configure may fix 
> it.

Does the ABI affect kernel bugs?

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Glenn Linderman

On 8/24/2011 9:00 AM, Stefan Behnel wrote:

Nick Coghlan, 24.08.2011 15:06:

On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:

In utf16.py, attached to http://bugs.python.org/issue12729
I propose for consideration a prototype of different solution to the 
'mostly
BMP chars, few non-BMP chars' case. Rather than expand every 
character from
2 bytes to 4, attach an array cpdex of character (ie code point, not 
code

unit) indexes. Then for indexing and slicing, the correction is simple,
simpler than I first expected:
  code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
where code-unit-index is the adjusted index into the full underlying
double-byte array. This adds a time penalty of log2(len(cpdex)), but 
avoids
most of the space penalty and the consequent time penalty of moving 
more

bytes around and increasing cache misses.


Interesting idea, but putting on my C programmer hat, I say -1.

Non-uniform cell size = not a C array = standard C array manipulation
idioms don't work = pain (no matter how simple the index correction
happens to be).

The nice thing about PEP 383 is that it gives us the smallest storage
array that is both an ordinary C array and has sufficiently large
individual elements to handle every character in the string.


+1 


Yes, this sounds like a nice benefit, but the problem is it is false.  
The correct statement would be:


The nice thing about PEP 383 is that it gives us the smallest storage
array that is both an ordinary C array and has sufficiently large
individual elements to handle every Unicode codepoint in the string.

As Tom eloquently describes in the referenced issue (is Tom ever 
non-eloquent?), not all characters can be represented in a single codepoint.


It seems there are three concepts in Unicode, code units, codepoints, 
and characters, none of which are equivalent (and the first of which 
varies according to the encoding).  It also seems (to me) that Unicode 
has failed in its original premise, of being an easy way to handle "big 
char" for "all languages" with fixed size elements, but it is not clear 
that its original premise is achievable regardless of the size of "big 
char", when mixed directionality is desired, and it seems that support 
of some single languages requires mixed directionality, not to mention 
mixed language support.


Given the required variability of character size in all presently 
Unicode defined encodings, I tend to agree with Tom that UTF-8, together 
with some technique of translating character index to code unit offset, 
may provide the best overall space utilization, and adequate CPU 
efficiency.  On the other hand, there are large subsets of applications 
that simply do not require support for bidirectional text or composed 
characters, and for those that do not, it remains to be seen if the 
price to be paid for supporting those features is too high a price for 
such applications. So far, we don't have implementations to benchmark to 
figure that out!


What does this mean for Python?  Well, if Python is willing to limit its 
support for applications to the subset for which the "big char" solution 
is sufficient, then PEP 393 provides a way to do that, which looks to be 
pretty effective for reducing memory consumption for those applications 
that use short strings most of which can be classified by content into 
the 1 byte or 2 byte representations.  Applications that support long 
strings are more likely to be bitten by the occasional "outlier" character 
that is longer than the average character, doubling or quadrupling the 
space needed to represent such strings, and eliminating a significant 
portion of the space savings the PEP is providing for other 
applications.  Benchmarks may or may not fully reflect the actual 
requirements of all applications, so conclusions based on benchmarking 
can easily be blind-sided by the realities of other applications, unless 
the benchmarks are carefully constructed.


It is possible that the ideas in PEP 393, with its support for multiple 
underlying representations, could be the basis for some more complex 
representations that would better support characters rather than only 
supporting code points, but Martin has stated he is not open to 
additional representations, so the PEP itself cannot be that basis 
(although with care which may or may not be taken in the implementation 
of the PEP, the implementation may still provide that basis).


Re: [Python-Dev] sendmsg/recvmsg on Mac OS X

2011-08-24 Thread Charles-François Natali
> But Snow Leopard, where these failures occur, is OS X 10.6.

*sighs*
It still looks like a kernel/libc bug to me: AFAICT, both the code and
the tests are correct.
And apparently, there are still issues pertaining to FD passing on
10.5 (and maybe later, I couldn't find a public access to their bug
tracker):
http://lists.apple.com/archives/Darwin-dev/2008/Feb/msg00033.html

Anyway, if someone with a recent OS X release could run test_socket,
it would probably help. Follow ups to http://bugs.python.org/issue6560


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Guido van Rossum
On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman  wrote:
> On 8/24/2011 9:00 AM, Stefan Behnel wrote:
>
> Nick Coghlan, 24.08.2011 15:06:
>
> On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
>
> In utf16.py, attached to http://bugs.python.org/issue12729
> I propose for consideration a prototype of different solution to the 'mostly
> BMP chars, few non-BMP chars' case. Rather than expand every character from
> 2 bytes to 4, attach an array cpdex of character (ie code point, not code
> unit) indexes. Then for indexing and slicing, the correction is simple,
> simpler than I first expected:
>   code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
> where code-unit-index is the adjusted index into the full underlying
> double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
> most of the space penalty and the consequent time penalty of moving more
> bytes around and increasing cache misses.
>
> Interesting idea, but putting on my C programmer hat, I say -1.
>
> Non-uniform cell size = not a C array = standard C array manipulation
> idioms don't work = pain (no matter how simple the index correction
> happens to be).
>
> The nice thing about PEP 383 is that it gives us the smallest storage
> array that is both an ordinary C array and has sufficiently large
> individual elements to handle every character in the string.
>
> +1
>
> Yes, this sounds like a nice benefit, but the problem is it is false.  The
> correct statement would be:
>
>   The nice thing about PEP 383 is that it gives us the smallest storage
>   array that is both an ordinary C array and has sufficiently large
>   individual elements to handle every Unicode codepoint in the string.

(PEP 393, I presume. :-)

> As Tom eloquently describes in the referenced issue (is Tom ever
> non-eloquent?), not all characters can be represented in a single codepoint.

But this is also besides the point (except insofar where we have to
remind ourselves not to confuse the two in docs).

> It seems there are three concepts in Unicode, code units, codepoints, and
> characters, none of which are equivalent (and the first of which varies
> according to the encoding). It also seems (to me) that Unicode has failed
> in its original premise, of being an easy way to handle "big char" for "all
> languages" with fixed size elements, but it is not clear that its original
> premise is achievable regardless of the size of "big char", when mixed
> directionality is desired, and it seems that support of some single
> languages require mixed directionality, not to mention mixed language
> support.

I see nothing wrong with having the language's fundamental data types
(i.e., the unicode object, and even the re module) to be defined in
terms of codepoints, not characters, and I see nothing wrong with
len() returning the number of codepoints (as long as it is advertised
as such). After all UTF-8 also defines an encoding for a sequence of
code points. Characters that require two or more codepoints are not
represented special in UTF-8 -- they are represented as two or more
encoded codepoints. The added requirement that UTF-8 must only be used
to represent valid characters is just that -- it doesn't affect how
strings are encoded, just what is considered valid at a higher level.

> Given the required variability of character size in all presently Unicode
> defined encodings, I tend to agree with Tom that UTF-8, together with some
> technique of translating character index to code unit offset, may provide
> the best overall space utilization, and adequate CPU efficiency.

There is no doubt that UTF-8 is the most space efficient. I just don't
think it is worth giving up O(1) indexing of codepoints -- it would
change programmers' expectations too much.

OTOH I am sold on getting rid of the added complexities of "narrow
builds" where not even all codepoints can be represented without using
surrogate pairs (i.e. two code units per codepoint) and indexing uses
code units instead of codepoints. I think this is an area where PEP
393 has a huge advantage: users can get rid of their exceptions for
narrow builds.

> On the
> other hand, there are large subsets of applications that simply do not
> require support for bidirectional text or composed characters, and for those
> that do not, it remains to be seen if the price to be paid for supporting
> those features is too high a price for such applications. So far, we don't
> have implementations to benchmark to figure that out!

I think you are saying that many apps can ignore the distinction
between codepoints and characters. Given the complexity of bidi
rendering and normalization (which will always remain an issue) I
agree; this is much less likely to be a burden than the narrow-build
issues with code units vs. codepoints.

What should the stdlib do? It should try to skirt the issue where it
can (using the garbage-in-garbage-out principle) and advertise what it
supports where there is a difference. I don'

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Terry Reedy

On 8/24/2011 12:34 PM, Stephen J. Turnbull wrote:

Terry Reedy writes:

  >  Excuse me for believing the fine 3.2 manual that says
  >  "Strings contain Unicode characters."

The manual is wrong, then, subject to a pronouncement to the contrary,


Please suggest a re-wording then, as it is a bug for doc and behavior to 
disagree.



  >  For the purpose of my sentence, the same thing in that code points
  >  correspond to characters,

Not in Unicode, they do not.  By definition, a small number of code
points (eg, U+) *never* did and *never* will correspond to
characters.


On computers, characters are represented by code points. What about the 
other way around? http://www.unicode.org/glossary/#C says

code point:
1) i in range(0x110000) 
2) "A value, or position, for a character" 
(To muddy the waters more, 'character' has multiple definitions also.)
You are using 1), I am using 2) ;-(.


  >  Any narrow build string with even 1 non-BMP char violates the
  >  standard.

Yup.  That's by design.

[...]

Sure.  Nevertheless, practicality beat purity long ago, and that
decision has never been rescinded AFAIK.


I think you have it backwards. I see the current situation as the purity 
of the C code beating the practicality for the user of getting right 
answers.



The thing is, that 90% of applications are not really going to care
about full conformance to the Unicode standard.


I remember when Intel argued that 99% of applications were not going to 
be affected when the math coprocessor in its then new chips occasionally 
gave 'non-standard' answers with certain divisors.



  >  Currently, the meaning of Python code differs on narrow versus wide
  >  build, and in a way that few users would expect or want.

Let them become developers, then, and show us how to do it better.


I posted a proposal with a link to a prototype implementation in Python. 
It pretty well solves the problem of narrow builds acting different from 
wide builds with respect to the basic operations of len(), iterations, 
indexing, and slicing.
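
For those who don't want to open the attachment, here is a minimal
sketch of the idea (not the actual utf16.py code; names are mine, and
it assumes the input string comes from a wide build or the pep-393
branch):

import bisect

class U16String:
    def __init__(self, s):
        units, cpdex = [], []
        for i, ch in enumerate(s):
            cp = ord(ch)
            if cp > 0xFFFF:
                cpdex.append(i)          # character indexes of non-BMP chars
                cp -= 0x10000
                units.append(0xD800 + (cp >> 10))    # high surrogate
                units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
            else:
                units.append(cp)
        self._units, self._cpdex, self._len = units, cpdex, len(s)

    def __len__(self):
        return self._len                 # length in characters, not code units

    def __getitem__(self, char_index):
        # code-unit-index = char-index + number of surrogate pairs before it
        u = char_index + bisect.bisect_left(self._cpdex, char_index)
        cp = self._units[u]
        if 0xD800 <= cp < 0xDC00:        # high surrogate: recombine the pair
            cp = 0x10000 + ((cp - 0xD800) << 10) + (self._units[u + 1] - 0xDC00)
        return chr(cp)

s = U16String('a\U0001043cb')
print(len(s), s[1] == '\U0001043c')      # 3 True

The real prototype also handles iteration and slicing, but the index
correction in __getitem__ is the whole trick.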



No, I do like the PEP.  However, it is only a step, a rather
conservative one in some ways, toward conformance to the Unicode
character model.  In particular, it does nothing to resolve the fact
that len() will give different answers for character count depending
on normalization, and that slicing and indexing will allow you to cut
characters in half (even in NFC, since not all composed characters
have fully composed forms).


I believe my scheme could be extended to solve that also. It would 
require more pre-processing and more knowledge than I currently have of 
normalization. I have the impression that the grapheme problem goes 
further than just normalization.


--
Terry Jan Reedy



Re: [Python-Dev] sendmsg/recvmsg on Mac OS X

2011-08-24 Thread Ned Deily
In article 
,
 Charles-Francois Natali  wrote:
> > But Snow Leopard, where these failures occur, is OS X 10.6.
> 
> *sighs*
> It still looks like a kernel/libc bug to me: AFAICT, both the code and
> the tests are correct.
> And apparently, there are still issues pertaining to FD passing on
> 10.5 (and maybe later, I couldn't find a public access to their bug
> tracker):
> http://lists.apple.com/archives/Darwin-dev/2008/Feb/msg00033.html
> 
> Anyway, if someone with a recent OS X release could run test_socket,
> it would probably help. Follow ups to http://bugs.python.org/issue6560

I was able to do a quick test on 10.7 Lion and the 8 test failures still 
occur regardless of deployment target.  Sorry, I don't have time to 
further investigate.

-- 
 Ned Deily,
 [email protected]



Re: [Python-Dev] sendmsg/recvmsg on Mac OS X

2011-08-24 Thread Ned Deily
In article <[email protected]>,
 Antoine Pitrou  wrote:
> On Wed, 24 Aug 2011 11:37:20 -0700
> Ned Deily  wrote:
> > In article <[email protected]>,
> >  Antoine Pitrou  wrote:
> > > On Wed, 24 Aug 2011 15:31:50 +0200
> > > Charles-François Natali  wrote:
> > > > > The buildbots are complaining about some of tests for the new
> > > > > socket.sendmsg/recvmsg added by issue #6560 for *nix platforms that
> > > > > provide CMSG_LEN.
> > > > 
> > > > Looks like kernel bugs:
> > > > http://developer.apple.com/library/mac/#qa/qa1541/_index.html
> > > > 
> > > > """
> > > > Yes. Mac OS X 10.5 fixes a number of kernel bugs related to descriptor 
> > > > passing
> > > > [...]
> > > > Avoid passing two or more descriptors back-to-back.
> > > > """
> > > 
> > > But Snow Leopard, where these failures occur, is OS X 10.6.
> > 
> > But chances are the build is using the default 10.4 ABI.  Adding 
> > MACOSX_DEPLOYMENT_TARGET=10.6 as an env variable to ./configure may fix 
> > it.
> 
> Does the ABI affect kernel bugs?

If it's more of a "libc" sort of bug (i.e. somewhere below the app 
layer), it could.  But, unfortunately, that doesn't seem to be the case 
here.

-- 
 Ned Deily,
 [email protected]



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Terry Reedy

On 8/24/2011 1:45 PM, Victor Stinner wrote:

On 24/08/2011 02:46, Terry Reedy wrote:



I don't think that using UTF-16 with surrogate pairs is really a big
problem. A lot of work has been done to hide this. For example,
repr(chr(0x10ffff)) now displays '\U0010ffff' instead of two characters.
Ezio fixed recently str.is*() methods in Python 3.2+.


I greatly appreciate that he did. The * (lower,upper,title) methods 
apparently are not fixed yet as the corresponding new tests are 
currently skipped for narrow builds.



For len(str): its a known problem, but if you really care of the number
of *character* and not the number of UTF-16 units, it's easy to
implement your own character_length() function. len(str) gives the
UTF-16 units instead of the number of character for a simple reason:
it's faster: O(1), whereas character_length() is O(n).


It is O(1) after a one-time O(n) preprocessing, which is the same time 
order for creating the string in the first place.


Anyway, I think the most important deficiency is with iteration:

>>> from unicodedata import name
>>> name('\U0001043c')
'DESERET SMALL LETTER DEE'
>>> for c in 'abc\U0001043c':
print(name(c))

LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
Traceback (most recent call last):
  File "", line 2, in 
print(name(c))
ValueError: no such name

This would work on wide builds but does not here (win7) because narrow 
build iteration produces a naked non-character surrogate code unit that 
has no specific entry in the Unicode Character Database.


I believe that most new people who read "Strings contain Unicode 
characters." would expect string iteration to always produce the Unicode 
characters that they put in the string. The extra time per char needed 
to produce the surrogate pair that represents the character entered is 
O(1).
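
A rough sketch of what I mean (not the prototype's code; it assumes
well-formed surrogate pairs):

def iter_code_points(s):
    # Pair up surrogates on a narrow build; on a wide build (or under
    # PEP 393) this is just plain iteration.
    it = iter(s)
    for ch in it:
        cp = ord(ch)
        if 0xD800 <= cp < 0xDC00:    # high surrogate: combine with the next unit
            cp = 0x10000 + ((cp - 0xD800) << 10) + (ord(next(it)) - 0xDC00)
        yield cp

def character_length(s):
    # O(n), as Victor says, but it counts characters, not UTF-16 code units.
    return sum(1 for _ in iter_code_points(s))

s = 'abc\U0001043c'
print(len(s), character_length(s))   # 5 4 on a narrow build, 4 4 on a wide build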



utf16.py, attached to http://bugs.python.org/issue12729
prototypes a different solution than the PEP for the above problems for
the 'mostly BMP' case. I will discuss it in a different post.


Yeah, you can workaround UTF-16 limits using O(n) algorithms.


I presented O(log(number of non-BMP chars)) algorithms for indexing and 
slicing. For the mostly BMP case, that is hugely better than O(n).



PEP-393 provides support of the full Unicode charset (U+0000-U+10FFFF)
on all platforms with a small memory footprint and only O(1) functions.


For Windows users, I believe it will nearly double the memory footprint 
if there are any non-BMP chars. On my new machine, I should not mind 
that in exchange for correct behavior.


--
Terry Jan Reedy




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Ethan Furman

Terry Reedy wrote:

PEP-393 provides support of the full Unicode charset (U+0000-U+10FFFF)
on all platforms with a small memory footprint and only O(1) functions.


For Windows users, I believe it will nearly double the memory footprint 
if there are any non-BMP chars. On my new machine, I should not mind 
that in exchange for correct behavior.




+1

Heck, I wouldn't mind it on my /old/ machine in exchange for correct 
behavior!


~Ethan~


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Victor Stinner
On Wednesday 24 August 2011 20:52:51, Glenn Linderman wrote:
> Given the required variability of character size in all presently
> Unicode defined encodings, I tend to agree with Tom that UTF-8, together
> with some technique of translating character index to code unit offset,
> may provide the best overall space utilization, and adequate CPU
> efficiency.

UTF-8 can use more space than latin1 or UCS2:
>>> text="abc"; len(text.encode("latin1")), len(text.encode("utf8"))
(3, 3)
>>> text="ééé"; len(text.encode("latin1")), len(text.encode("utf8"))
(3, 6)
>>> text="€€€"; len(text.encode("utf-16-le")), len(text.encode("utf8"))
(6, 9)
>>> text="北京"; len(text.encode("utf-16-le")), len(text.encode("utf8"))
(4, 6)

UTF-8 uses less space than PEP 393 only if you have few non-ASCII characters 
(or few non-BMP characters).

About speed, I guess that O(n) (UTF-8 indexing) is slower than O(1) 
(PEP 393 indexing).
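
To be fair, the O(n) can be amortized with a sparse offset table. A rough 
sketch (my own toy code, not something from the PEP or from the tracker issue):

def build_index(data, step=64):
    # Byte offset of every `step`-th code point in a UTF-8 buffer.
    offsets, count = [0], 0
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte: new code point
            if count and count % step == 0:
                offsets.append(i)
            count += 1
    return offsets

def byte_offset(data, index, offsets, step=64):
    # Start at the nearest checkpoint, then walk at most `step` code points.
    i = offsets[index // step]
    remaining = index % step
    while remaining:
        i += 1
        if data[i] & 0xC0 != 0x80:
            remaining -= 1
    return i

text = 'a' * 100 + 'é' + '\U0001043c'
data = text.encode('utf-8')
table = build_index(data)
off = byte_offset(data, 101, table)      # code point 101 is the non-BMP char
print(data[off:off + 4].decode('utf-8')) # '\U0001043c'

So lookups become O(step) with a small table, but it is still more code and 
more cache traffic than a flat array.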

> ...  Applications that support long
> strings are more likely to bitten by the occasional "outlier" character
> that is longer than the average character, doubling or quadrupling the
> space needed to represent such strings, and eliminating a significant
> portion of the space savings the PEP is providing for other
> applications.

In these worst cases, PEP 393 is not worse than the current 
implementation: it uses just as much memory as Python in wide mode (the mode 
used on Linux and Mac OS X because wchar_t is 32 bits). But it uses double 
the memory of Python in narrow mode (Windows).

I agree that UTF-8 is better in these corner cases, but I also bet that most 
Python programs will use less memory and will be faster with PEP 393. You 
can already try the pep-393 branch on your own programs.
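
For a rough idea of the per-string footprint on your build (the numbers will 
of course differ between a narrow build, a wide build and the pep-393 branch):

import sys
for text in ('spam' * 25, 'é' * 100, '€' * 100, '\U0001043c' * 100):
    print(len(text), sys.getsizeof(text))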

> Benchmarks may or may not fully reflect the actual
> requirements of all applications, so conclusions based on benchmarking
> can easily be blind-sided the realities of other applications, unless
> the benchmarks are carefully constructed.

I used stringbench and "./python -m test test_unicode". I plan to try iobench.

Which other benchmark tool should be used? Should we write a new one?

> It is possible that the ideas in PEP 393, with its support for multiple
> underlying representations, could be the basis for some more complex
> representations that would better support characters rather than only
> supporting code points, ...

I don't think that the *default* Unicode type is the best place for this. The 
base Unicode type has to be *very* efficient.

If you have unusual needs, write your own type. Maybe based on the base type?

Victor



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Martin v. Löwis
> For Windows users, I believe it will nearly double the memory footprint
> if there are any non-BMP chars. On my new machine, I should not mind
> that in exchange for correct behavior.

In addition, strings with non-BMP chars are much more rare than strings
with all Latin-1, for which memory usage halves on Windows.

Regards,
Martin


Re: [Python-Dev] PEP 393 review

2011-08-24 Thread Victor Stinner
> With this PEP, the unicode object overhead grows to 10 pointer-sized
> words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
> Does it have any adverse effects?

For pure ASCII, it might be possible to use a shorter struct:

typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_hash_t hash;
int state;
Py_ssize_t wstr_length;
wchar_t *wstr;
/* no more utf8_length, utf8, str */
/* followed by ascii data */
} _PyASCIIObject;
(-2 pointer -1 ssize_t: 56 bytes)

=> "a" is 58 bytes (with utf8 for free, without wchar_t)

For object allocated with the new API, we can use a shorter struct:

typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_hash_t hash;
int state;
Py_ssize_t wstr_length;
wchar_t *wstr;
Py_ssize_t utf8_length;
char *utf8;
/* no more str pointer */
/* followed by latin1/ucs2/ucs4 data */
} _PyNewUnicodeObject;
(-1 pointer: 72 bytes)

=> "é" is 74 bytes (without utf8 / wchar_t)

For the legacy API:

typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_hash_t hash;
int state;
Py_ssize_t wstr_length;
wchar_t *wstr;
Py_ssize_t utf8_length;
char *utf8;
void *str;
} _PyLegacyUnicodeObject;
(same size: 80 bytes)

=> "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)

The current struct:

typedef struct {
PyObject_HEAD
Py_ssize_t length;
Py_UNICODE *str;
Py_hash_t hash;
int state;
PyObject *defenc;
} PyUnicodeObject;

=> "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is 
wchar_t)

... but the code (maybe only the macros?) and debugging will be more complex.

> Will the format codes returning a Py_UNICODE pointer with
> PyArg_ParseTuple be deprecated?

Because Python 2.x is still dominant and it's already hard enough to port C 
modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).

> Do you think the wstr representation could be removed in some future
> version of Python?

Conversion to wchar_t* is common, especially on Windows. But I don't know if 
we *have to* cache the result. Is it cached by the way? Or is wstr only used 
when a string is created from Py_UNICODE?

Victor



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Glenn Linderman

On 8/24/2011 12:34 PM, Guido van Rossum wrote:

On Wed, Aug 24, 2011 at 11:52 AM, Glenn Linderman  wrote:

On 8/24/2011 9:00 AM, Stefan Behnel wrote:

Nick Coghlan, 24.08.2011 15:06:

On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:

In utf16.py, attached to http://bugs.python.org/issue12729
I propose for consideration a prototype of different solution to the 'mostly
BMP chars, few non-BMP chars' case. Rather than expand every character from
2 bytes to 4, attach an array cpdex of character (ie code point, not code
unit) indexes. Then for indexing and slicing, the correction is simple,
simpler than I first expected:
   code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
where code-unit-index is the adjusted index into the full underlying
double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
most of the space penalty and the consequent time penalty of moving more
bytes around and increasing cache misses.

Interesting idea, but putting on my C programmer hat, I say -1.

Non-uniform cell size = not a C array = standard C array manipulation
idioms don't work = pain (no matter how simple the index correction
happens to be).

The nice thing about PEP 383 is that it gives us the smallest storage
array that is both an ordinary C array and has sufficiently large
individual elements to handle every character in the string.

+1

Yes, this sounds like a nice benefit, but the problem is it is false.  The
correct statement would be:

   The nice thing about PEP 383 is that it gives us the smallest storage
   array that is both an ordinary C array and has sufficiently large
   individual elements to handle every Unicode codepoint in the string.

(PEP 393, I presume. :-)


This statement might yet be made true :)


As Tom eloquently describes in the referenced issue (is Tom ever
non-eloquent?), not all characters can be represented in a single codepoint.

But this is also besides the point (except insofar where we have to
remind ourselves not to confuse the two in docs).


In the docs, yes, and in programmer's minds (influenced by docs).


It seems there are three concepts in Unicode, code units, codepoints, and
characters, none of which are equivalent (and the first of which varies
according to the encoding). It also seems (to me) that Unicode has failed
in its original premise, of being an easy way to handle "big char" for "all
languages" with fixed size elements, but it is not clear that its original
premise is achievable regardless of the size of "big char", when mixed
directionality is desired, and it seems that support of some single
languages require mixed directionality, not to mention mixed language
support.

I see nothing wrong with having the language's fundamental data types
(i.e., the unicode object, and even the re module) to be defined in
terms of codepoints, not characters, and I see nothing wrong with
len() returning the number of codepoints (as long as it is advertised
as such).


Me neither.


After all UTF-8 also defines an encoding for a sequence of
code points. Characters that require two or more codepoints are not
represented special in UTF-8 -- they are represented as two or more
encoded codepoints. The added requirement that UTF-8 must only be used
to represent valid characters is just that -- it doesn't affect how
strings are encoded, just what is considered valid at a higher level.


Yes, this is true.  In one sense, though, since UTF-8-supporting code 
already has to deal with variable length codepoint encoding, support for 
variable length character encoding seems like a minor extension, not 
upsetting any concept of fixed-width optimizations, because such cannot 
be used.



Given the required variability of character size in all presently Unicode
defined encodings, I tend to agree with Tom that UTF-8, together with some
technique of translating character index to code unit offset, may provide
the best overall space utilization, and adequate CPU efficiency.

There is no doubt that UTF-8 is the most space efficient. I just don't
think it is worth giving up O(1) indexing of codepoints -- it would
change programmers' expectations too much.


Programmers that have to deal with bidi or composed characters shouldn't 
have such expectations, of course.   But there are many programmers who 
do not, or at least who think they do not, and they can retain their 
O(1) expectations, I suppose, until it bites them.



OTOH I am sold on getting rid of the added complexities of "narrow
builds" where not even all codepoints can be represented without using
surrogate pairs (i.e. two code units per codepoint) and indexing uses
code units instead of codepoints. I think this is an area where PEP
393 has a huge advantage: users can get rid of their exceptions for
narrow builds.


Yep, the only justification for narrow builds is in interfacing to 
underlying broken OS that happen to use that encoding... it might be 
slightly more efficient when doing API calls to such an O

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Tim Delaney
On 25 August 2011 07:10, Victor Stinner wrote:

>
> I used stringbench and "./python -m test test_unicode". I plan to try
> iobench.
>
> Which other benchmark tool should be used? Should we write a new one?


I think that the PyPy benchmarks (or at least selected tests such as
slowspitfire) would probably exercise things quite well.

http://speed.pypy.org/about/

Tim Delaney


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Guido van Rossum
On Wed, Aug 24, 2011 at 3:29 PM, Glenn Linderman  wrote:
> It would seem helpful if the stdlib could have some support for efficient
> handling of Unicode characters in some representation.  It would help
> address the class of applications that does care.

I claim that we have insufficient understanding of their needs to put
anything in the stdlib. Wait and see is a good strategy here.

> Adding extra support for
> Unicode character handling sooner rather than later could be an performance
> boost to applications that do care about full character support, and I can
> only see the numbers of such applications increasing over time.  Such could
> be built as a subtype of str, perhaps, but if done in Python, there would
> likely be a significant performance hit when going from str to
> "unicodeCharacterStr".

Sounds like overengineering to me. The right time to add something to
the stdlib is when a large number of apps *currently* need something,
not when you expect that they might need it in the future. (There just
are too many possible futures to plan for them all. YAGNI rules.)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Stephen J. Turnbull
Antoine Pitrou writes:
 > On Thursday 25 August 2011 at 02:15 +0900, Stephen J. Turnbull wrote:
 > > Antoine Pitrou writes:
 > >  > On Thu, 25 Aug 2011 01:34:17 +0900
 > >  > "Stephen J. Turnbull"  wrote:
 > >  > > 
 > >  > > Martin has long claimed that the fact that I/O is done in terms of
 > >  > > UTF-16 means that the internal representation is UTF-16
 > >  > 
 > >  > Which I/O?
 > > 
 > > Eg, display of characters in the interpreter.
 > 
 > I don't know why you say it's "done in terms of UTF-16", then. Unicode
 > strings are simply encoded to whatever character set is detected as the
 > terminal's character set.

But it's not "simple" at the level we're talking about!

Specifically, *in-memory* surrogates are properly respected when doing
the encoding, and therefore such I/O is not UCS-2 or "raw code units".
This treatment is different from sizing and indexing of unicodes,
where surrogates are not treated differently from other code points.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Stephen J. Turnbull
Terry Reedy writes:

 > Please suggest a re-wording then, as it is a bug for doc and behavior to 
 > disagree.

Strings contain Unicode code units, which for most purposes can be
treated as Unicode characters.  However, even as "simple" an
operation as "s1[0] == s2[0]" cannot be relied upon to give
Unicode-conforming results.

The second sentence remains true under PEP 393.

 > >   >  For the purpose of my sentence, the same thing in that code points
 > >   >  correspond to characters,
 > >
 > > Not in Unicode, they do not.  By definition, a small number of code
 > > points (eg, U+) *never* did and *never* will correspond to
 > > characters.
 > 
 > On computers, characters are represented by code points. What about the 
 > other way around? http://www.unicode.org/glossary/#C says
 > code point:
 > 1) i in range(0x11000) 
 > 2) "A value, or position, for a character" 
 > (To muddy the waters more, 'character' has multiple definitions also.)
 > You are using 1), I am using 2) ;-(.

No, you're not.  You are claiming an isomorphism, which Unicode goes
to great trouble to avoid.

 > I think you have it backwards. I see the current situation as the purity 
 > of the C code beating the practicality for the user of getting right 
 > answers.

Sophistry.  "Always getting the right answer" is purity.

 > > The thing is, that 90% of applications are not really going to care
 > > about full conformance to the Unicode standard.
 > 
 > I remember when Intel argued that 99% of applications were not going to 
 > be affected when the math coprocessor in its then new chips occasionally 
 > gave 'non-standard' answers with certain divisors.

In the case of Intel, the people who demanded standard answers did so
for efficiency reasons -- they needed the FPU to DTRT because
implementing FP in software was always going to be too slow.  CPython,
IMO, can afford to trade off because the implementation will
necessarily be in software, and can be added later as a Python or C module.

 > I believe my scheme could be extended to solve [conformance for
 > composing characters] also. It would require more pre-processing
 > and more knowledge than I currently have of normalization. I have
 > the impression that the grapheme problem goes further than just
 > normalization.

Yes and yes.  But now you're talking about database lookups for every
character (to determine if it's a composing character).  Efficiency of
a generic implementation isn't going to happen.

Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's
pronouncement, "indexing is going to be O(1)".  And Nick's point about
non-uniform arrays is telling.  I have 20 years of experience with an
implementation of text as a non-uniform array which presents an array
API, and *everything* needs to be special-cased for efficiency, and
*any* small change can have show-stopping performance implications.

Python can probably do better than Emacs has done due to much better
leadership in this area, but I still think it's better to make full
conformance optional.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Stephen J. Turnbull
Guido van Rossum writes:

 > I see nothing wrong with having the language's fundamental data types
 > (i.e., the unicode object, and even the re module) to be defined in
 > terms of codepoints, not characters, and I see nothing wrong with
 > len() returning the number of codepoints (as long as it is advertised
 > as such).

In fact, the Unicode Standard, Version 6, goes farther (to code units):

2.7  Unicode Strings

A Unicode string data type is simply an ordered sequence of code
units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit
code units, a Unicode 16-bit string is an ordered sequence of
16-bit code units, and a Unicode 32-bit string is an ordered
sequence of 32-bit code units. 

Depending on the programming environment, a Unicode string may or
may not be required to be in the corresponding Unicode encoding
form. For example, strings in Java, C#, or ECMAScript are Unicode
16-bit strings, but are not necessarily well-formed UTF-16
sequences.

(p. 32).



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Guido van Rossum
On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull
 wrote:
> Terry Reedy writes:
>
>  > Please suggest a re-wording then, as it is a bug for doc and behavior to
>  > disagree.
>
>    Strings contain Unicode code units, which for most purposes can be
>    treated as Unicode characters.  However, even as "simple" an
>    operation as "s1[0] == s2[0]" cannot be relied upon to give
>    Unicode-conforming results.
>
> The second sentence remains true under PEP 393.

Really? If strings contain code units, that expression compares code
units. What is non-conforming about comparing two code points? They
are just integers.

Seriously, what does Unicode-conforming mean here? It would be better
to specify chapter and verse (e.g. is it a specific thing defined by
the dreaded TR18?)

>  > >   >  For the purpose of my sentence, the same thing in that code points
>  > >   >  correspond to characters,
>  > >
>  > > Not in Unicode, they do not.  By definition, a small number of code
>  > > points (eg, U+) *never* did and *never* will correspond to
>  > > characters.
>  >
>  > On computers, characters are represented by code points. What about the
>  > other way around? http://www.unicode.org/glossary/#C says
>  > code point:
>  > 1) i in range(0x11000) 
>  > 2) "A value, or position, for a character" 
>  > (To muddy the waters more, 'character' has multiple definitions also.)
>  > You are using 1), I am using 2) ;-(.
>
> No, you're not.  You are claiming an isomorphism, which Unicode goes
> to great trouble to avoid.

I don't know that we will be able to educate our users to the point
where they will use code unit, code point, character, glyph, character
set, encoding, and other technical terms correctly. TBH even though
less than two hours ago I composed a reply in this thread, I've
already forgotten which is a code point and which is a code unit.

>  > I think you have it backwards. I see the current situation as the purity
>  > of the C code beating the practicality for the user of getting right
>  > answers.
>
> Sophistry.  "Always getting the right answer" is purity.

Eh? In most other areas Python is pretty careful not to promise to
"always get the right answer" since what is right is entirely in the
user's mind. We often go to great lengths of defining how things work
so as to set the right expectations. For example, variables in Python
work differently than in most other languages.

Now I am happy to admit that for many Unicode issues the level at
which we have currently defined things (code units, I think -- the
thingies that encodings are made of) is confusing, and it would be
better to switch to the others (code points, I think). But characters
are right out.

>  > > The thing is, that 90% of applications are not really going to care
>  > > about full conformance to the Unicode standard.
>  >
>  > I remember when Intel argued that 99% of applications were not going to
>  > be affected when the math coprocessor in its then new chips occasionally
>  > gave 'non-standard' answers with certain divisors.
>
> In the case of Intel, the people who demanded standard answers did so
> for efficiency reasons -- they needed the FPU to DTRT because
> implementing FP in software was always going to be too slow.  CPython,
> IMO, can afford to trade off because the implementation will
> necessarily be in software, and can be added later as a Python or C module.

It is not so easy to change expectations about O(1) vs. O(N) behavior
of indexing however. IMO we shouldn't try and hence we're stuck with
operations defined in terms of code thingies instead of (mostly
mythical) characters.

>  > I believe my scheme could be extended to solve [conformance for
>  > composing characters] also. It would require more pre-processing
>  > and more knowledge than I currently have of normalization. I have
>  > the impression that the grapheme problem goes further than just
>  > normalization.
>
> Yes and yes.  But now you're talking about database lookups for every
> character (to determine if it's a composing character).  Efficiency of
> a generic implementation isn't going to happen.

Let's take small steps. Do the evolutionary thing. Let's get things
right so users won't have to worry about code points vs. code units
any more. A conforming library for all things at the character level
can be developed later, once we understand things better at that level
(again, most developers don't even understand most of the subtleties,
so I claim we're not ready).

> Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's
> pronouncement, "indexing is going to be O(1)".

I still think that. It would be too big of a cultural upheaval to change it.

>  And Nick's point about
> non-uniform arrays is telling.  I have 20 years of experience with an
> implementation of text as a non-uniform array which presents an array
> API, and *everything* needs to be special-cased for efficiency, and
> *any* small change can have show-stopping performanc

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Guido van Rossum
On Wed, Aug 24, 2011 at 5:36 PM, Stephen J. Turnbull  wrote:
> Guido van Rossum writes:
>
>  > I see nothing wrong with having the language's fundamental data types
>  > (i.e., the unicode object, and even the re module) to be defined in
>  > terms of codepoints, not characters, and I see nothing wrong with
>  > len() returning the number of codepoints (as long as it is advertised
>  > as such).
>
> In fact, the Unicode Standard, Version 6, goes farther (to code units):
>
>    2.7  Unicode Strings
>
>    A Unicode string data type is simply an ordered sequence of code
>    units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit
>    code units, a Unicode 16-bit string is an ordered sequence of
>    16-bit code units, and a Unicode 32-bit string is an ordered
>    sequence of 32-bit code units.
>
>    Depending on the programming environment, a Unicode string may or
>    may not be required to be in the corresponding Unicode encoding
>    form. For example, strings in Java, C#, or ECMAScript are Unicode
>    16-bit strings, but are not necessarily well-formed UTF-16
>    sequences.
>
> (p. 32).

I am assuming that that definition only applies to use of the term
"unicode string" within the standard and has no bearing on how
programming languages are allowed to use the term, as that would be
preposterous. (They can define what they mean by terms like
well-formed and conforming etc., and I won't try to go against that.
But limiting what can be called a unicode string feels like
unproductive coddling.)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Nick Coghlan
On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum  wrote:
> Now I am happy to admit that for many Unicode issues the level at
> which we have currently defined things (code units, I think -- the
> thingies that encodings are made of) is confusing, and it would be
> better to switch to the others (code points, I think). But characters
> are right out.

Indeed, code points are the abstract concept and code units are the
specific byte sequences that are used for serialisation (FWIW, I'm
going to try to keep this straight in the future by remembering that
the Unicode character set is defined as abstract points on planes,
just like geometry).

With narrow builds, code units can currently come into play
internally, but with PEP 393 everything internal will be working
directly with code points. Normalisation, combining characters and
bidi issues may still affect the correctness of unicode comparison and
slicing (and other text manipulation), but there are limits to how
much of the underlying complexity we can effectively hide without
being misleading.

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Guido van Rossum
On Wed, Aug 24, 2011 at 7:47 PM, Nick Coghlan  wrote:
> On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum  wrote:
>> Now I am happy to admit that for many Unicode issues the level at
>> which we have currently defined things (code units, I think -- the
>> thingies that encodings are made of) is confusing, and it would be
>> better to switch to the others (code points, I think). But characters
>> are right out.
>
> Indeed, code points are the abstract concept and code units are the
> specific byte sequences that are used for serialisation (FWIW, I'm
> going to try to keep this straight in the future by remembering that
> the Unicode character set is defined as abstract points on planes,
> just like geometry).

Hm, code points still look pretty concrete to me (integers in the
range 0 .. 2**21) and code units don't feel like byte sequences to me
(at least not UTF-16 code units -- in Python at least you can think of
them as integers in the range 0 .. 2**16).
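
To pin the vocabulary down in code (on a wide build; a narrow build already 
hands you the two units when iterating):

ch = '\U0001043c'
print(ord(ch))                 # the code point: 66620 (0x1043C)
units = ch.encode('utf-16-le')
print([int.from_bytes(units[i:i + 2], 'little') for i in (0, 2)])
                               # the two UTF-16 code units: [55297, 56380]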

> With narrow builds, code units can currently come into play
> internally, but with PEP 393 everything internal will be working
> directly with code points. Normalisation, combining characters and
> bidi issues may still affect the correctness of unicode comparison and
> slicing (and other text manipulation), but there are limits to how
> much of the underlying complexity we can effectively hide without
> being misleading.

Let's just define a Unicode string to be a sequence of code points and
let libraries deal with the rest. Ok, methods like lower() should
consider characters, but indexing/slicing should refer to code points.
Same for '=='; we can have a library that compares by applying (or
assuming?) certain normalizations. Tom C tells me that case-less
comparison cannot use a.lower() == b.lower(); fine, we can add that
operation to the library too. But this exceeds the scope of PEP 393,
right?
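
For what it's worth, a rough sketch of such a library-level caseless 
comparison is tiny -- this assumes a str.casefold() method (newer Pythons 
provide one) and makes no claim to full conformance:

import unicodedata

def caseless_eq(a, b):
    # Roughly Unicode "canonical caseless matching": NFD, casefold, NFD again.
    def key(s):
        return unicodedata.normalize('NFD',
                   unicodedata.normalize('NFD', s).casefold())
    return key(a) == key(b)

print('ß'.lower() == 'SS'.lower())   # False: lower() is not a case fold
print(caseless_eq('ß', 'SS'))        # True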

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Nick Coghlan
On Thu, Aug 25, 2011 at 1:11 PM, Guido van Rossum  wrote:
>> With narrow builds, code units can currently come into play
>> internally, but with PEP 393 everything internal will be working
>> directly with code points. Normalisation, combining characters and
>> bidi issues may still affect the correctness of unicode comparison and
>> slicing (and other text manipulation), but there are limits to how
>> much of the underlying complexity we can effectively hide without
>> being misleading.
>
> Let's just define a Unicode string to be a sequence of code points and
> let libraries deal with the rest. Ok, methods like lower() should
> consider characters, but indexing/slicing should refer to code points.
> Same for '=='; we can have a library that compares by applying (or
> assuming?) certain normalizations. Tom C tells me that case-less
> comparison cannot use a.lower() == b.lower(); fine, we can add that
> operation to the library too. But this exceeds the scope of PEP 393,
> right?

Yep, I was agreeing with you on this point - I think you're right that
if we provide a solid code point based core Unicode type (perhaps with
some character based methods), then library support can fill the gap
between handling code points and handling characters.

In particular, a unicode character based string type would be
significantly easier to write in Python than it would be in C (after
skimming Tom's bug report at http://bugs.python.org/issue12729, I
better understand the motivation and desire for that kind of interface
and it sounds like Terry's prototype is along those lines). Once those
mappings are thrashed out outside the core, then there may be
something to incorporate directly around the 3.4 timeframe (or
potentially even in 3.3, since it should already be possible to
develop such a wrapper based on UCS4 builds of 3.2)
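
As a quick illustration of how little Python it takes to get started (this 
is emphatically *not* UAX #29 text segmentation -- it merely glues combining 
marks onto the preceding code point):

import unicodedata

def naive_clusters(s):
    cluster = ''
    for cp in s:
        if cluster and unicodedata.combining(cp):
            cluster += cp                # attach combining mark to its base
        else:
            if cluster:
                yield cluster
            cluster = cp
    if cluster:
        yield cluster

class CharView:
    # len() and indexing by (approximate) user-perceived character.
    def __init__(self, s):
        self._clusters = list(naive_clusters(s))
    def __len__(self):
        return len(self._clusters)
    def __getitem__(self, i):
        return self._clusters[i]

s = 'e\u0301le\u0301phant'               # "éléphant" with decomposed accents
print(len(s), len(CharView(s)))          # 10 8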

However, there may be an important distinction to be made on the
Python-the-language vs CPython-the-implementation front: is another
implementation (e.g. PyPy) *allowed* to implement character based
indexing instead of code point based for 2.x unicode/3.x str type? Or
is the code point indexing part of the language spec, and any
character based indexing needs to be provided via a separate type or
module?

Regards,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Stephen J. Turnbull
Guido van Rossum writes:

 > On Wed, Aug 24, 2011 at 5:31 PM, Stephen J. Turnbull
 >  wrote:

 > >    Strings contain Unicode code units, which for most purposes can be
 > >    treated as Unicode characters.  However, even as "simple" an
 > >    operation as "s1[0] == s2[0]" cannot be relied upon to give
 > >    Unicode-conforming results.
 > >
 > > The second sentence remains true under PEP 393.
 > 
 > Really? If strings contain code units, that expression compares code
 > units.

That's true out of context, but in context it's "which for most
purposes can be treated as Unicode characters", and this is what Terry
is concerned with, as well.

 > What is non-conforming about comparing two code points?

Unicode conformance means treating characters correctly.  In
particular, s1 and s2 might be NFC and NFD forms of the same string
with a combining character at s2[1], or s1[1] and s[2] might be a
non-combining character and a combining character respectively.

 > Seriously, what does Unicode-conforming mean here?

Chapter 3, all verses.  Here, specifically C6, p. 60.  One would have
to define the process executing "s1[0] == s2[0]" to be sure that even
in the cases cited in the previous paragraph non-conformance is
occurring, but one example of a process where that is non-conforming
(without additional code to check for trailing combining characters)
is in comparison of Vietnamese filenames generated on a Mac vs. those
generated on a Linux host.
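
Concretely (a trivial sketch using only the stdlib):

import unicodedata

s1 = unicodedata.normalize('NFC', 'e\u0301')   # one code point: U+00E9
s2 = unicodedata.normalize('NFD', 'e\u0301')   # two code points: U+0065 U+0301
print(len(s1), len(s2))                        # 1 2
print(s1 == s2)                                # False, though both display as "é"
print(s1[0] == s2[0])                          # False: U+00E9 vs U+0065
print(unicodedata.normalize('NFC', s2) == s1)  # True after normalization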

 > > No, you're not.  You are claiming an isomorphism, which Unicode goes
 > > to great trouble to avoid.
 > 
 > I don't know that we will be able to educate our users to the point
 > where they will use code unit, code point, character, glyph, character
 > set, encoding, and other technical terms correctly.

Sure.  I got it wrong myself earlier.

I think that the right thing to do is to provide a conformant
implementation of Unicode text in the stdlib (a long run goal, see
below), and call that "Unicode", while we call strings "strings".

 > Now I am happy to admit that for many Unicode issues the level at
 > which we have currently defined things (code units, I think -- the
 > thingies that encodings are made of) is confusing, and it would be
 > better to switch to the others (code points, I think).

Yes, and AFAICT (I'm better at reading standards than I am at reading
Python implementation) PEP 393 allows that.

 > But characters are right out.

+1

 > It is not so easy to change expectations about O(1) vs. O(N) behavior
 > of indexing however. IMO we shouldn't try and hence we're stuck with
 > operations defined in terms of code thingies instead of (mostly
 > mythical) characters.

Well, O(N) is not really the question.  It's really O(log N), as Terry
says.  Is that out, too?  I can verify that it's possible to do it in
practice in the long term.  In my experience with Emacs, even with 250
MB files, O(log N) mostly gives acceptable performance in an
interactive editor, as well as many scripted textual applications.

The problems that I see are

(1) It's very easy to write algorithms that would be O(N) for a true
array, but then become O(N log N) or worse (and the coefficient on
the O(log N) algorithm is way higher to start).  I guess this
would kill the idea, but.

(2) Maintenance is fragile; it's easy to break the necessary caches
with feature additions and bug fixes.  (However, I don't think
this would be as big a problem for Python, due to its more
disciplined process, as it has been for XEmacs.)

You might think space for the caches would be a problem, but that has
turned out not to be the case for Emacsen.

 > Let's take small steps. Do the evolutionary thing. Let's get things
 > right so users won't have to worry about code points vs. code units
 > any more. A conforming library for all things at the character level
 > can be developed later, once we understand things better at that level
 > (again, most developers don't even understand most of the subtleties,
 > so I claim we're not ready).

I don't think anybody does.  That's one reason there's a new version
of Unicode every few years.

 > This I agree with (though if you were referring to me with
 > "leadership" I consider myself woefully underinformed about Unicode
 > subtleties).

  MvL and MAL are not, however, and there are plenty of others
who make contributions -- in an orderly fashion.

 > I also suspect that Unicode "conformance" (however defined) is more
 > part of a political battle than an actual necessity.  I'd much
 > rather have us fix Tom Christiansen's specific bugs than chase the
 > elusive "standard conforming".

Well, I would advocate specifying which parts of the standard we
target and which not (for any given version).  The goal of full
"Chapter 3" conformance should be left up to a library on PyPI for the
nonce IMO.  I agree that fixing specific bugs should be given
precedence over "conformance chasing," but implementation should
conform to the appropriate parts of the standard.

Re: [Python-Dev] PEP 393 review

2011-08-24 Thread Stefan Behnel

Victor Stinner, 25.08.2011 00:29:

With this PEP, the unicode object overhead grows to 10 pointer-sized
words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
Does it have any adverse effects?


For pure ASCII, it might be possible to use a shorter struct:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    /* no more utf8_length, utf8, str */
    /* followed by ascii data */
} _PyASCIIObject;
(-2 pointer -1 ssize_t: 56 bytes)

=>  "a" is 58 bytes (with utf8 for free, without wchar_t)

For object allocated with the new API, we can use a shorter struct:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    Py_ssize_t utf8_length;
    char *utf8;
    /* no more str pointer */
    /* followed by latin1/ucs2/ucs4 data */
} _PyNewUnicodeObject;
(-1 pointer: 72 bytes)

=>  "é" is 74 bytes (without utf8 / wchar_t)

For the legacy API:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    int state;
    Py_ssize_t wstr_length;
    wchar_t *wstr;
    Py_ssize_t utf8_length;
    char *utf8;
    void *str;
} _PyLegacyUnicodeObject;
(same size: 80 bytes)

=>  "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)

The current struct:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_UNICODE *str;
    Py_hash_t hash;
    int state;
    PyObject *defenc;
} PyUnicodeObject;

=>  "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is
wchar_t)

... but the code (maybe only the macros?) and debugging will be more complex.


That's an interesting idea. However, it's not required to do this as part 
of the PEP 393 implementation. This can be added later on if the need 
evidently arises in general practice.
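
(For what it's worth, once such a variable-width layout exists the
per-string footprint can simply be measured; a rough sketch against an
interpreter that implements it -- the exact byte counts are implementation
details and will differ:)

    import sys

    # Footprint grows with the widest code point in the string: 1 byte per
    # character for ASCII/Latin-1, 2 for other BMP text, 4 for astral text,
    # plus the fixed per-object header discussed above.
    for s in ["", "abc", "\u00e9" * 3, "\u20ac" * 3, "\U0001F600" * 3]:
        print(ascii(s), sys.getsizeof(s), "bytes")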


Also, there is always the possibility to simply intern very short strings 
in order to avoid their multiplication in memory. Long strings don't suffer 
from this as the data size quickly dominates. User code that works with a 
lot of short strings would likely do the same.
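
A short sketch of what that looks like at the Python level (via sys.intern;
whether it pays off naturally depends on the workload):

    import sys

    a = sys.intern("".join(["uni", "code"]))  # a dynamically built short string
    b = sys.intern("unicode")
    print(a is b)  # True: both names now refer to a single shared object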


BTW, I would expect that many short strings either go away as quickly as 
they appeared (e.g. in a parser) or were brought in as literals and are 
therefore interned anyway. That's just one reason why I suggest waiting for 
proof of inefficiency in the real world (and, obviously, testing your own 
code with this as quickly as possible).




Will the format codes returning a Py_UNICODE pointer with
PyArg_ParseTuple be deprecated?


Because Python 2.x is still dominant and it's already hard enough to port C
modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).


Well, it will be quite inefficient in future CPython versions, so I think 
if it's not officially deprecated at some point, it will deprecate itself 
for efficiency reasons. Better make it clear that it's worth investing in 
better performance here.




Do you think the wstr representation could be removed in some future
version of Python?


Conversion to wchar_t* is common, especially on Windows.


That's an issue. However, I cannot say how common this really is in 
practice. Surely depends on the specific code, right? How common is it in 
core CPython?




But I don't know if
we *have to* cache the result. Is it cached by the way? Or is wstr only used
when a string is created from Py_UNICODE?


If it's so common on Windows, maybe it should only be cached there?

Stefan



Re: [Python-Dev] PEP 393 review

2011-08-24 Thread Stefan Behnel

"Martin v. Löwis", 24.08.2011 20:15:

Guido has agreed to eventually pronounce on PEP 393. Before that can
happen, I'd like to collect feedback on it. There have been a number
of voices supporting the PEP in principle


Absolutely.



- conditions you would like to pose on the implementation before
   acceptance. I'll see which of these can be resolved, and list
   the ones that remain open.


Just repeating here that I'd like to see the buffer void* changed into a 
union of pointers that state the exact layout type. IMHO, that would 
clarify the implementation and make it clearer that it's correct to access 
the data buffer as a flat array. (Obviously, code that does that is subject 
to future changes, that's why there are macros.)


Stefan



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Glenn Linderman

On 8/24/2011 7:29 PM, Guido van Rossum wrote:

(Hey, I feel a QOTW coming. "Standards? We don't need no stinkin'
standards."http://en.wikipedia.org/wiki/Stinking_badges  :-)


Which deserves an appropriate, follow-on, misquote:

Guido says the Unicode standard stinks.

˚͜˚ <- and a Unicode smiley to go with it!


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-24 Thread Stephen J. Turnbull
Nick Coghlan writes:
 > GvR writes:

 > > Let's just define a Unicode string to be a sequence of code points and
 > > let libraries deal with the rest. Ok, methods like lower() should
 > > consider characters, but indexing/slicing should refer to code points.
 > > Same for '=='; we can have a library that compares by applying (or
 > > assuming?) certain normalizations. Tom C tells me that case-less
 > > comparison cannot use a.lower() == b.lower(); fine, we can add that
 > > operation to the library too. But this exceeds the scope of PEP 393,
 > > right?
 > 
 > Yep, I was agreeing with you on this point - I think you're right that
 > if we provide a solid code point based core Unicode type (perhaps with
 > some character based methods), then library support can fill the gap
 > between handling code points and handling characters.

+1  I don't really see an alternative to this approach.  The
underlying array has to be exposed because there are too many
applications that can take advantage of it, and analysis of decomposed
characters requires it.
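
On the caseless-comparison point quoted above, a minimal sketch of what such
a library operation could look like (spelled here as str.casefold(), which
is only a guess at the eventual API):

    a = "stra\u00dfe"   # "straße"
    b = "STRASSE"
    print(a.lower() == b.lower())        # False: 'ß'.lower() is still 'ß'
    print(a.casefold() == b.casefold())  # True: casefolding maps 'ß' to 'ss'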

Making that array be an array of code points is a really good idea,
and Python already has that in the UCS-4 build.  PEP 393 is "just" a
space optimization that allows getting rid of the narrow build, with
all its wartiness.

 > something to incorporate directly around the 3.4 timeframe (or
 > potentially even in 3.3, since it should already be possible to
 > develop such a wrapper based on UCS4 builds of 3.2)

I agree that it's possible, but I estimate that it's not feasible for
3.3 because we don't yet know the requirements.  This one really needs
to ferment and mature in PyPI for a while because we just don't know
how far the scope of user needs is going to extend.  Bidi is a
mudball[1], confusable character indexes sound like a cool idea for the
web and email (but is anybody really going to use them?), etc.

 > However, there may be an important distinction to be made on the
 > Python-the-language vs CPython-the-implementation front: is another
 > implementation (e.g. PyPy) *allowed* to implement character based
 > indexing instead of code point based for 2.x unicode/3.x str type? Or
 > is the code point indexing part of the language spec, and any
 > character based indexing needs to be provided via a separate type or
 > module?

+1 for language spec.  Remember, there are cases in Unicode where
you'd like to access base characters and the like.  So you need to be
able to get at individual code points in an NFD string.  You shouldn't
need to use different code for that in different implementations of
Python.
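
For example (a sketch with unicodedata):

    import unicodedata

    s = unicodedata.normalize("NFD", "\u00e9")  # 'é' decomposed
    print(len(s))                  # 2 code points in the NFD string
    print(s[0])                    # 'e' -- the base character
    print(unicodedata.name(s[1]))  # COMBINING ACUTE ACCENT
    print(unicodedata.combining(s[1]) != 0)  # True: a combining mark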

Footnotes: 
[1]  Sure, we can implement the UAX#9 bidi algorithm, but it's not
good enough by itself: something as simple as

"File name (default {0}): ".format(name)

can produce disconcerting results if the whole resulting string is
treated by the UBA.  Specifically, using the usual convention of
uppercase letters being an RTL script, name = "ABCD" will result in
the prompt:

File name (default :(DCBA _

(where _ denotes the position of the insertion cursor).  The Hebrew
speakers on emacs-devel agreed that an example using a real Hebrew
string didn't look right to them, either.