Re: [Python-Dev] PEP 393 close to pronouncement

2011-10-11 Thread M.-A. Lemburg
Victor Stinner wrote:
 Given that I've been working on and maintaining the Python Unicode
 implementation actively or by providing assistance for almost
 12 years now, I've also thought about whether it's still worth
 the effort.
 
 Thanks for your huge work on Unicode, Marc-Andre!

Thanks. I enjoyed working on it, but priorities are different
now, and new projects are waiting :-)

 My interests have shifted somewhat into other directions and
 I feel that helping Python reach world domination in other ways
 makes me happier than fighting over Unicode standards, implementations,
 special cases that aren't special enough, and all those other
 nitty-gritty details that cause long discussions :-)
 
 Someone said that we still need to define what a character is! By the way,
 what is a code point?

I'll leave that as an exercise for the interested reader to find out :-)

(Hint: Google should find enough hits where I've explained those things
on various mailing lists and in talks I gave.)

 So I feel that the PEP 393 change is a good time to draw a line
 and leave Unicode maintenance to Ezio, Victor, Martin, and
 all the others that have helped over the years. I know it's
 in good hands.
 
 I don't understand why you would like to stop contributing to Unicode, but

I only have limited time available for these things and am
nowadays more interested in getting others to recognize just
how great Python is, than actually sitting down and writing
patches for it.

Unicode was my baby for quite a few years, but I now have two
kids which need more love and attention :-)

 well, as you want. We will try to continue your work.

Thanks.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread M.-A. Lemburg
Guido van Rossum wrote:
 Given the feedback so far, I am happy to pronounce PEP 393 as
 accepted. Martin, congratulations! Go ahead and mark it as Accepted.
 (But please do fix up the small nits that Victor reported in his
 earlier message.)

I've been working on feedback for the last few days, but I guess it's
too late. Here goes anyway...

I've only read the PEP and not followed the discussion due to lack of
time, so if any of this is no longer valid, that's probably because
the PEP wasn't updated :-)

Resizing
--------

Codecs use resizing a lot. Given that PyCompactUnicodeObject
does not support resizing, most decoders will have to use
PyUnicodeObject and thus not benefit from the memory footprint
advantages of e.g. PyASCIIObject.
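
For reference, the classic decoder pattern I mean looks roughly like
this (a sketch only, not actual codec source; the names are
illustrative):

/* Sketch of the classic decoder pattern: overallocate a Py_UNICODE
   buffer, decode into it, then shrink the result in place with
   PyUnicode_Resize(). */
PyObject *
decode_sketch(const char *input, Py_ssize_t size)
{
    PyObject *unicode = PyUnicode_FromUnicode(NULL, size);
    Py_UNICODE *p;
    Py_ssize_t outpos = 0;

    if (unicode == NULL)
        return NULL;
    p = PyUnicode_AS_UNICODE(unicode);
    /* ... decode 'input', writing code units to p[outpos++] ... */
    if (PyUnicode_Resize(&unicode, outpos) < 0) {  /* shrink to fit */
        Py_XDECREF(unicode);
        return NULL;
    }
    return unicode;
}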


Data structure
--------------

The data structure description in the PEP appears to be wrong:

PyASCIIObject has a wchar_t *wstr pointer - I guess this should
be a char *str pointer; otherwise, where's the memory footprint
advantage (esp. on Linux where sizeof(wchar_t) == 4)?

I also don't see a reason to limit the UCS1 storage version
to ASCII. Accordingly, the object should be called PyLatin1Object
or PyUCS1Object.

Here's the version from the PEP:


typedef struct {
  PyObject_HEAD
  Py_ssize_t length;
  Py_hash_t hash;
  struct {
  unsigned int interned:2;
  unsigned int kind:2;
  unsigned int compact:1;
  unsigned int ascii:1;
  unsigned int ready:1;
  } state;
  wchar_t *wstr;
} PyASCIIObject;

typedef struct {
  PyASCIIObject _base;
  Py_ssize_t utf8_length;
  char *utf8;
  Py_ssize_t wstr_length;
} PyCompactUnicodeObject;


Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing
code will cause problems on some systems where wchar_t is a
signed type.

Python assumes that Py_UNICODE is unsigned and thus doesn't
check for negative values or take these into account when
doing range checks or code point arithmetic.
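
For illustration (sketch only, not CPython source):

/* Correct only if Py_UNICODE is unsigned: with a signed wchar_t,
   a negative (e.g. sign-extended) value would also satisfy
   ch < 128 and could then index a 128-entry table out of bounds. */
static int
is_ascii(Py_UNICODE ch)
{
    return ch < 128;
}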

On platforms where wchar_t is signed, it is safer to
typedef Py_UNICODE to unsigned wchar_t.

Accordingly, and to prevent further breakage, Py_UNICODE
should not be deprecated and should instead be used in place
of wchar_t throughout the code.


Length information
------------------

Py_UNICODE access to the objects assumes that len(obj) ==
length of the Py_UNICODE buffer. The PEP suggests that length
should not take surrogates into account on UCS2 platforms
such as Windows. This causes len(obj) to not match len(wstr).

As a result, Py_UNICODE access to the Unicode objects breaks
when surrogate code points are present in the Unicode object
on UCS2 platforms.

The PEP also does not explain how lone surrogates will be
handled with respect to the length information.

Furthermore, determining len(obj) will require a loop over
the data, checking for surrogate code points. A simple memcpy()
is no longer enough.
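
The kind of loop I mean would look like this (hypothetical helper,
assuming a 16-bit Py_UNICODE, i.e. a narrow build):

/* Count code points in a Py_UNICODE buffer: a surrogate pair
   counts as one code point, so the result can be smaller than
   the number of code units. */
static Py_ssize_t
count_code_points(const Py_UNICODE *buf, Py_ssize_t nunits)
{
    Py_ssize_t i, count = 0;
    for (i = 0; i < nunits; i++) {
        count++;
        if (buf[i] >= 0xD800 && buf[i] <= 0xDBFF /* high surrogate */
            && i + 1 < nunits
            && buf[i+1] >= 0xDC00 && buf[i+1] <= 0xDFFF)
            i++;    /* skip the low half of the pair */
    }
    return count;
}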

I suggest dropping the idea of having len(obj) not count
wstr surrogate code points to maintain backwards compatibility
and allow for working with lone surrogates.

Note that the whole surrogate debate does not have much to
do with this PEP, since it's mainly about memory footprint
savings. I'd also urge a reality check with respect
to surrogates and non-BMP code points: in practice you only
very rarely see any non-BMP code points in your data. Making
all Python users pay for the needs of a tiny fraction is
not really fair. Remember: practicality beats purity.


API
---

Victor already described the needed changes.


Performance
-----------

The PEP only lists a few low-level benchmarks as the basis for the
performance decrease. I'm missing some more adequate real-life
tests, e.g. using an application framework such as Django
(to the extent this is possible with Python3) or a server
like the Radicale calendar server (which is available for Python3).

I'd also like to see a performance comparison which specifically
uses the existing Unicode APIs to create and work with Unicode
objects. Most extensions will use this way of working with the
Unicode API, either because they want to support Python 2 and 3,
or because the effort it takes to port to the new APIs is
too high. The PEP makes some statements that this is slower,
but doesn't quantify those statements.


Memory savings
--------------

The table only lists string sizes up to 8 code points. The memory
savings for these are really only significant for ASCII
strings on 64-bit platforms, if you use the default UCS2
Python build as basis.

For larger strings, I expect the savings to be more significant.
OTOH, a single non-BMP code point in such a string would cause
the savings to drop significantly again.


Complexity
----------

In order to benefit from the new API, any code that has to
deal with low-level Py_UNICODE access to the Unicode objects
will have to be adapted.

For best performance, each algorithm will have to be implemented
for all three storage types.

Not doing so will result in a slow-down, if I read the PEP
correctly. It's difficult to say of what scale, since that
information

Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread Benjamin Peterson
2011/9/28 M.-A. Lemburg m...@egenix.com:
 Guido van Rossum wrote:
 Given the feedback so far, I am happy to pronounce PEP 393 as
 accepted. Martin, congratulations! Go ahead and mark it as Accepted.
 (But please do fix up the small nits that Victor reported in his
 earlier message.)

 I've been working on feedback for the last few days, but I guess it's
 too late. Here goes anyway...

 I've only read the PEP and not followed the discussion due to lack of
 time, so if any of this is no longer valid, that's probably because
 the PEP wasn't updated :-)

 Resizing
 --------

 Codecs use resizing a lot. Given that PyCompactUnicodeObject
 does not support resizing, most decoders will have to use
 PyUnicodeObject and thus not benefit from the memory footprint
 advantages of e.g. PyASCIIObject.


 Data structure
 --------------

 The data structure description in the PEP appears to be wrong:

 PyASCIIObject has a wchar_t *wstr pointer - I guess this should
 be a char *str pointer, otherwise, where's the memory footprint
 advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

 I also don't see a reason to limit the UCS1 storage version
 to ASCII. Accordingly, the object should be called PyLatin1Object
 or PyUCS1Object.

I think the purpose is that if it's only ASCII, no work is needed to
encode to UTF-8.


-- 
Regards,
Benjamin


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread Martin v. Löwis
 Codecs use resizing a lot. Given that PyCompactUnicodeObject
 does not support resizing, most decoders will have to use
 PyUnicodeObject and thus not benefit from the memory footprint
 advantages of e.g. PyASCIIObject.

No, codecs have been rewritten to not use resizing.

 PyASCIIObject has a wchar_t *wstr pointer - I guess this should
 be a char *str pointer, otherwise, where's the memory footprint
 advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

That's the Py_UNICODE representation for backwards compatibility.
It's normally NULL.

 I also don't see a reason to limit the UCS1 storage version
 to ASCII. Accordingly, the object should be called PyLatin1Object
 or PyUCS1Object.

No, in the ASCII case, the UTF-8 length can be shared with the regular
string length - not so for Latin-1 characters above 127.
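
Roughly, the sharing works like this (a simplified sketch, not the
exact implementation; obj is any ready string):

/* ASCII data is already valid UTF-8, so the utf8 pointer and
   length can alias the canonical representation. A Latin-1
   character above 127, e.g. 0xE9 ('é'), becomes two UTF-8 bytes
   (0xC3 0xA9), so buffer and length would diverge. */
const char *utf8;
Py_ssize_t utf8_length;
if (PyUnicode_IS_ASCII(obj)) {
    utf8 = PyUnicode_DATA(obj);              /* shared, no copy */
    utf8_length = PyUnicode_GET_LENGTH(obj); /* same length */
}
else {
    /* must encode into a separate, cached buffer */
}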

 Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing
 code will cause problems on some systems where wchar_t is a
 signed type.
 
 Python assumes that Py_UNICODE is unsigned and thus doesn't
 check for negative values or take these into account when
 doing range checks or code point arithmetic.
 
 On platforms where wchar_t is signed, it is safer to
 typedef Py_UNICODE to unsigned wchar_t.

No. Py_UNICODE values *must* be in the range 0..17*2**16-1.
Values of 17*2**16 or larger are just as bad as negative
values, so having Py_UNICODE unsigned doesn't improve
anything.

 Py_UNICODE access to the objects assumes that len(obj) ==
 length of the Py_UNICODE buffer. The PEP suggests that length
 should not take surrogates into account on UCS2 platforms
 such as Windows. This causes len(obj) to not match len(wstr).

Correct.

 As a result, Py_UNICODE access to the Unicode objects breaks
 when surrogate code points are present in the Unicode object
 on UCS2 platforms.

Incorrect. What specifically do you think would break?

 The PEP also does not explain how lone surrogates will be
 handled with respect to the length information.

Just as any other code point. Python does not special-case
surrogate code points anymore.

 Furthermore, determining len(obj) will require a loop over
 the data, checking for surrogate code points. A simple memcpy()
 is no longer enough.

No, it won't. The length of the Unicode object is stored in
the length field.

 I suggest dropping the idea of having len(obj) not count
 wstr surrogate code points to maintain backwards compatibility
 and allow for working with lone surrogates.

Backwards-compatibility is fully preserved by PyUnicode_GET_SIZE
returning the size of the Py_UNICODE buffer. PyUnicode_GET_LENGTH
returns the true length of the Unicode object.
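
For example (a sketch, assuming a platform with a 16-bit wchar_t
such as Windows):

/* One non-BMP code point: U+1D11E (musical symbol G clef). */
PyObject *s = PyUnicode_FromOrdinal(0x1D11E);
Py_ssize_t units = PyUnicode_GET_SIZE(s);    /* legacy: Py_UNICODE
                                                units; 2 here, a
                                                surrogate pair */
Py_ssize_t points = PyUnicode_GET_LENGTH(s); /* new API: always 1 */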

 Note that the whole surrogate debate does not have much to
 do with this PEP, since it's mainly about memory footprint
 savings. I'd also urge a reality check with respect
 to surrogates and non-BMP code points: in practice you only
 very rarely see any non-BMP code points in your data. Making
 all Python users pay for the needs of a tiny fraction is
 not really fair. Remember: practicality beats purity.

That's the whole point of the PEP. You only pay for what
you actually need, and in most cases, it's ASCII.

 For best performance, each algorithm will have to be implemented
 for all three storage types.

This will be a trade-off. I think most developers will be happy
with a single version covering all three cases, especially as it's
much more maintainable.

Kind regards,
Martin



Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread Victor Stinner
 Resizing
 --------
 
 Codecs use resizing a lot. Given that PyCompactUnicodeObject
 does not support resizing, most decoders will have to use
 PyUnicodeObject and thus not benefit from the memory footprint
 advantages of e.g. PyASCIIObject.

Wrong. Even if you create a string using the legacy API (e.g.
PyUnicode_FromUnicode), the string will be quickly compacted to use the most
efficient memory storage (depending on the maximum character). Quickly: at
the first call to PyUnicode_READY. Python tries to make all strings ready as
early as possible.
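
For example (a sketch using the public 3.3 API):

static PyObject *
legacy_create_sketch(void)
{
    /* Created through the legacy API: the object starts with only
       a Py_UNICODE (wstr) buffer and is not "ready" yet. */
    PyObject *s = PyUnicode_FromUnicode(NULL, 3);
    if (s == NULL)
        return NULL;
    PyUnicode_AS_UNICODE(s)[0] = 'a';
    PyUnicode_AS_UNICODE(s)[1] = 'b';
    PyUnicode_AS_UNICODE(s)[2] = 'c';
    /* The first PyUnicode_READY() call scans for the maximum
       character and compacts the data: here, 1 byte/character. */
    if (PyUnicode_READY(s) < 0) {
        Py_DECREF(s);
        return NULL;
    }
    assert(PyUnicode_KIND(s) == PyUnicode_1BYTE_KIND);
    return s;
}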

 PyASCIIObject has a wchar_t *wstr pointer - I guess this should
 be a char *str pointer, otherwise, where's the memory footprint
 advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

For pure ASCII strings, you don't have to store a pointer to the UTF-8 string, 
nor the length of the UTF-8 string (in bytes), nor the length of the wchar_t 
string (in wide characters): the length is always the length of the ASCII 
string, and the UTF-8 string is shared with the ASCII string. The structure is 
much smaller thanks to these optimizations, and so Python 3.3 uses less memory 
than 2.7 for ASCII strings, even for short strings.

 I also don't see a reason to limit the UCS1 storage version
 to ASCII. Accordingly, the object should be called PyLatin1Object
 or PyUCS1Object.

Latin1 is less interesting: you cannot share the length/data fields with
utf8 or wstr. We didn't add a special case for Latin1 strings (except using
Py_UCS1* strings to store their characters).

 Furthermore, determining len(obj) will require a loop over
 the data, checking for surrogate code points. A simple memcpy()
 is no longer enough.

Wrong. len(obj) gives the right result (see the long discussion about "what
is the length of a string" in a previous thread...) in O(1), since it is
computed when the string is created.

 ... in practice you only
 very rarely see any non-BMP code points in your data. Making
 all Python users pay for the needs of a tiny fraction is
 not really fair. Remember: practicality beats purity.

The creation of the string is maybe a little bit slower (especially when you
have to scan the string twice to first get the maximum character), but I
think that this slowdown is smaller than the speedup allowed by the PEP.

Because ASCII strings are now char*, I think that processing ASCII strings
is faster because the CPU can fit more data in its caches.

We can do better optimizations on ASCII and Latin1 strings (it's faster to
manipulate char* than uint16_t* or uint32_t*). For example, str.center(),
str.ljust(), str.rjust() and str.zfill() now use the very fast memset()
function to pad Latin1 strings.
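
That fast path looks roughly like this (simplified sketch; result,
src, len, fill, left and right are illustrative variables):

/* One byte per character makes padding a plain memset(). */
if (PyUnicode_KIND(result) == PyUnicode_1BYTE_KIND) {
    Py_UCS1 *data = PyUnicode_1BYTE_DATA(result);
    memset(data, fill, left);                  /* left padding */
    memcpy(data + left, src, len);             /* the characters */
    memset(data + left + len, fill, right);    /* right padding */
}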

Another example: duplicating a string (or creating a substring) should be
faster just because you have less data to copy (e.g. 10 bytes for a string
of 10 Latin1 characters vs. 20 or 40 bytes with Python 3.2).

The two most common encodings in the world are ASCII and UTF-8. With PEP
393, encoding to ASCII or UTF-8 is free: you don't have to encode anything,
you directly have the encoded char* buffer (whereas you have to convert
16/32 bit wchar_t to char* in Python 3.2, even for pure ASCII). (It's also
free to encode a Latin1 Unicode string to Latin1.)

With PEP 393, we never have to decode UTF-16 anymore when iterating on code
points to correctly support non-BMP characters (which was required before on
narrow builds, e.g. on Windows). Iterating on code points is now just a
simple loop; there is no need to check whether each character is in the
range U+D800-U+DFFF.
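
The iteration idiom looks like this (sketch; s is any ready string):

/* PyUnicode_READ() yields full code points for all three storage
   kinds, so no surrogate handling is needed. */
int kind = PyUnicode_KIND(s);
void *data = PyUnicode_DATA(s);
Py_ssize_t i, len = PyUnicode_GET_LENGTH(s);
for (i = 0; i < len; i++) {
    Py_UCS4 ch = PyUnicode_READ(kind, data, i);
    /* ch may be > U+FFFF; no U+D800..U+DFFF check required */
}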

There are other funny tricks (optimizations). For example, text.replace(a,
b) knows that there is nothing to do if maxchar(a) > maxchar(text), where
maxchar(obj) just requires reading an attribute of the string. Think about
ASCII and non-ASCII strings: pure_ascii.replace('\xe9', '') now just
creates a new reference...
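
Something like this (sketch; a and text are the argument and the
target string):

/* If the search string contains a character larger than any
   character in the text, it cannot occur in the text at all. */
if (PyUnicode_MAX_CHAR_VALUE(a) > PyUnicode_MAX_CHAR_VALUE(text)) {
    Py_INCREF(text);
    return text;    /* just a new reference, no copy */
}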

I don't think that Martin wrote his PEP to be able to implement all these
optimisations, but they are an interesting side effect of his PEP :-)

 The table only lists string sizes up to 8 code points. The memory
 savings for these are really only significant for ASCII
 strings on 64-bit platforms, if you use the default UCS2
 Python build as basis.

Of the 32 different cases, PEP 393 is better in 29 and just as good as
Python 3.2 in 3 corner cases:

- 1 ASCII, 16-bit wchar, 32-bit
- 1 Latin1, 32-bit wchar, 32-bit
- 2 Latin1, 32-bit wchar, 32-bit

Do you really care about these corner cases? See the more realistic
benchmark in Martin's previous email ("PEP 393 memory savings update"): PEP
393 not only uses 3x less memory than 3.2, it also uses *less* memory than
Python 2.7, even though Python 3 uses Unicode for everything!

 For larger strings, I expect the savings to be more significant.

Sure.

 OTOH, a single non-BMP code point in such a string would cause
 the savings to drop significantly again.

In this case, it's just as good as Python 3.2 in wide mode, but worse 

Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-27 Thread Martin v. Löwis
 "GDB Debugging Hooks": It's not done yet.
 I can do these if need be, but IIRC you (Victor) said on #python-dev
 that you were already working on them.

I already changed it for an earlier version of the PEP. It still needs
updating to sort out the various compact representations. I could do that
as well, so don't worry.

Regards,
Martin


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-27 Thread Victor Stinner
On Tuesday 27 September 2011 00:19:02, Victor Stinner wrote:
 On Windows, there is just one failure in test_configparser, I
 didn't investigate it yet

Oh, it was a real bug in io.IncrementalNewlineDecoder. It is now fixed.

Victor


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-27 Thread Guido van Rossum
Given the feedback so far, I am happy to pronounce PEP 393 as
accepted. Martin, congratulations! Go ahead and mark it as Accepted.
(But please do fix up the small nits that Victor reported in his
earlier message.)

-- 
--Guido van Rossum (python.org/~guido)


[Python-Dev] PEP 393 close to pronouncement

2011-09-26 Thread Guido van Rossum
Martin has asked me to pronounce on PEP 393, after he's updated it in
response to various feedback (including mine :-). I'm currently
looking very favorably on it, but I thought I'd give folks here one
more chance to bring up showstoppers.

So, if you have the time, please review PEP 393 and/or play with the
code (the repo is linked from the PEP's References section now).

Please limit your feedback to show-stopping issues; we're past the
stage of bikeshedding here. It's Good Enough (TM) and we'll have the
rest of the 3.3 release cycle to improve it incrementally. But we need
to get to the point where the code can be committed to the 3.3 branch.

In a few days I'll pronounce.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-26 Thread Victor Stinner
Hi,

On Monday 26 September 2011 23:00:06, Guido van Rossum wrote:
 So, if you have the time, please review PEP 393 and/or play with the
 code (the repo is linked from the PEP's References section now).

I played with the code. The full test suite passes on Linux, FreeBSD and
Windows. On Windows, there is just one failure in test_configparser, I
didn't investigate it yet. I like the new API: a classic loop on the string
length, and a macro to read the nth character. The backward compatibility
is fully transparent and is already well tested, because some modules still
use the legacy API.
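
That loop style looks like this (sketch; str is any ready string):

Py_ssize_t i, len = PyUnicode_GET_LENGTH(str);
for (i = 0; i < len; i++) {
    /* PyUnicode_READ_CHAR() re-checks the kind on each call;
       hoisting PyUnicode_KIND()/PyUnicode_DATA() out of the loop
       and using PyUnicode_READ() is the faster variant. */
    Py_UCS4 ch = PyUnicode_READ_CHAR(str, i);
    /* ... use ch ... */
}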

It's quite easy to move from the legacy API to the new API. It's just boring, 
but it's almost done in the core (unicodeobject.c, but also some modules like 
_io).

Since the introduction of PyASCIIObject, PEP 393 is really good in memory
footprint, especially for ASCII-only strings. In Python, you manipulate a
lot of ASCII strings.


PEP
===

It's not clear what is deprecated. It would help to have a full list of the 
deprecated functions/macros.

Sometimes Martin wrote PyUnicode_Ready, sometimes PyUnicode_READY. It's 
confusing.

Typo: PyUnicode_FAST_READY should be PyUnicode_READY.

PyUnicode_WRITE_CHAR is not listed in the New API section.

Typo in PyUnicode_CONVERT_BYTES(from_type, tp_type, begin, end, to):
tp_type should be to_type.

PyUnicode_Chr(ch): why introduce a new function? Wasn't
PyUnicode_FromOrdinal enough?

"GDB Debugging Hooks": It's not done yet.

"None of the functions in this PEP become part of the stable ABI (PEP
384)." Why? Some functions don't depend on the internal representation,
like PyUnicode_Substring or PyUnicode_FindChar.

Typo: "In order to port modules to the new API, try to eliminate the use of
these API elements: ... PyUnicode_GET_LENGTH ...". PyUnicode_GET_LENGTH is
part of the new API; I suppose that you mean PyUnicode_GET_SIZE.

Victor


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-26 Thread David Malcolm
On Tue, 2011-09-27 at 00:19 +0200, Victor Stinner wrote:
 Hi,
 
 On Monday 26 September 2011 23:00:06, Guido van Rossum wrote:
  So, if you have the time, please review PEP 393 and/or play with the
  code (the repo is linked from the PEP's References section now).

 
 PEP
 ===

 "GDB Debugging Hooks": It's not done yet.
I can do these if need be, but IIRC you (Victor) said on #python-dev
that you were already working on them.
