Re: [Python-Dev] PEP 393 close to pronouncement

2011-10-11 Thread M.-A. Lemburg
Victor Stinner wrote:
 Given that I've been working on and maintaining the Python Unicode
 implementation actively or by providing assistance for almost
 12 years now, I've also thought about whether it's still worth
 the effort.
 
 Thanks for your huge work on Unicode, Marc-Andre!

Thanks. I enjoyed working on it, but priorities are different
now, and new projects are waiting :-)

 My interests have shifted somewhat into other directions and
 I feel that helping Python reach world domination in other ways
 makes me happier than fighting over Unicode standards, implementations,
 special cases that aren't special enough, and all those other
 nitty-gritty details that cause long discussions :-)
 
 Someone said that we still need to define what a character is! By the way,
 what is a code point?

I'll leave that as an exercise for the interested reader to find out :-)

(Hint: Google should find enough hits where I've explained those things
on various mailing lists and in talks I gave.)

 So I feel that the PEP 393 change is a good time to draw a line
 and leave Unicode maintenance to Ezio, Victor, Martin, and
 all the others that have helped over the years. I know it's
 in good hands.
 
 I don't understand why you would like to stop contributing to Unicode, but

I only have limited time available for these things and am
nowadays more interested in getting others to recognize just
how great Python is, than actually sitting down and writing
patches for it.

Unicode was my baby for quite a few years, but I now have two
kids which need more love and attention :-)

 well, as you want. We will try to continue your work.

Thanks.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread M.-A. Lemburg
Guido van Rossum wrote:
 Given the feedback so far, I am happy to pronounce PEP 393 as
 accepted. Martin, congratulations! Go ahead and mark it as Accepted.
 (But please do fix up the small nits that Victor reported in his
 earlier message.)

I've been working on feedback for the last few days, but I guess it's
too late. Here goes anyway...

I've only read the PEP and not followed the discussion due to lack of
time, so if any of this is no longer valid, that's probably because
the PEP wasn't updated :-)

Resizing
--------

Codecs use resizing a lot. Given that PyCompactUnicodeObject
does not support resizing, most decoders will have to use
PyUnicodeObject and thus not benefit from the memory footprint
advantages of e.g. PyASCIIObject.
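
For reference, the classic decoder pattern I mean looks roughly like
this (a sketch only, not actual codec source; the names are
illustrative):

/* Sketch of the classic decoder pattern: overallocate a Py_UNICODE
   buffer, decode into it, then shrink the result in place with
   PyUnicode_Resize(). */
PyObject *
decode_sketch(const char *input, Py_ssize_t size)
{
    PyObject *unicode = PyUnicode_FromUnicode(NULL, size);
    Py_UNICODE *p;
    Py_ssize_t outpos = 0;

    if (unicode == NULL)
        return NULL;
    p = PyUnicode_AS_UNICODE(unicode);
    /* ... decode 'input', writing code units to p[outpos++] ... */
    if (PyUnicode_Resize(&unicode, outpos) < 0) {  /* shrink to fit */
        Py_XDECREF(unicode);
        return NULL;
    }
    return unicode;
}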


Data structure
--------------

The data structure description in the PEP appears to be wrong:

PyASCIIObject has a wchar_t *wstr pointer - I guess this should
be a char *str pointer; otherwise, where's the memory footprint
advantage (esp. on Linux where sizeof(wchar_t) == 4)?

I also don't see a reason to limit the UCS1 storage version
to ASCII. Accordingly, the object should be called PyLatin1Object
or PyUCS1Object.

Here's the version from the PEP:


typedef struct {
  PyObject_HEAD
  Py_ssize_t length;
  Py_hash_t hash;
  struct {
  unsigned int interned:2;
  unsigned int kind:2;
  unsigned int compact:1;
  unsigned int ascii:1;
  unsigned int ready:1;
  } state;
  wchar_t *wstr;
} PyASCIIObject;

typedef struct {
  PyASCIIObject _base;
  Py_ssize_t utf8_length;
  char *utf8;
  Py_ssize_t wstr_length;
} PyCompactUnicodeObject;


Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing
code will cause problems on some systems where wchar_t is a
signed type.

Python assumes that Py_UNICODE is unsigned and thus doesn't
check for negative values or take these into account when
doing range checks or code point arithmetic.
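
For illustration (sketch only, not CPython source):

/* Correct only if Py_UNICODE is unsigned: with a signed wchar_t,
   a negative (e.g. sign-extended) value would also satisfy
   ch < 128 and could then index a 128-entry table out of bounds. */
static int
is_ascii(Py_UNICODE ch)
{
    return ch < 128;
}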

On platforms where wchar_t is signed, it is safer to
typedef Py_UNICODE to unsigned wchar_t.

Accordingly, and to prevent further breakage, Py_UNICODE
should not be deprecated and should instead be used in place
of wchar_t throughout the code.


Length information
------------------

Py_UNICODE access to the objects assumes that len(obj) ==
length of the Py_UNICODE buffer. The PEP suggests that length
should not take surrogates into account on UCS2 platforms
such as Windows. This causes len(obj) to not match len(wstr).

As a result, Py_UNICODE access to the Unicode objects breaks
when surrogate code points are present in the Unicode object
on UCS2 platforms.

The PEP also does not explain how lone surrogates will be
handled with respect to the length information.

Furthermore, determining len(obj) will require a loop over
the data, checking for surrogate code points. A simple memcpy()
is no longer enough.
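
The kind of loop I mean would look like this (hypothetical helper,
assuming a 16-bit Py_UNICODE, i.e. a narrow build):

/* Count code points in a Py_UNICODE buffer: a surrogate pair
   counts as one code point, so the result can be smaller than
   the number of code units. */
static Py_ssize_t
count_code_points(const Py_UNICODE *buf, Py_ssize_t nunits)
{
    Py_ssize_t i, count = 0;
    for (i = 0; i < nunits; i++) {
        count++;
        if (buf[i] >= 0xD800 && buf[i] <= 0xDBFF /* high surrogate */
            && i + 1 < nunits
            && buf[i+1] >= 0xDC00 && buf[i+1] <= 0xDFFF)
            i++;    /* skip the low half of the pair */
    }
    return count;
}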

I suggest dropping the idea of having len(obj) not count
wstr surrogate code points to maintain backwards compatibility
and allow for working with lone surrogates.

Note that the whole surrogate debate does not have much to
do with this PEP, since it's mainly about memory footprint
savings. I'd also urge a reality check with respect
to surrogates and non-BMP code points: in practice you only
very rarely see any non-BMP code points in your data. Making
all Python users pay for the needs of a tiny fraction is
not really fair. Remember: practicality beats purity.


API
---

Victor already described the needed changes.


Performance
-----------

The PEP only lists a few low-level benchmarks as the basis for the
performance decrease. I'm missing some more adequate real-life
tests, e.g. using an application framework such as Django
(to the extent this is possible with Python3) or a server
like the Radicale calendar server (which is available for Python3).

I'd also like to see a performance comparison which specifically
uses the existing Unicode APIs to create and work with Unicode
objects. Most extensions will use this way of working with the
Unicode API, either because they want to support Python 2 and 3,
or because the effort it takes to port to the new APIs is
too high. The PEP makes some statements that this is slower,
but doesn't quantify those statements.


Memory savings
--------------

The table only lists string sizes up to 8 code points. The memory
savings for these are really only significant for ASCII
strings on 64-bit platforms, if you use the default UCS2
Python build as basis.

For larger strings, I expect the savings to be more significant.
OTOH, a single non-BMP code point in such a string would cause
the savings to drop significantly again.


Complexity
----------

In order to benefit from the new API, any code that has to
deal with low-level Py_UNICODE access to the Unicode objects
will have to be adapted.

For best performance, each algorithm will have to be implemented
for all three storage types.

Not doing so will result in a slow-down, if I read the PEP
correctly. It's difficult to say of what scale, since that
information

Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread Benjamin Peterson
2011/9/28 M.-A. Lemburg m...@egenix.com:
 Guido van Rossum wrote:
 Given the feedback so far, I am happy to pronounce PEP 393 as
 accepted. Martin, congratulations! Go ahead and mark it as Accepted.
 (But please do fix up the small nits that Victor reported in his
 earlier message.)

 I've been working on feedback for the last few days, but I guess it's
 too late. Here goes anyway...

 I've only read the PEP and not followed the discussion due to lack of
 time, so if any of this is no longer valid, that's probably because
 the PEP wasn't updated :-)

 Resizing
 --------

 Codecs use resizing a lot. Given that PyCompactUnicodeObject
 does not support resizing, most decoders will have to use
 PyUnicodeObject and thus not benefit from the memory footprint
 advantages of e.g. PyASCIIObject.


 Data structure
 --------------

 The data structure description in the PEP appears to be wrong:

 PyASCIIObject has a wchar_t *wstr pointer - I guess this should
 be a char *str pointer, otherwise, where's the memory footprint
 advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

 I also don't see a reason to limit the UCS1 storage version
 to ASCII. Accordingly, the object should be called PyLatin1Object
 or PyUCS1Object.

I think the purpose is that if it's only ASCII, no work is needed to
encode to UTF-8.


-- 
Regards,
Benjamin


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread Martin v. Löwis
 Codecs use resizing a lot. Given that PyCompactUnicodeObject
 does not support resizing, most decoders will have to use
 PyUnicodeObject and thus not benefit from the memory footprint
 advantages of e.g. PyASCIIObject.

No, codecs have been rewritten to not use resizing.

 PyASCIIObject has a wchar_t *wstr pointer - I guess this should
 be a char *str pointer, otherwise, where's the memory footprint
 advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

That's the Py_UNICODE representation for backwards compatibility.
It's normally NULL.

 I also don't see a reason to limit the UCS1 storage version
 to ASCII. Accordingly, the object should be called PyLatin1Object
 or PyUCS1Object.

No, in the ASCII case, the UTF-8 length can be shared with the regular
string length - not so for Latin-1 characters above 127.
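
Roughly, the sharing works like this (a simplified sketch, not the
exact implementation; obj is any ready string):

/* ASCII data is already valid UTF-8, so the utf8 pointer and
   length can alias the canonical representation. A Latin-1
   character above 127, e.g. 0xE9 ('é'), becomes two UTF-8 bytes
   (0xC3 0xA9), so buffer and length would diverge. */
const char *utf8;
Py_ssize_t utf8_length;
if (PyUnicode_IS_ASCII(obj)) {
    utf8 = PyUnicode_DATA(obj);              /* shared, no copy */
    utf8_length = PyUnicode_GET_LENGTH(obj); /* same length */
}
else {
    /* must encode into a separate, cached buffer */
}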

 Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing
 code will cause problems on some systems where wchar_t is a
 signed type.
 
 Python assumes that Py_UNICODE is unsigned and thus doesn't
 check for negative values or take these into account when
 doing range checks or code point arithmetic.
 
 On platforms where wchar_t is signed, it is safer to
 typedef Py_UNICODE to unsigned wchar_t.

No. Py_UNICODE values *must* be in the range 0..17*2**16-1.
Values of 17*2**16 or larger are just as bad as negative
values, so having Py_UNICODE unsigned doesn't improve
anything.

 Py_UNICODE access to the objects assumes that len(obj) ==
 length of the Py_UNICODE buffer. The PEP suggests that length
 should not take surrogates into account on UCS2 platforms
 such as Windows. This causes len(obj) to not match len(wstr).

Correct.

 As a result, Py_UNICODE access to the Unicode objects breaks
 when surrogate code points are present in the Unicode object
 on UCS2 platforms.

Incorrect. What specifically do you think would break?

 The PEP also does not explain how lone surrogates will be
 handled with respect to the length information.

Just as any other code point. Python does not special-case
surrogate code points anymore.

 Furthermore, determining len(obj) will require a loop over
 the data, checking for surrogate code points. A simple memcpy()
 is no longer enough.

No, it won't. The length of the Unicode object is stored in
the length field.

 I suggest dropping the idea of having len(obj) not count
 wstr surrogate code points to maintain backwards compatibility
 and allow for working with lone surrogates.

Backwards-compatibility is fully preserved by PyUnicode_GET_SIZE
returning the size of the Py_UNICODE buffer. PyUnicode_GET_LENGTH
returns the true length of the Unicode object.
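
For example (a sketch, assuming a platform with a 16-bit wchar_t
such as Windows):

/* One non-BMP code point: U+1D11E (musical symbol G clef). */
PyObject *s = PyUnicode_FromOrdinal(0x1D11E);
Py_ssize_t units = PyUnicode_GET_SIZE(s);    /* legacy: Py_UNICODE
                                                units; 2 here, a
                                                surrogate pair */
Py_ssize_t points = PyUnicode_GET_LENGTH(s); /* new API: always 1 */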

 Note that the whole surrogate debate does not have much to
 do with this PEP, since it's mainly about memory footprint
 savings. I'd also urge a reality check with respect
 to surrogates and non-BMP code points: in practice you only
 very rarely see any non-BMP code points in your data. Making
 all Python users pay for the needs of a tiny fraction is
 not really fair. Remember: practicality beats purity.

That's the whole point of the PEP. You only pay for what
you actually need, and in most cases, it's ASCII.

 For best performance, each algorithm will have to be implemented
 for all three storage types.

This will be a trade-off. I think most developers will be happy
with a single version covering all three cases, especially as it's
much more maintainable.

Kind regards,
Martin



Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread Victor Stinner
 Resizing
 --------
 
 Codecs use resizing a lot. Given that PyCompactUnicodeObject
 does not support resizing, most decoders will have to use
 PyUnicodeObject and thus not benefit from the memory footprint
 advantages of e.g. PyASCIIObject.

Wrong. Even if you create a string using the legacy API (e.g.
PyUnicode_FromUnicode), the string will be quickly compacted to use the most
efficient memory storage (depending on the maximum character). Quickly: at
the first call to PyUnicode_READY. Python tries to make all strings ready as
early as possible.
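
For example (a sketch using the public 3.3 API):

static PyObject *
legacy_create_sketch(void)
{
    /* Created through the legacy API: the object starts with only
       a Py_UNICODE (wstr) buffer and is not "ready" yet. */
    PyObject *s = PyUnicode_FromUnicode(NULL, 3);
    if (s == NULL)
        return NULL;
    PyUnicode_AS_UNICODE(s)[0] = 'a';
    PyUnicode_AS_UNICODE(s)[1] = 'b';
    PyUnicode_AS_UNICODE(s)[2] = 'c';
    /* The first PyUnicode_READY() call scans for the maximum
       character and compacts the data: here, 1 byte/character. */
    if (PyUnicode_READY(s) < 0) {
        Py_DECREF(s);
        return NULL;
    }
    assert(PyUnicode_KIND(s) == PyUnicode_1BYTE_KIND);
    return s;
}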

 PyASCIIObject has a wchar_t *wstr pointer - I guess this should
 be a char *str pointer, otherwise, where's the memory footprint
 advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

For pure ASCII strings, you don't have to store a pointer to the UTF-8 string, 
nor the length of the UTF-8 string (in bytes), nor the length of the wchar_t 
string (in wide characters): the length is always the length of the ASCII 
string, and the UTF-8 string is shared with the ASCII string. The structure is 
much smaller thanks to these optimizations, and so Python 3.3 uses less memory 
than 2.7 for ASCII strings, even for short strings.

 I also don't see a reason to limit the UCS1 storage version
 to ASCII. Accordingly, the object should be called PyLatin1Object
 or PyUCS1Object.

Latin1 is less interesting: you cannot share the length/data fields with
utf8 or wstr. We didn't add a special case for Latin1 strings (except using
Py_UCS1* strings to store their characters).

 Furthermore, determining len(obj) will require a loop over
 the data, checking for surrogate code points. A simple memcpy()
 is no longer enough.

Wrong. len(obj) gives the right result (see the long discussion about "what
is the length of a string" in a previous thread...) in O(1), since it is
computed when the string is created.

 ... in practice you only
 very rarely see any non-BMP code points in your data. Making
 all Python users pay for the needs of a tiny fraction is
 not really fair. Remember: practicality beats purity.

The creation of the string is maybe a little bit slower (especially when you
have to scan the string twice to first get the maximum character), but I
think that this slowdown is smaller than the speedup allowed by the PEP.

Because ASCII strings are now char*, I think that processing ASCII strings
is faster because the CPU can fit more data in its caches.

We can do better optimizations on ASCII and Latin1 strings (it's faster to
manipulate char* than uint16_t* or uint32_t*). For example, str.center(),
str.ljust(), str.rjust() and str.zfill() now use the very fast memset()
function to pad Latin1 strings.
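
That fast path looks roughly like this (simplified sketch; result,
src, len, fill, left and right are illustrative variables):

/* One byte per character makes padding a plain memset(). */
if (PyUnicode_KIND(result) == PyUnicode_1BYTE_KIND) {
    Py_UCS1 *data = PyUnicode_1BYTE_DATA(result);
    memset(data, fill, left);                  /* left padding */
    memcpy(data + left, src, len);             /* the characters */
    memset(data + left + len, fill, right);    /* right padding */
}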

Another example: duplicating a string (or creating a substring) should be
faster just because you have less data to copy (e.g. 10 bytes for a string
of 10 Latin1 characters vs. 20 or 40 bytes with Python 3.2).

The two most common encodings in the world are ASCII and UTF-8. With PEP
393, encoding to ASCII or UTF-8 is free: you don't have to encode anything,
you directly have the encoded char* buffer (whereas you have to convert
16/32 bit wchar_t to char* in Python 3.2, even for pure ASCII). (It's also
free to encode a Latin1 Unicode string to Latin1.)

With PEP 393, we never have to decode UTF-16 anymore when iterating on code
points to correctly support non-BMP characters (which was required before on
narrow builds, e.g. on Windows). Iterating on code points is now just a
simple loop; there is no need to check whether each character is in the
range U+D800-U+DFFF.
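
The iteration idiom looks like this (sketch; s is any ready string):

/* PyUnicode_READ() yields full code points for all three storage
   kinds, so no surrogate handling is needed. */
int kind = PyUnicode_KIND(s);
void *data = PyUnicode_DATA(s);
Py_ssize_t i, len = PyUnicode_GET_LENGTH(s);
for (i = 0; i < len; i++) {
    Py_UCS4 ch = PyUnicode_READ(kind, data, i);
    /* ch may be > U+FFFF; no U+D800..U+DFFF check required */
}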

There are other funny tricks (optimizations). For example, text.replace(a,
b) knows that there is nothing to do if maxchar(a) > maxchar(text), where
maxchar(obj) just requires reading an attribute of the string. Think about
ASCII and non-ASCII strings: pure_ascii.replace('\xe9', '') now just
creates a new reference...
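
Something like this (sketch; a and text are the argument and the
target string):

/* If the search string contains a character larger than any
   character in the text, it cannot occur in the text at all. */
if (PyUnicode_MAX_CHAR_VALUE(a) > PyUnicode_MAX_CHAR_VALUE(text)) {
    Py_INCREF(text);
    return text;    /* just a new reference, no copy */
}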

I don't think that Martin wrote his PEP to be able to implement all these
optimisations, but they are an interesting side effect of his PEP :-)

 The table only lists string sizes up to 8 code points. The memory
 savings for these are really only significant for ASCII
 strings on 64-bit platforms, if you use the default UCS2
 Python build as basis.

Of the 32 different cases, PEP 393 is better in 29 and just as good as
Python 3.2 in 3 corner cases:

- 1 ASCII, 16-bit wchar, 32-bit
- 1 Latin1, 32-bit wchar, 32-bit
- 2 Latin1, 32-bit wchar, 32-bit

Do you really care about these corner cases? See the more realistic
benchmark in Martin's previous email ("PEP 393 memory savings update"): PEP
393 not only uses 3x less memory than 3.2, it also uses *less* memory than
Python 2.7, even though Python 3 uses Unicode for everything!

 For larger strings, I expect the savings to be more significant.

Sure.

 OTOH, a single non-BMP code point in such a string would cause
 the savings to drop significantly again.

In this case, it's just as good as Python 3.2 in wide mode, but worse 

Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-27 Thread Martin v. Löwis
 "GDB Debugging Hooks": It's not done yet.
 I can do these if need be, but IIRC you (Victor) said on #python-dev
 that you were already working on them.

I already changed it for an earlier version of the PEP. It still needs
updating to sort out the various compact representations. I could do that
as well, so don't worry.

Regards,
Martin


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-27 Thread Victor Stinner
On Tuesday 27 September 2011 00:19:02, Victor Stinner wrote:
 On Windows, there is just one failure in test_configparser, I
 didn't investigate it yet

Oh, it was a real bug in io.IncrementalNewlineDecoder. It is now fixed.

Victor


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-27 Thread Guido van Rossum
Given the feedback so far, I am happy to pronounce PEP 393 as
accepted. Martin, congratulations! Go ahead and mark it as Accepted.
(But please do fix up the small nits that Victor reported in his
earlier message.)

-- 
--Guido van Rossum (python.org/~guido)


[Python-Dev] PEP 393 close to pronouncement

2011-09-26 Thread Guido van Rossum
Martin has asked me to pronounce on PEP 393, after he's updated it in
response to various feedback (including mine :-). I'm currently
looking very favorably on it, but I thought I'd give folks here one
more chance to bring up showstoppers.

So, if you have the time, please review PEP 393 and/or play with the
code (the repo is linked from the PEP's References section now).

Please limit your feedback to show-stopping issues; we're past the
stage of bikeshedding here. It's Good Enough (TM) and we'll have the
rest of the 3.3 release cycle to improve it incrementally. But we need
to get to the point where the code can be committed to the 3.3 branch.

In a few days I'll pronounce.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-26 Thread Victor Stinner
Hi,

On Monday 26 September 2011 23:00:06, Guido van Rossum wrote:
 So, if you have the time, please review PEP 393 and/or play with the
 code (the repo is linked from the PEP's References section now).

I played with the code. The full test suite passes on Linux, FreeBSD and
Windows. On Windows, there is just one failure in test_configparser, I
didn't investigate it yet. I like the new API: a classic loop on the string
length, and a macro to read the nth character. The backward compatibility
is fully transparent and is already well tested, because some modules still
use the legacy API.
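
That loop style looks like this (sketch; str is any ready string):

Py_ssize_t i, len = PyUnicode_GET_LENGTH(str);
for (i = 0; i < len; i++) {
    /* PyUnicode_READ_CHAR() re-checks the kind on each call;
       hoisting PyUnicode_KIND()/PyUnicode_DATA() out of the loop
       and using PyUnicode_READ() is the faster variant. */
    Py_UCS4 ch = PyUnicode_READ_CHAR(str, i);
    /* ... use ch ... */
}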

It's quite easy to move from the legacy API to the new API. It's just boring, 
but it's almost done in the core (unicodeobject.c, but also some modules like 
_io).

Since the introduction of PyASCIIObject, PEP 393 is really good in memory
footprint, especially for ASCII-only strings. In Python, you manipulate a
lot of ASCII strings.


PEP
===

It's not clear what is deprecated. It would help to have a full list of the 
deprecated functions/macros.

Sometimes Martin wrote PyUnicode_Ready, sometimes PyUnicode_READY. It's 
confusing.

Typo: PyUnicode_FAST_READY should be PyUnicode_READY.

PyUnicode_WRITE_CHAR is not listed in the New API section.

Typo in PyUnicode_CONVERT_BYTES(from_type, tp_type, begin, end, to):
tp_type should be to_type.

PyUnicode_Chr(ch): why introduce a new function? Wasn't
PyUnicode_FromOrdinal enough?

"GDB Debugging Hooks": It's not done yet.

"None of the functions in this PEP become part of the stable ABI (PEP
384)." Why? Some functions don't depend on the internal representation,
like PyUnicode_Substring or PyUnicode_FindChar.

Typo: "In order to port modules to the new API, try to eliminate the use of
these API elements: ... PyUnicode_GET_LENGTH ...". PyUnicode_GET_LENGTH is
part of the new API; I suppose that you mean PyUnicode_GET_SIZE.

Victor


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-26 Thread David Malcolm
On Tue, 2011-09-27 at 00:19 +0200, Victor Stinner wrote:
 Hi,
 
 On Monday 26 September 2011 23:00:06, Guido van Rossum wrote:
  So, if you have the time, please review PEP 393 and/or play with the
  code (the repo is linked from the PEP's References section now).

 
 PEP
 ===

 "GDB Debugging Hooks": It's not done yet.
I can do these if need be, but IIRC you (Victor) said on #python-dev
that you were already working on them.
