Re: [Python-Dev] range objects in 3.x

2011-09-28 Thread Fernando Perez
On Thu, 29 Sep 2011 11:36:21 +1300, Greg Ewing wrote:


>> I do hope, though, that the chosen name is *not*:
>> 
>> - 'interval'
>> 
>> - 'interpolate' or similar
> 
> Would 'subdivide' be acceptable?

I'm not great at finding names, and I don't totally love it, but I 
certainly don't see any problems with it.  It is, after all, a subdivision 
of an interval :)

I think 'grid' has been mentioned, and I think it's reasonable, even 
though most people probably associate the word with a two-dimensional 
object.  But grids can have any desired dimensionality.

Now, in fact, numpy has a slightly demented (but extremely useful) ogrid 
object:

In [7]: ogrid[0:10:3]
Out[7]: array([0, 3, 6, 9])

In [8]: ogrid[0:10:3j]
Out[8]: array([  0.,   5.,  10.])

Yup, that's a complex slice :) (numpy reads an imaginary step as a count of 
points, endpoint included, rather than as a stride).

So if Python named the builtin 'grid', I think it would go well with 
existing numpy habits.
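
For concreteness, here's a rough pure-Python sketch of what such a builtin 
could look like (the name and the endpoint-inclusive behaviour are assumed 
here just for illustration, not a concrete proposal):

def grid(start, stop, num):
    # num evenly spaced points from start to stop, inclusive; each point
    # is computed from the endpoints, so floating-point error does not
    # accumulate the way it would with repeated addition of a step
    if num < 2:
        raise ValueError("need at least two points")
    span = stop - start
    for i in range(num):
        yield start + span * i / (num - 1)

list(grid(0, 10, 3)) would then give [0.0, 5.0, 10.0], matching the ogrid 
output above.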

Cheers,

f



Re: [Python-Dev] [Python-checkins] cpython: Enhance Py_ARRAY_LENGTH(): fail at build time if the argument is not an array

2011-09-28 Thread Victor Stinner
On Thursday, September 29, 2011 at 02:07:02, Benjamin Peterson wrote:
> 2011/9/28 victor.stinner :
> > http://hg.python.org/cpython/rev/36fc514de7f0
> > changeset:   72512:36fc514de7f0
> > user:        Victor Stinner
> > date:        Thu Sep 29 01:12:24 2011 +0200
> > summary:
> >  Enhance Py_ARRAY_LENGTH(): fail at build time if the argument is not an
> > array
> > 
> > Move various other macros to pymacro.h
> > 
> > Thanks Rusty Russell for having written these amazing C macros!
> > 
> > files:
> >  Include/Python.h  |  19 +
> >  Include/pymacro.h |  57 +++
> 
> Do we really need a new file? Why not pyport.h where other compiler stuff
> goes?

I'm not sure that pyport.h is the right place to add Py_MIN, Py_MAX and 
Py_ARRAY_LENGTH. pyport.h seems to be for things specific to the platform, 
like INT_MAX, Py_VA_COPY, ... pymacro.h contains platform independent 
macros.

I would like to suggest the opposite: move platform independent macros from 
pyport.h to pymacro.h :-) Suggestions:
 - Py_ARITHMETIC_RIGHT_SHIFT
 - Py_FORCE_EXPANSION
 - Py_SAFE_DOWNCAST

Victor


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread Victor Stinner
> Resizing
> --------
> 
> Codecs use resizing a lot. Given that PyCompactUnicodeObject
> does not support resizing, most decoders will have to use
> PyUnicodeObject and thus not benefit from the memory footprint
> advantages of e.g. PyASCIIObject.

Wrong. Even if you create a string using the legacy API (e.g. 
PyUnicode_FromUnicode), the string will be quickly compacted to use the most 
efficient memory storage (depending on the maximum character). "Quickly": at 
the first call to PyUnicode_READY. Python tries to make all strings ready as 
early as possible.

> PyASCIIObject has a wchar_t *wstr pointer - I guess this should
> be a char *str pointer, otherwise, where's the memory footprint
> advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

For pure ASCII strings, you don't have to store a pointer to the UTF-8 string, 
nor the length of the UTF-8 string (in bytes), nor the length of the wchar_t 
string (in wide characters): the length is always the length of the "ASCII" 
string, and the UTF-8 string is shared with the ASCII string. The structure is 
much smaller thanks to these optimizations, and so Python 3.3 uses less memory 
than 2.7 for ASCII strings, even for short strings.
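
As a rough illustration (exact numbers vary by platform and build), on a 
64-bit CPython 3.3 an ASCII string costs about 49 bytes of fixed overhead 
plus one byte per character:

>>> import sys
>>> sys.getsizeof('')     # fixed overhead only
49
>>> sys.getsizeof('abc')  # plus one byte per ASCII character
52

With Python 3.2, each character costs 2 or 4 bytes (narrow or wide build) on 
top of the fixed overhead.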

> I also don't see a reason to limit the UCS1 storage version
> to ASCII. Accordingly, the object should be called PyLatin1Object
> or PyUCS1Object.

Latin-1 is less interesting: you cannot share the length/data fields with 
utf8 or wstr. We didn't add a special case for Latin-1 strings (except using 
Py_UCS1* strings to store their characters).

> Furthermore, determining len(obj) will require a loop over
> the data, checking for surrogate code points. A simple memcpy()
> is no longer enough.

Wrong. len(obj) gives the "right" result (see the long discussion about what 
is the length of a string in a previous thread...) in O(1) since it's computed 
when the string is created.

> ... in practice you only
> very rarely see any non-BMP code points in your data. Making
> all Python users pay for the needs of a tiny fraction is
> not really fair. Remember: practicality beats purity.

The creation of the string is maybe a little bit slower (especially when you 
have to scan the string twice to first get the maximum character), but I think 
that this slowdown is smaller than the speedup allowed by the PEP.

Because ASCII strings are now char*, I think that processing ASCII strings is 
faster because the CPU can fit more of the data in its caches.

We can optimize ASCII and Latin-1 strings further (it's faster to manipulate 
char* than uint16_t* or uint32_t*). For example, str.center(), str.ljust(), 
str.rjust() and str.zfill() now use the very fast memset() function to pad 
Latin-1 strings.

Another example: duplicating a string (or creating a substring) should be 
faster simply because there is less data to copy (e.g. 10 bytes for a string 
of 10 Latin-1 characters vs 20 or 40 bytes with Python 3.2).

The two most common encodings in the world are ASCII and UTF-8. With PEP 393, 
encoding to ASCII or UTF-8 is free: you don't have to encode anything, you 
directly have the encoded char* buffer (whereas in Python 3.2 you have to 
convert 16/32-bit wchar_t to char*, even for pure ASCII). (Encoding a 
"Latin-1" Unicode string to Latin-1 is also free.)

With PEP 393, we never have to decode UTF-16 anymore when iterating over code 
points to support non-BMP characters correctly (which was previously required 
in narrow builds, e.g. on Windows). Iterating over code points is now a plain 
loop; there is no need to check whether each character is in the range 
U+D800-U+DFFF.

There are other funny tricks (optimizations). For example, text.replace(a, b) 
knows that there is nothing to do if maxchar(a) > maxchar(text), where 
maxchar(obj) only requires reading an attribute of the string. Think about 
ASCII and non-ASCII strings: pure_ascii.replace('\xe9', '') now just creates a 
new reference...
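
You can even observe this no-op from pure Python, although this is a CPython 
implementation detail rather than a language guarantee:

>>> s = 'hello'                 # maxchar(s) < maxchar('\xe9')
>>> s.replace('\xe9', '') is s  # nothing to do: same object returned
True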

I don't think that Martin wrote his PEP to be able to implement all these 
optimizations, but they are an interesting side effect of his PEP :-)

> The table only lists string sizes up to 8 code points. The memory
> savings for these are really only significant for ASCII
> strings on 64-bit platforms, if you use the default UCS2
> Python build as basis.

Of the 32 different cases, PEP 393 is better in 29 and "just" as good as 
Python 3.2 in 3 corner cases:

- 1 ASCII, 16-bit wchar, 32-bit
- 1 Latin1, 32-bit wchar, 32-bit
- 2 Latin1, 32-bit wchar, 32-bit

Do you really care about these corner cases? See the more realistic benchmark 
in Martin's earlier email ("PEP 393 memory savings update"): PEP 393 not only 
uses 3x less memory than 3.2, it also uses *less* memory than Python 2.7, even 
though Python 3 uses Unicode for everything!

> For larger strings, I expect the savings to be more significant.

Sure.

> OTOH, a single non-BMP code point in such a string would cause
> the savings to drop significantly again.

In this case, it's just as good as Python 3.2.

Re: [Python-Dev] [Python-checkins] cpython: Enhance Py_ARRAY_LENGTH(): fail at build time if the argument is not an array

2011-09-28 Thread Benjamin Peterson
2011/9/28 victor.stinner :
> http://hg.python.org/cpython/rev/36fc514de7f0
> changeset:   72512:36fc514de7f0
> user:        Victor Stinner 
> date:        Thu Sep 29 01:12:24 2011 +0200
> summary:
>  Enhance Py_ARRAY_LENGTH(): fail at build time if the argument is not an array
>
> Move various other macros to pymacro.h
>
> Thanks Rusty Russell for having written these amazing C macros!
>
> files:
>  Include/Python.h          |  19 +
>  Include/pymacro.h         |  57 +++

Do we really need a new file? Why not pyport.h where other compiler stuff goes?


-- 
Regards,
Benjamin


Re: [Python-Dev] [Python-checkins] cpython: Implement PEP 393.

2011-09-28 Thread Eric V. Smith
Is there some reason str.format had such major surgery done to it? It
appears parts of it were removed from stringlib. I had not even thought
to look at the code before it was merged, as it never occurred to me
anyone would do that.

I left it in stringlib even in 3.x because there's the occasional talk
of adding bytes.bformat, and since all of the code works well with
stringlib (since it was used by str and unicode in 2.x), it made sense
to leave it there.

In addition, there are outstanding patches that are now broken.

I'd prefer it return to how it used to be, and just the minimum changes
required for PEP 393 be made to it.

Thanks.
Eric.

On 9/28/2011 2:35 AM, martin.v.loewis wrote:
> http://hg.python.org/cpython/rev/8beaa9a37387
> changeset:   72475:8beaa9a37387
> user:        Martin v. Löwis
> date:        Wed Sep 28 07:41:54 2011 +0200
> summary:
>   Implement PEP 393.
> 
> files:
>   Doc/c-api/unicode.rst  | 9 +
>   Include/Python.h   | 5 +
>   Include/complexobject.h| 5 +-
>   Include/floatobject.h  | 5 +-
>   Include/longobject.h   | 6 +-
>   Include/pyerrors.h | 6 +
>   Include/pyport.h   | 3 +
>   Include/unicodeobject.h|   783 +-
>   Lib/json/decoder.py| 3 +-
>   Lib/test/json_tests/test_scanstring.py |11 +-
>   Lib/test/test_codeccallbacks.py| 7 +-
>   Lib/test/test_codecs.py| 4 +
>   Lib/test/test_peepholer.py | 4 -
>   Lib/test/test_re.py| 7 +
>   Lib/test/test_sys.py   |38 +-
>   Lib/test/test_unicode.py   |41 +-
>   Makefile.pre.in| 6 +-
>   Misc/NEWS  | 2 +
>   Modules/_codecsmodule.c| 8 +-
>   Modules/_csv.c | 2 +-
>   Modules/_ctypes/_ctypes.c  | 6 +-
>   Modules/_ctypes/callproc.c | 8 -
>   Modules/_ctypes/cfield.c   |64 +-
>   Modules/_cursesmodule.c| 7 +-
>   Modules/_datetimemodule.c  |13 +-
>   Modules/_dbmmodule.c   |12 +-
>   Modules/_elementtree.c |31 +-
>   Modules/_io/_iomodule.h| 2 +-
>   Modules/_io/stringio.c |69 +-
>   Modules/_io/textio.c   |   352 +-
>   Modules/_json.c|   252 +-
>   Modules/_pickle.c  | 4 +-
>   Modules/_sqlite/connection.c   |19 +-
>   Modules/_sre.c |   382 +-
>   Modules/_testcapimodule.c  | 2 +-
>   Modules/_tkinter.c |70 +-
>   Modules/arraymodule.c  | 8 +-
>   Modules/md5module.c|10 +-
>   Modules/operator.c |27 +-
>   Modules/pyexpat.c  |11 +-
>   Modules/sha1module.c   |10 +-
>   Modules/sha256module.c |10 +-
>   Modules/sha512module.c |10 +-
>   Modules/sre.h  | 4 +-
>   Modules/syslogmodule.c |14 +-
>   Modules/unicodedata.c  |28 +-
>   Modules/zipimport.c|   141 +-
>   Objects/abstract.c | 4 +-
>   Objects/bytearrayobject.c  |   147 +-
>   Objects/bytesobject.c  |   127 +-
>   Objects/codeobject.c   |15 +-
>   Objects/complexobject.c|19 +-
>   Objects/dictobject.c   |20 +-
>   Objects/exceptions.c   |26 +-
>   Objects/fileobject.c   |17 +-
>   Objects/floatobject.c  |19 +-
>   Objects/longobject.c   |84 +-
>   Objects/moduleobject.c | 9 +-
>   Objects/object.c   |10 +-
>   Objects/setobject.c|40 +-
>   Objects/stringlib/count.h  | 9 +-
>   Objects/stringlib/eq.h |23 +-
>   Objects/stringlib/fastsearch.h | 4 +-
>   Objects/stringlib/find.h   |31 +-
>   Objects/stringlib/formatter.h  |  1516 --
>   Objects/stringlib/localeutil.h |27 +-
>   Objects/stringlib/partition.h  |12 +-
>   Objects/stringlib/split.h  |26 +-
>   Objects/stringlib/string_format.h  |  1385 --
>   Objects/stringlib/stringdefs.h | 2 +
>   Objects/stringlib/ucs1lib.h|35 +
>   Objects/stringlib/ucs2lib.h|34 +
>   Objects/stringlib/ucs4lib.h|34 +
>   Objects/stringlib/undef.h  |10 +
>   Objects/stringlib/unicode_format.h |  1416 ++
>   Objects/stringlib/unicodedefs.h| 2 +
>   Obj

Re: [Python-Dev] range objects in 3.x

2011-09-28 Thread Greg Ewing

Fernando Perez wrote:

> Now, I *suspect* (but don't remember for sure) that the option to have it
> right-hand-open-ended was to match the mental model people have for range:
>
> In [5]: linspace(0, 10, 10, endpoint=False)
> Out[5]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
>
> In [6]: range(0, 10)
> Out[6]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

My guess would be it's so that you can concatenate two sequences
created with linspace covering adjacent ranges and get the same
result as a single linspace call covering the whole range.

> I do hope, though, that the chosen name is *not*:
>
> - 'interval'
>
> - 'interpolate' or similar

Would 'subdivide' be acceptable?

--
Greg


Re: [Python-Dev] Heads up: Apple llvm gcc 4.2 miscompiles PEP 393

2011-09-28 Thread Ned Deily
In article <74f6adfa-874d-4bac-b304-ce8b12d80...@masklinn.net>,
 Xavier Morel  wrote:

> On 2011-09-28, at 19:49 , Martin v. Löwis wrote:
> > 
> > Thanks for the advice - I didn't expect that Apple ships three compilers…
> Yeah I can understand that, they're in the middle of the transition but Clang 
> is not quite there yet so...

BTW, at the moment, we are still using gcc-4.2 (not gcc-llvm nor clang) 
from Xcode 3 on OS X 10.6 for the 64-bit/32-bit installer builds and 
gcc-4.0 on 10.5 for the 32-bit-only installer builds.  We will probably 
revisit that as we get closer to 3.3 alphas and betas.

-- 
 Ned Deily,
 n...@acm.org



Re: [Python-Dev] range objects in 3.x

2011-09-28 Thread Fernando Perez
On Tue, 27 Sep 2011 11:25:48 +1000, Steven D'Aprano wrote:

> The audience for numpy is a small minority of Python users, and they

Certainly, though I'd like to mention that scientific computing is a major 
success story for Python, so hopefully it's a minority with something to 
contribute.

> tend to be more sophisticated. I'm sure they can cope with two functions
> with different APIs 

No problem with having different APIs, but in that case I'd hope the 
builtin wouldn't be named linspace, to avoid confusion.  In numpy/scipy we 
try hard to avoid collisions with existing builtin names; hopefully in 
this case we can prevent the reverse by having a dialogue.

> While continuity of API might be a good thing, we shouldn't accept a
> poor API just for the sake of continuity. I have some criticisms of the
> linspace API.
> 
> numpy.linspace(start, stop, num=50, endpoint=True, retstep=False)
> 
> http://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html
> 
> * It returns a sequence, which is appropriate for numpy but in standard
> Python it should return an iterator or something like a range object.

Sure, no problem there.

> * Why does num have a default of 50? That seems to be an arbitrary
> choice.

Yup.  linspace was modeled after matlab's identically named command:

http://www.mathworks.com/help/techdoc/ref/linspace.html

but I have no idea why the author went with 50 instead of 100 as the 
default (not that 100 is any better, just that it was matlab's choice).  
Given how linspace is often used for plotting, 100 is arguably a more 
sensible choice to get reasonable graphs on normal-resolution displays at 
typical sizes, absent adaptive plotting algorithms.

> * It arbitrarily singles out the end point for special treatment. When
> integrating, it is just as common for the first point to be singular as
> the end point, and therefore needing to be excluded.

Numerical integration is *not* the focus of linspace(): in numerical 
integration, if an end point is singular you have an improper integral and 
*must* approach the singularity much more carefully than by simply 
dropping the last point and hoping for the best.  Whether you can get away 
by using (desired_end_point - very_small_number) --the dumb, naive 
approach-- or not depends a lot on the nature of the singularity.

Since numerical integration is a complex and specialized domain and the 
subject of an entire subcomponent of the (much bigger than numpy) scipy 
library, there's no point in arguing the linspace API based on numerical 
integration considerations.

Now, I *suspect* (but don't remember for sure) that the option to have it 
right-hand-open-ended was to match the mental model people have for range:

In [5]: linspace(0, 10, 10, endpoint=False)
Out[5]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

In [6]: range(0, 10)
Out[6]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


I'm not arguing this was necessarily a good idea, just my theory on how it 
came to be.  Perhaps R. Kern or one of the numpy lurkers in here will 
pitch in with a better recollection.

> * If you exclude the end point, the stepsize, and hence the values
> returned, change:
> 
>  >>> linspace(1, 2, 4)
> array([ 1.        ,  1.33333333,  1.66666667,  2.        ])
>  >>> linspace(1, 2, 4, endpoint=False)
> array([ 1.  ,  1.25,  1.5 ,  1.75])
> 
> This surprises me. I expect that excluding the end point will just
> exclude the end point, i.e. return one fewer point. That is, I expect
> num to count the number of subdivisions, not the number of points.

I find it very natural.  It's important to remember that *the whole point* 
of linspace's existence is to provide arrays with a known, fixed number of 
points:

In [17]: npts = 10

In [18]: len(linspace(0, 5, npts))
Out[18]: 10

In [19]: len(linspace(0, 5, npts, endpoint=False))
Out[19]: 10

So the invariant to preserve is *precisely* the number of points, not the 
step size.  As Guido has pointed out several times, the value of this 
function is precisely to steer people *away* from thinking of step sizes 
in a context where they are more likely than not going to get it wrong.  
So linspace focuses on a guaranteed number of points, and lets the 
step-size chips fall where they may.
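
To make that concrete, the classic step-size trap looks like this:

>>> xs = []
>>> x = 0.0
>>> while x < 1.0:
...     xs.append(x)
...     x += 0.1
...
>>> len(xs)   # naively you'd expect 10; accumulated rounding gives 11
11

A fixed-count API sidesteps exactly this failure mode.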


> * The retstep argument changes the return signature from => array to =>
> (array, number). I think that's a pretty ugly thing to do. If linspace
> returned a special iterator object, the step size could be exposed as an
> attribute.

Yup, it's not pretty but understandable in numpy's context, a library that 
has a very strong design focus around arrays, and numpy arrays don't have 
writable attributes:

In [20]: a = linspace(0, 10)

In [21]: a.stepsize = 0.1
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/home/fperez/<ipython console> in <module>()
----> 1 a.stepsize = 0.1

AttributeError: 'numpy.ndarray' object has no attribute 'stepsize'


So while 

[Python-Dev] What it takes to change a single keyword.

2011-09-28 Thread Yaşar Arabacı
Hi,

First of all, I am sincerely sorry if this is the wrong mailing list to ask 
this question. I checked the descriptions of a couple of other mailing 
lists, and this one seemed the most suitable. Here is my question:

Let's say I want to change a single keyword, for example the import keyword, 
to be spelled as something else, like its translation into my language. I 
guess it would be more complicated than modifying Grammar/Grammar, but I 
can't be sure which files would need to be edited.

I'm asking this because I am trying to figure out whether I could translate 
the keywords into another language without affecting the behaviour of the 
language.
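
I suppose one way to experiment without touching the grammar at all would be 
a source-to-source pass that rewrites the translated spellings back into the 
real keywords before compiling. A minimal sketch of what I mean (the 
spelling 'ice_aktar' is purely hypothetical):

import io
import tokenize

TRANSLATIONS = {'ice_aktar': 'import'}  # hypothetical spelling -> keyword

def translate(source):
    # Rewrite translated keywords token by token, so that strings and
    # comments are left untouched.
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string in TRANSLATIONS:
            tok = tok._replace(string=TRANSLATIONS[tok.string])
        tokens.append(tok)
    return tokenize.untokenize(tokens)

exec(compile(translate('ice_aktar sys'), '<translated>', 'exec'))

But I'd still like to know what a real change to the grammar would involve.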


-- 
http://yasar.serveblog.net/


Re: [Python-Dev] Heads up: Apple llvm gcc 4.2 miscompiles PEP 393

2011-09-28 Thread Xavier Morel
On 2011-09-28, at 19:49 , Martin v. Löwis wrote:
> 
> Thanks for the advice - I didn't expect that Apple ships three compilers…
Yeah I can understand that, they're in the middle of the transition but Clang 
is not quite there yet so...


Re: [Python-Dev] Heads up: Apple llvm gcc 4.2 miscompiles PEP 393

2011-09-28 Thread Martin v. Löwis
> Does Clang also fail to compile this? Clang was updated from 1.6 to 2.0 with 
> Xcode 4, worth a try.

clang indeed works fine.

> Also, from your version listing it seems to be llvm-gcc (gcc frontend with 
> llvm backend I think), 
> is there no more straight gcc (with gcc frontend and backend)?

/usr/bin/cc and /usr/bin/gcc both link to llvm-gcc-4.2. However, there
still is /usr/bin/gcc-4.2. Using that, Python also compiles correctly -
so I have changed the gcc link on my system.

Thanks for the advice - I didn't expect that Apple ships three compilers...

Regards,
Martin


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread Martin v. Löwis
> Codecs use resizing a lot. Given that PyCompactUnicodeObject
> does not support resizing, most decoders will have to use
> PyUnicodeObject and thus not benefit from the memory footprint
> advantages of e.g. PyASCIIObject.

No, codecs have been rewritten to not use resizing.

> PyASCIIObject has a wchar_t *wstr pointer - I guess this should
> be a char *str pointer, otherwise, where's the memory footprint
> advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

That's the Py_UNICODE representation for backwards compatibility.
It's normally NULL.

> I also don't see a reason to limit the UCS1 storage version
> to ASCII. Accordingly, the object should be called PyLatin1Object
> or PyUCS1Object.

No, in the ASCII case, the UTF-8 length can be shared with the regular
string length - not so for Latin-1 characters above 127.

> Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing
> code will cause problems on some systems where wchar_t is a
> signed type.
> 
> Python assumes that Py_UNICODE is unsigned and thus doesn't
> check for negative values or takes these into account when
> doing range checks or code point arithmetic.
> 
> On such platform where wchar_t is signed, it is safer to
> typedef Py_UNICODE to unsigned wchar_t.

No. Py_UNICODE values *must* be in the range 0..17*2**16-1.
Values of 17*2**16 and larger are just as bad as negative
values, so having Py_UNICODE unsigned doesn't improve
anything.

> Py_UNICODE access to the objects assumes that len(obj) ==
> length of the Py_UNICODE buffer. The PEP suggests that length
> should not take surrogates into account on UCS2 platforms
> such as Windows. This causes len(obj) to not match len(wstr).

Correct.

> As a result, Py_UNICODE access to the Unicode objects breaks
> when surrogate code points are present in the Unicode object
> on UCS2 platforms.

Incorrect. What specifically do you think would break?

> The PEP also does not explain how lone surrogates will be
> handled with respect to the length information.

Just as any other code point. Python does not special-case
surrogate code points anymore.

> Furthermore, determining len(obj) will require a loop over
> the data, checking for surrogate code points. A simple memcpy()
> is no longer enough.

No, it won't. The length of the Unicode object is stored in
the length field.

> I suggest dropping the idea of having len(obj) not count
> wstr surrogate code points, to maintain backwards compatibility
> and allow for working with lone surrogates.

Backwards-compatibility is fully preserved by PyUnicode_GET_SIZE
returning the size of the Py_UNICODE buffer. PyUnicode_GET_LENGTH
returns the true length of the Unicode object.

> Note that the whole surrogate debate does not have much to
> do with this PEP, since it's mainly about memory footprint
> savings. I'd also urge to do a reality check with respect
> to surrogates and non-BMP code points: in practice you only
> very rarely see any non-BMP code points in your data. Making
> all Python users pay for the needs of a tiny fraction is
> not really fair. Remember: practicality beats purity.

That's the whole point of the PEP. You only pay for what
you actually need, and in most cases, it's ASCII.

> For best performance, each algorithm will have to be implemented
> for all three storage types.

This will be a trade-off. I think most developers will be happy
with a single version covering all three cases, especially as it's
much more maintainable.

Kind regards,
Martin



Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread Benjamin Peterson
2011/9/28 M.-A. Lemburg :
> Guido van Rossum wrote:
>> Given the feedback so far, I am happy to pronounce PEP 393 as
>> accepted. Martin, congratulations! Go ahead and mark it as Accepted.
>> (But please do fix up the small nits that Victor reported in his
>> earlier message.)
>
> I've been working on feedback for the last few days, but I guess it's
> too late. Here goes anyway...
>
> I've only read the PEP and not followed the discussion due to lack of
> time, so if any of this is no longer valid, that's probably because
> the PEP wasn't updated :-)
>
> Resizing
> --------
>
> Codecs use resizing a lot. Given that PyCompactUnicodeObject
> does not support resizing, most decoders will have to use
> PyUnicodeObject and thus not benefit from the memory footprint
> advantages of e.g. PyASCIIObject.
>
>
> Data structure
> --------------
>
> The data structure description in the PEP appears to be wrong:
>
> PyASCIIObject has a wchar_t *wstr pointer - I guess this should
> be a char *str pointer, otherwise, where's the memory footprint
> advantage (esp. on Linux where sizeof(wchar_t) == 4) ?
>
> I also don't see a reason to limit the UCS1 storage version
> to ASCII. Accordingly, the object should be called PyLatin1Object
> or PyUCS1Object.

I think the purpose is that if it's only ASCII, no work is needed to
encode to UTF-8.


-- 
Regards,
Benjamin


Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread M.-A. Lemburg
Guido van Rossum wrote:
> Given the feedback so far, I am happy to pronounce PEP 393 as
> accepted. Martin, congratulations! Go ahead and mark it as Accepted.
> (But please do fix up the small nits that Victor reported in his
> earlier message.)

I've been working on feedback for the last few days, but I guess it's
too late. Here goes anyway...

I've only read the PEP and not followed the discussion due to lack of
time, so if any of this is no longer valid, that's probably because
the PEP wasn't updated :-)

Resizing
--------

Codecs use resizing a lot. Given that PyCompactUnicodeObject
does not support resizing, most decoders will have to use
PyUnicodeObject and thus not benefit from the memory footprint
advantages of e.g. PyASCIIObject.


Data structure
--------------

The data structure description in the PEP appears to be wrong:

PyASCIIObject has a wchar_t *wstr pointer - I guess this should
be a char *str pointer, otherwise, where's the memory footprint
advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

I also don't see a reason to limit the UCS1 storage version
to ASCII. Accordingly, the object should be called PyLatin1Object
or PyUCS1Object.

Here's the version from the PEP:

"""
typedef struct {
  PyObject_HEAD
  Py_ssize_t length;
  Py_hash_t hash;
  struct {
  unsigned int interned:2;
  unsigned int kind:2;
  unsigned int compact:1;
  unsigned int ascii:1;
  unsigned int ready:1;
  } state;
  wchar_t *wstr;
} PyASCIIObject;

typedef struct {
  PyASCIIObject _base;
  Py_ssize_t utf8_length;
  char *utf8;
  Py_ssize_t wstr_length;
} PyCompactUnicodeObject;
"""

Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing
code will cause problems on some systems where wchar_t is a
signed type.

Python assumes that Py_UNICODE is unsigned and thus doesn't
check for negative values or takes these into account when
doing range checks or code point arithmetic.

On such platform where wchar_t is signed, it is safer to
typedef Py_UNICODE to unsigned wchar_t.

Accordingly, and to prevent further breakage, Py_UNICODE
should not be deprecated; it should be used instead of wchar_t
throughout the code.


Length information
------------------

Py_UNICODE access to the objects assumes that len(obj) ==
length of the Py_UNICODE buffer. The PEP suggests that length
should not take surrogates into account on UCS2 platforms
such as Windows. This causes len(obj) to not match len(wstr).

As a result, Py_UNICODE access to the Unicode objects breaks
when surrogate code points are present in the Unicode object
on UCS2 platforms.

The PEP also does not explain how lone surrogates will be
handled with respect to the length information.

Furthermore, determining len(obj) will require a loop over
the data, checking for surrogate code points. A simple memcpy()
is no longer enough.

I suggest dropping the idea of having len(obj) not count
wstr surrogate code points, to maintain backwards compatibility
and allow for working with lone surrogates.

Note that the whole surrogate debate does not have much to
do with this PEP, since it's mainly about memory footprint
savings. I'd also urge to do a reality check with respect
to surrogates and non-BMP code points: in practice you only
very rarely see any non-BMP code points in your data. Making
all Python users pay for the needs of a tiny fraction is
not really fair. Remember: practicality beats purity.


API
---

Victor already described the needed changes.


Performance
-----------

The PEP only lists a few low-level benchmarks as basis for the
performance decrease. I'm missing some more adequate real-life
tests, e.g. using an application framework such as Django
(to the extent this is possible with Python3) or a server
like the Radicale calendar server (which is available for Python3).

I'd also like to see a performance comparison which specifically
uses the existing Unicode APIs to create and work with Unicode
objects. Most extensions will use this way of working with the
Unicode API, either because they want to support Python 2 and 3,
or because the effort it takes to port to the new APIs is
too high. The PEP makes some statements that this is slower,
but doesn't quantify those statements.


Memory savings
--------------

The table only lists string sizes up to 8 code points. The memory
savings for these are really only significant for ASCII
strings on 64-bit platforms, if you use the default UCS2
Python build as basis.

For larger strings, I expect the savings to be more significant.
OTOH, a single non-BMP code point in such a string would cause
the savings to drop significantly again.


Complexity
----------

In order to benefit from the new API, any code that has to
deal with low-level Py_UNICODE access to the Unicode objects
will have to be adapted.

For best performance, each algorithm will have to be implemented
for all three storage types.

Not doing so will result in a slow-down, if I read the PEP
correctly. It's difficult to say of what scale, since that in

Re: [Python-Dev] PEP 393 merged

2011-09-28 Thread Guido van Rossum
Congrats! Python 3.3 will be better because of this.

On Wed, Sep 28, 2011 at 12:48 AM, "Martin v. Löwis"  wrote:
> I have now merged the PEP 393 implementation into default.
> The main missing piece is the documentation; contributions are
> welcome.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] unittest missing assertNotRaises

2011-09-28 Thread Laurens Van Houtven
Oops, I accidentally hit Reply instead of Reply to All...

On Wed, Sep 28, 2011 at 1:05 PM, Michael Foord wrote:

>  On 27/09/2011 19:59, Laurens Van Houtven wrote:
>
> Sure, you just *do* it. The only advantage I see in assertNotRaises is that
> when that exception is raised, you should (and would) get a failure, not an
> error.
>
> There are some who don't see the distinction between a failure and an error
> as a useful distinction... I'm becoming more sympathetic to that view.
>

I agree. Maybe if there were fewer failures posing as errors and errors
posing as failures, I'd consider taking the distinction seriously.

The only use case I've personally encountered is with fuzzy tests. The
example that comes to mind is one where we had a fairly complex iterative
algorithm for learning things from huge amounts of test data and there were
certain criteria (goodness of result, time taken) that had to be satisfied.
In that case, "it blew up because someone messed up dependencies" and "it
took 3% longer than is allowable"  are pretty obviously different...
Considering how exotic that use case is, like I said, I'm not really
convinced how generally useful it is :) especially since this isn't even a
unit test...



> All the best,
>
> Michael
>

cheers
lvh


Re: [Python-Dev] Heads up: Apple llvm gcc 4.2 miscompiles PEP 393

2011-09-28 Thread Xavier Morel
On 2011-09-28, at 13:24 , mar...@v.loewis.de wrote:
> The gcc that Apple ships with the Lion SDK (not sure what Xcode version that 
> is)
Xcode 4.1

> I'm not aware of a work-around in the code. My work-around is to use gcc-4.0,
> which is still available on my system from an earlier Xcode installation
> (in /Developer-3.2.6)
Does Clang also fail to compile this? Clang was updated from 1.6 to 2.0 with 
Xcode 4, worth a try.

Also, from your version listing it seems to be llvm-gcc (gcc frontend with llvm 
backend I think), is there no more straight gcc (with gcc frontend and backend)?

FWIW, on 10.6 the default gcc is a straight 4.2

> gcc --version
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664)

There is an llvm-gcc 4.2 but it uses a slightly different revision of llvm

> llvm-gcc --version
i686-apple-darwin10-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 
5658) (LLVM build 2333.4)




[Python-Dev] Heads up: Apple llvm gcc 4.2 miscompiles PEP 393

2011-09-28 Thread martin
The gcc that Apple ships with the Lion SDK (not sure what Xcode version 
that is) miscompiles Python now. I've reported this to Apple as bug 
10143715; not sure whether there is a public link to this bug report.

In essence, the code

typedef struct {
    long length;
    long hash;
    int state;
    int *wstr;
} PyASCIIObject;

typedef struct {
    PyASCIIObject _base;
    long utf8_length;
    char *utf8;
    long wstr_length;
} PyCompactUnicodeObject;

void *_PyUnicode_compact_data(void *unicode) {
    return ((((PyASCIIObject*)unicode)->state & 0x20) ?
            ((void*)((PyASCIIObject*)(unicode) + 1)) :
            ((void*)((PyCompactUnicodeObject*)(unicode) + 1)));
}

miscompiles (with -O2 -fomit-frame-pointer) to


__PyUnicode_compact_data:
Leh_func_begin1:
        leaq    32(%rdi), %rax
        ret

The compiler version is

gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)

This unconditionally assumes that sizeof(PyASCIIObject) needs to be
added to unicode, independent of whether the state bit is set or not.

I'm not aware of a work-around in the code. My work-around is to use gcc-4.0,
which is still available on my system from an earlier Xcode installation
(in /Developer-3.2.6)

Regards,
Martin




Re: [Python-Dev] range objects in 3.x

2011-09-28 Thread Greg Ewing

Ethan Furman wrote:

> Well, actually, I'd be using it with dates.  ;)

Seems to me that one size isn't going to fit all.

Maybe we really want two functions:

    interpolate(start, end, count)
        Requires a type supporting addition and division,
        designed to work predictably and accurately with
        floats

    extrapolate(start, step, end)
        Works for any type supporting addition, not
        recommended for floats
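
A rough pure-Python sketch of those two signatures (treating count as the 
number of subdivisions, which is just one possible reading):

def interpolate(start, end, count):
    # count subdivisions -> count + 1 points; each point is computed
    # from the endpoints, so float rounding does not accumulate
    for i in range(count + 1):
        yield start + (end - start) * i / count

def extrapolate(start, step, end):
    # repeated addition: exact for ints, Fractions, dates + timedeltas;
    # assumes values are comparable against the end bound
    value = start
    while value < end:
        yield value
        value += step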

--
Greg


Re: [Python-Dev] unittest missing assertNotRaises

2011-09-28 Thread Michael Foord

On 27/09/2011 19:59, Laurens Van Houtven wrote:
> Sure, you just *do* it. The only advantage I see in assertNotRaises is
> that when that exception is raised, you should (and would) get a
> failure, not an error.

There are some who don't see the distinction between a failure and an 
error as a useful distinction... I'm becoming more sympathetic to that view.


All the best,

Michael







--
http://www.voidspace.org.uk/

May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.
-- the sqlite blessing http://www.sqlite.org/different.html



Re: [Python-Dev] unittest missing assertNotRaises

2011-09-28 Thread Michael Foord

On 27/09/2011 19:46, Wilfred Hughes wrote:
> Hi folks
>
> I wasn't sure if this warranted a bug in the tracker, so I thought I'd
> raise it here first.
>
> unittest has assertIn, assertNotIn, assertEqual, assertNotEqual and so
> on. So, it seems odd to me that there isn't assertNotRaises. Is there
> any particular motivation for not putting it in?
>
> I've attached a simple patch against Python 3's trunk to give an idea
> of what I have in mind.


As others have said, the opposite of assertRaises is just calling the code!

I have several times needed regression tests that call code that *used* 
to raise an exception. It can look slightly odd to have a test without 
an assert, but the singular uselessness of assertNotRaises does not make 
it a better alternative. I usually add a comment:


def test_something_that_used_to_not_work(self):
# this used to raise an exception
do_something()

All the best,

Michael Foord


> Thanks
> Wilfred





--
http://www.voidspace.org.uk/

May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.
-- the sqlite blessing http://www.sqlite.org/different.html



Re: [Python-Dev] unittest missing assertNotRaises

2011-09-28 Thread Oleg Broytman
On Wed, Sep 28, 2011 at 09:43:13AM +1000, Steven D'Aprano wrote:
> Oleg Broytman wrote:
> >On Tue, Sep 27, 2011 at 07:46:52PM +0100, Wilfred Hughes wrote:
> >> +def assertNotRaises(self, excClass, callableObj=None, *args, **kwargs):
> >> +    """Fail if an exception of class excClass is thrown by
> >> +       callableObj when invoked with arguments args and keyword
> >> +       arguments kwargs.
> >> +    """
> >> +    try:
> >> +        callableObj(*args, **kwargs)
> >> +    except excClass:
> >> +        raise self.failureException("%s was raised" % excClass)
> >> +

> But I can't see this being a useful test.

   Me too.

Oleg.
-- 
 Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] unittest missing assertNotRaises

2011-09-28 Thread Wilfred Hughes
On 27 September 2011 19:59, Laurens Van Houtven <_...@lvh.cc> wrote:
> Sure, you just *do* it. The only advantage I see in assertNotRaises is that 
> when that exception is raised, you should (and would) get a failure, not an 
> error.

It's a useful distinction. I have found myself writing code of the form:

def test_old_exception_no_longer_raised(self):
    try:
        do_something()
    except OldException:
        self.assertTrue(False)

in order to distinguish between a regression and something new
erroring. The limitation of this pattern is that the test failure
message is not as good.
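
A variant that keeps the failure/error distinction while giving a better 
message might be (a sketch, reusing the same hypothetical do_something and 
OldException):

def test_old_exception_no_longer_raised(self):
    try:
        do_something()
    except OldException as e:
        self.fail('OldException unexpectedly raised: %r' % e)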


[Python-Dev] PEP 393 merged

2011-09-28 Thread Martin v. Löwis
I have now merged the PEP 393 implementation into default.
The main missing piece is the documentation; contributions are
welcome.

Regards,
Martin


Re: [Python-Dev] cpython: Implement PEP 393.

2011-09-28 Thread Martin v. Löwis
> Surely there must be more new APIs and changes that need documenting?

Correct. All documentation still needs to be written.

Regards,
Martin