Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Toshio Kuratomi
On Wed, Jan 26, 2011 at 11:24:54AM +0900, Stephen J. Turnbull wrote:
> Toshio Kuratomi writes:
> 
>  > On Linux there's no defined encoding that will work; file names are just
>  > bytes to the Linux kernel so based on people's argument that the convention
>  > is and should be that filenames are utf-8 and anything else is
>  > a misconfigured system -- python should mandate that its module filenames on
>  > Linux are utf-8 rather than using the user's locale settings.
> 
> This isn't going to work where I live (Tsukuba).  At the national
> university alone there are hundreds of pre-existing *nix systems whose
> filesystems were often configured a decade or more ago.  Even if the
> hardware and OS have been upgraded, the filesystems are usually
> migrated as-is, with OS configuration tweaks to accommodate them.  Many
> of them use EUC-JP (and servers often Shift JIS).  That means that you
> won't be able to read module names with ls, and that will make Python
> unacceptable for this purpose.  I imagine that in Russia the same is
> true for the various Cyrillic encodings.
> 
Sure ... but with these systems, neither read-modules-as-locale nor
read-modules-as-utf-8 is a workable solution, correct?  Especially if
the OS does get upgraded but the filesystems with user data (and user
created modules) are migrated as-is, you'll run into situations where system
installed modules are in utf-8 and user created modules are shift-jis and so
something will always be broken.

The only way to make sure that modules work is to restrict them to ASCII-only
on the filesystem.  But because unicode module names are seen as
a necessary feature, the question is which way forward is going to lead to
the least brokenness.  Which could be locale... but from the python2
locale-related bugs that I get to look at, I doubt it.

> I really don't think there is anything that can be done here except to
> warn people that "Kids, these stunts are performed by highly-trained
> professionals.  Don't try this at home!"  Of course they will anyway,
> but at least they will have been warned in sufficiently strong terms
> that they might pay attention and be able to recover when they run
> into bizarre import exceptions.
> 
So on the subject of warnings... I think a reason it's better to pick an
encoding for the platform/filesystem rather than to use locale is because
people will get an error or a warning at the appropriate time if that's the
case -- the first time they attempt to create and import a module with
a filename that's not encoded in the correct encoding for the platform.
It's all very well to say: "We wrote in the documentation on
http://docs.python.org/distutils/introduction.html#Choosing-a-name that only
ASCII names should be used when distributing python modules" but if the
interpreter doesn't complain when people use a non-ASCII filename we all
know that they aren't going to look in the documentation; they'll try it and
if it works they'll learn that habit.  
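
A minimal sketch of what such a complaint could look like, assuming an ASCII-only policy. The helper name check_module_filename is hypothetical (nothing like it exists in the interpreter), and the same shape would apply to a UTF-8-only policy:

```python
import warnings


def check_module_filename(filename):
    """Hypothetical check: warn when a module filename is not pure ASCII.

    `filename` is the raw bytes name as stored on the filesystem.
    Returns True when the name is acceptable under an ASCII-only policy.
    """
    try:
        filename.decode("ascii")
    except UnicodeDecodeError:
        warnings.warn(
            "module filename %r is not ASCII; it may not import on "
            "systems with a different filesystem encoding" % (filename,),
            ImportWarning,
        )
        return False
    return True


print(check_module_filename(b"spam.py"))         # ASCII name: accepted
print(check_module_filename(b"caf\xc3\xa9.py"))  # UTF-8 bytes: warned about
```

The point of the sketch is only the timing: the user hears about the problem the first time the name is used, not after reading the documentation.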

-Toshio


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Stephen J. Turnbull
Toshio Kuratomi writes:

 > On Linux there's no defined encoding that will work; file names are just
 > bytes to the Linux kernel so based on people's argument that the convention
 > is and should be that filenames are utf-8 and anything else is
 > a misconfigured system -- python should mandate that its module filenames on
 > Linux are utf-8 rather than using the user's locale settings.

This isn't going to work where I live (Tsukuba).  At the national
university alone there are hundreds of pre-existing *nix systems whose
filesystems were often configured a decade or more ago.  Even if the
hardware and OS have been upgraded, the filesystems are usually
migrated as-is, with OS configuration tweaks to accommodate them.  Many
of them use EUC-JP (and servers often Shift JIS).  That means that you
won't be able to read module names with ls, and that will make Python
unacceptable for this purpose.  I imagine that in Russia the same is
true for the various Cyrillic encodings.

I really don't think there is anything that can be done here except to
warn people that "Kids, these stunts are performed by highly-trained
professionals.  Don't try this at home!"  Of course they will anyway,
but at least they will have been warned in sufficiently strong terms
that they might pay attention and be able to recover when they run
into bizarre import exceptions.

Oh, yeah, don't forget to apply Victor's patch, which allows Python to
keep the promises it can make about consistency.



Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py

2011-01-25 Thread Brett Cannon
This broke the buildbots (R. David Murray thinks you may have
forgotten to call super() in the 'payload is None' branch). Are you
getting code reviews and fully running the test suite before
committing? We are in RC.
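
To illustrate the suspected problem with toy classes (these are hypothetical stand-ins, not the real email.generator code): if the base class does any work in _handle_text(), a bare return in the subclass skips it, whereas falling through to super() keeps the two in step.

```python
class ToyMessage:
    """Hypothetical stand-in for a message object."""

    def __init__(self, payload):
        self._payload = payload

    def get_payload(self):
        return self._payload


def has_surrogates(text):
    """Return True if text contains lone surrogates (escaped raw bytes)."""
    try:
        text.encode("utf-8")
        return False
    except UnicodeEncodeError:
        return True


class ToyGenerator:
    """Hypothetical base class; counts every message it handles."""

    def __init__(self):
        self.written = []
        self.handled = 0

    def write(self, text):
        self.written.append(text)

    def _handle_text(self, msg):
        self.handled += 1  # bookkeeping that must not be skipped
        payload = msg.get_payload()
        if payload is not None:
            self.write(payload)


class ToyBytesGenerator(ToyGenerator):
    def _handle_text(self, msg):
        payload = msg.get_payload()
        if payload is not None and has_surrogates(payload):
            # Original source was bytes: write it back out unchanged.
            self.write(payload)
        else:
            # Covers ordinary text and the 'payload is None' case alike.
            super()._handle_text(msg)


gen = ToyBytesGenerator()
gen._handle_text(ToyMessage(None))
gen._handle_text(ToyMessage("plain text"))
gen._handle_text(ToyMessage("raw \udcff byte"))
print(gen.handled, len(gen.written))
```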

On Tue, Jan 25, 2011 at 16:39, victor.stinner
 wrote:
> Author: victor.stinner
> Date: Wed Jan 26 01:39:19 2011
> New Revision: 88197
>
> Log:
> Fix BytesGenerator._handle_text() if the message has no payload (None)
>
> Modified:
>   python/branches/py3k/Lib/email/generator.py
>
> Modified: python/branches/py3k/Lib/email/generator.py
> ==
> --- python/branches/py3k/Lib/email/generator.py (original)
> +++ python/branches/py3k/Lib/email/generator.py Wed Jan 26 01:39:19 2011
> @@ -377,8 +377,11 @@
>     def _handle_text(self, msg):
>         # If the string has surrogates the original source was bytes, so
>         # just write it back out.
> -        if _has_surrogates(msg._payload):
> -            self.write(msg._payload)
> +        payload = msg.get_payload()
> +        if payload is None:
> +            return
> +        if _has_surrogates(payload):
> +            self.write(payload)
>         else:
>             super(BytesGenerator,self)._handle_text(msg)
>
> ___
> Python-checkins mailing list
> python-check...@python.org
> http://mail.python.org/mailman/listinfo/python-checkins
>


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Dj Gilcrease
On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg  wrote:
> I also don't see how this could save a lot of memory. As an example
> take a French text with say 10mio code points. This would end up
> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
> on how many accents are used). That's a saving of -10MB compared to
> today's implementation :-)

If I am reading the pep right, which I may not be as I am no expert on
unicode, the new implementation would actually give a 10MB saving
since the wchar field is optional, so only the str (Latin-1) and utf8
fields would need to be stored. How it decides not to store one field
or another would need to be clarified in the pep if I am right.
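
Spelling out the arithmetic behind both readings, as a rough sketch. The per-copy sizes are the ones quoted above; the 35MB baseline (a UCS2 buffer plus a cached UTF-8 copy) is an assumption, chosen because it is the figure that makes both the -10MB and the +10MB numbers come out:

```python
MB = 10 ** 6                      # treat 1MB as one million bytes
code_points = 10 * MB             # "10mio code points"

ucs2 = 2 * code_points            # 2 bytes per code point
latin1 = 1 * code_points          # 1 byte per code point
utf8 = 15 * MB                    # the quoted estimate, not computed

baseline = ucs2 + utf8            # ASSUMED current cost: UCS2 + cached UTF-8
all_three = ucs2 + latin1 + utf8  # every representation materialised
without_wstr = latin1 + utf8      # wstr never requested

print((baseline - all_three) // MB)     # saving with three copies
print((baseline - without_wstr) // MB)  # saving with two copies
```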


Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py

2011-01-25 Thread Nick Coghlan
On Wed, Jan 26, 2011 at 10:39 AM, victor.stinner
 wrote:
> Author: victor.stinner
> Date: Wed Jan 26 01:39:19 2011
> New Revision: 88197
>
> Log:
> Fix BytesGenerator._handle_text() if the message has no payload (None)

Folks, for the peace of mind of python-checkins watchers, please
remember to mention the reviewer's name when checking in fixes during
the RC period (the last one I checked had been reviewed by Georg on
the issue tracker, but it's hard to check without even an issue number
to look up).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Location of tests for packages

2011-01-25 Thread Nick Coghlan
On Wed, Jan 26, 2011 at 4:16 AM, Alexander Belopolsky
 wrote:
> FWIW, I am +0 on consolidating tests under Lib/test.  One of the
> reasons that I have not seen mentioned is that it is well-known that
> test package is not part of the official stdlib API and can be
> changed/restructured in backward incompatible ways. It is not obvious
> whether the same applies to say lib2to3.tests or ctypes.test.

I am +0 for the same reason as Alexander. The test subpackages should
either be moved under the test package, or, for packages with PyPI
distributed backports for previous versions, they should be prefixed
with a leading underscore to make it clear that they're private
implementation details and backwards compatibility guarantees don't
apply.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Antoine Pitrou
On Tue, 25 Jan 2011 21:08:01 +1000
Nick Coghlan  wrote:
> 
> One change I would propose is that rather than hiding flags in the low
> order bits of the str pointer, we expand the use of the existing
> "state" field to cover the representation information in addition to
> the interning information.

+1, by the way. The "state" field has many bits available (even if we
decide to make it a char rather than an int).

Regards

Antoine.




Re: [Python-Dev] r88178 - python/branches/py3k/Lib/test/crashers/underlying_dict.py

2011-01-25 Thread Martin v. Löwis
> Some comments would be nice. Right now it looks pretty close to
> deliberately obfuscated code (especially with the call to
> gc.get_referrers()).

That call tries to get at the class dictionary, rather than just
the dict_proxy that you get from A.__dict__. There should be
two referrers to thingy: the class dict, and the module dict.
The class dict will have a __module__ key.

I agree the program should print 2, though.
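
A minimal sketch of just the lookup step, assuming CPython (other implementations may report referrers differently). It does not mutate the dict, so it does not reproduce the crasher itself; the names thingy, A and x are only for illustration:

```python
import gc

thingy = object()


class A(object):
    def f(self):
        return 1

    x = thingy


# Every dict that refers directly to `thingy`; on CPython this should
# include the real class dict of A as well as the enclosing namespace.
referrers = [r for r in gc.get_referrers(thingy) if isinstance(r, dict)]

# The class dict is the one that carries a __module__ key and maps 'x'
# to thingy.
class_dicts = [d for d in referrers
               if "__module__" in d and d.get("x") is thingy]

print(type(A.__dict__).__name__)                 # the read-only proxy type
print([type(d).__name__ for d in class_dicts])   # the underlying dict(s)
```

Writing to the dict found this way bypasses the proxy, which is what the crasher relies on.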

Regards,
Martin


Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Antoine Pitrou

For the record:

> I also don't see how this could save a lot of memory. As an example
> take a French text with say 10mio code points. This would end up
> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
> on how many accents are used).

Typical French text seems to have 5% non-ASCII characters. So the
number of UTF-8 bytes needed to represent a French text would only be
5% higher than the number of code points.
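
A quick check of that estimate on a synthetic sample, assuming every non-ASCII character is an accented Latin letter such as U+00E9 (two bytes in UTF-8). The 5% ratio is the figure above, not a measurement of real French text:

```python
# 100 code points of which 5 are non-ASCII (each one is U+00E9).
sample = "e" * 95 + "\u00e9" * 5

code_points = len(sample)
utf8_bytes = len(sample.encode("utf-8"))

# Each U+00E9 needs two bytes in UTF-8; every ASCII character needs one.
overhead = (utf8_bytes - code_points) / code_points
print(code_points, utf8_bytes, overhead)
```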

Anyway, it's quite obvious that Martin's goal is that only one
representation gets created most of the time. To quote the draft:

“All three representations are optional, although the str form is
considered the canonical representation which can be absent only
while the string is being created.”

Regards

Antoine.




Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread M.-A. Lemburg
I'll comment more on this later this week...

From my first impression, I'm
not too thrilled by the prospect of making the Unicode implementation
more complicated by having three different representations on each
object.

I also don't see how this could save a lot of memory. As an example
take a French text with say 10mio code points. This would end up
appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
on how many accents are used). That's a saving of -10MB compared to
today's implementation :-)

"Martin v. Löwis" wrote:
> I have been thinking about Unicode representation for some time now.
> This was triggered, on the one hand, by discussions with Glyph Lefkowitz
> (who complained that his server app consumes too much memory), and Carl
> Friedrich Bolz (who profiled Python applications to determine that
> Unicode strings are among the top consumers of memory in Python).
> On the other hand, this was triggered by the discussion on supporting
> surrogates in the library better.
> 
> I'd like to propose PEP 393, which takes a different approach,
> addressing both problems simultaneously: by getting a flexible
> representation (one that can be either 1, 2, or 4 bytes), we can
> support the full range of Unicode on all systems, but still use
> only one byte per character for strings that are pure ASCII (which
> will be the majority of strings for the majority of users).
> 
> You'll find the PEP at
> 
> http://www.python.org/dev/peps/pep-0393/
> 
> For convenience, I include it below.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] r88178 - python/branches/py3k/Lib/test/crashers/underlying_dict.py

2011-01-25 Thread Antoine Pitrou
Le mardi 25 janvier 2011 à 20:11 +0200, Maciej Fijalkowski a écrit :
> On Tue, Jan 25, 2011 at 1:26 PM, Antoine Pitrou  wrote:
> > On Tue, 25 Jan 2011 01:00:28 +0100 (CET)
> > benjamin.peterson  wrote:
> >> Author: benjamin.peterson
> >> Date: Tue Jan 25 01:00:28 2011
> >> New Revision: 88178
> >>
> >> Log:
> >> another pretty crasher served up by pypy
> >
> > Some comments would be nice. Right now it looks pretty close to
> > deliberately obfuscated code (especially with the call to
> > gc.get_referrers()).
> >
> > Regards
> >
> > Antoine.
> >
> 
> It gets to the dict of a class, circumventing dictproxy. It's not yet clear
> why it segfaults.

Perhaps the method cache? But why the comment "# should print 1"?
Shouldn't it print 2 instead?





Re: [Python-Dev] Location of tests for packages

2011-01-25 Thread Alexander Belopolsky
On Tue, Jan 25, 2011 at 12:38 PM, Brett Cannon  wrote:
>.. If we move some modules and not others purely because some
> distros choose not to ship e.g., ctypes and sqlite3

I don't see why this is a problem.  Regrtest already has a mechanism
that allows skipping tests based on various criteria.  This mechanism
works for both packages and flat modules that can be optionally
installed.

FWIW, I am +0 on consolidating tests under Lib/test.  One of the
reasons that I have not seen mentioned is that it is well-known that
test package is not part of the official stdlib API and can be
changed/restructured in backward incompatible ways. It is not obvious
whether the same applies to say lib2to3.tests or ctypes.test.

If you are interested to see what it takes to move tests from a
package, I moved json tests to Lib/test/json_tests in r86875.  It is
not hard, but does require some changes to imports.


Re: [Python-Dev] r88178 - python/branches/py3k/Lib/test/crashers/underlying_dict.py

2011-01-25 Thread Maciej Fijalkowski
On Tue, Jan 25, 2011 at 1:26 PM, Antoine Pitrou  wrote:
> On Tue, 25 Jan 2011 01:00:28 +0100 (CET)
> benjamin.peterson  wrote:
>> Author: benjamin.peterson
>> Date: Tue Jan 25 01:00:28 2011
>> New Revision: 88178
>>
>> Log:
>> another pretty crasher served up by pypy
>
> Some comments would be nice. Right now it looks pretty close to
> deliberately obfuscated code (especially with the call to
> gc.get_referrers()).
>
> Regards
>
> Antoine.
>

It gets to the dict of a class, circumventing dictproxy. It's not yet clear
why it segfaults.


Re: [Python-Dev] Location of tests for packages

2011-01-25 Thread Brett Cannon
On Mon, Jan 24, 2011 at 17:19, Raymond Hettinger
 wrote:
>
> On Jan 24, 2011, at 3:40 PM, Michael Foord wrote:
>> It isn't just unittest, it seems that all *test packages* are in their 
>> respective package and not Lib/test except for the json module where Raymond 
>> already moved the tests:
>>
>>    distutils/tests
>>    email/test
>>    ctypes/test
>>    importlib/test
>>    lib2to3/tests
>>    sqlite3/test
>>    tkinter/test
>>
>> So I'm a little confused as to why the focus on the *unittest* test suite.
>
>
> There's not a focus on unittest.  Importlib should also move under Lib/test
> and when email is ready, it too should fully join the organization of
> the overall project (Doc, Lib, Lib/test, Modules, Objects, Tools).

Just to clarify my position since importlib keeps getting brought up
as an example, I'm fine with a move but I won't be putting the work in
to do the move if there is actually consensus to make this a
stdlib-wide policy. And I am assuming that the directory will be moved
wholesale to Lib/test/importlib (with proper fixes for any relative
imports) along with verification that importlib.test.__main__
continues to work (naming it test.importlib_tests seems rather
redundant compared to test.importlib).

While I'm for consistency, obviously a trend was started by ctypes and
sqlite3 that the rest of us who created full packages followed up to
this point. If we move some modules and not others purely because some
distros choose not to ship e.g., ctypes and sqlite3, that will get
annoying w/o some very clear explanation/delineation as to why some
packages have a special rule to follow (I'm guessing "packages that
have external dependencies" would be it).


Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Toshio Kuratomi
On Tue, Jan 25, 2011 at 10:22:41AM +0100, Xavier Morel wrote:
> On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> > 
> > * If you can pick a set of encodings that are valid (utf-8 for Linux and
> >  MacOS
> 
> HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right 
> here you've already broken Python modules on OSX.
>
Others have been saying that Mac OSX's HFS+ uses UTF-8.  But the question is
not whether UTF-16 or UTF-8 is used by HFS+.  It's whether you can sensibly
decide on an encoding from the type of system that is being run on.  This
could be querying the filesystem or a check on sys.platform or some other
method.  I don't know what detection the current code does.

On Linux there's no defined encoding that will work; file names are just
bytes to the Linux kernel so based on people's argument that the convention
is and should be that filenames are utf-8 and anything else is
a misconfigured system -- python should mandate that its module filenames on
Linux are utf-8 rather than using the user's locale settings.
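
A sketch of the sys.platform variant, purely to show the shape of the idea. The table and the function name proposed_module_encoding are hypothetical; the utf-8 entries mirror the proposal in this thread and are not what any Python release actually does:

```python
import sys

# Hypothetical policy table keyed on sys.platform prefixes.
PROPOSED_MODULE_ENCODINGS = (
    ("linux", "utf-8"),
    ("darwin", "utf-8"),
)


def proposed_module_encoding(platform=sys.platform):
    """Return the mandated encoding for `platform`, or None if undecided."""
    for prefix, encoding in PROPOSED_MODULE_ENCODINGS:
        if platform.startswith(prefix):
            return encoding
    return None


# What the running interpreter currently uses for filenames, for comparison.
print(proposed_module_encoding(), sys.getfilesystemencoding())
```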
> 
> And as far as I know, Linux software/FS generally use NFC (I've already seen 
> this issue cause trouble)
> 
Linux FS's are bytes with a small blacklist (so you can't use the NULL byte
in a filename, for instance).  Linux software would be free to use any
normal form that they want.  If one software used NFC and another used NFD,
the FS would record two separate files with two separate filenames.  Other
programs might or might not display this correctly.

Example:
$ touch cafe
$ python
Python 2.7 (r27:82500, Sep 16 2010, 18:02:00) 
>>> import os
>>> import unicodedata
>>> a=u'café'
>>> b=unicodedata.normalize('NFC', a)
>>> c=unicodedata.normalize('NFD', a)
>>> open(b.encode('utf8'), 'w').close()
>>> open(c.encode('utf8'), 'w').close()
>>> os.listdir(u'.')
[u'people-etc-changes.txt', u'cafe\u0301', u'cafe',
 u'people-etc-changes.sha256sum', u'caf\xe9']
>>> os.listdir('.')
['people-etc-changes.txt', 'cafe\xcc\x81', 'cafe',
 'people-etc-changes.sha256sum', 'caf\xc3\xa9']
>>> ^D

$ ls -al .
drwxrwxr-x.  2 badger badger  4096 Jan 25 07:46 .
drwxr-xr-x. 17 badger badger  4096 Jan 24 18:27 ..
-rw-rw-r--.  1 badger badger 0 Jan 25 07:45 cafe
-rw-rw-r--.  1 badger badger 0 Jan 25 07:46 cafe
-rw-rw-r--.  1 badger badger 0 Jan 25 07:46 café

$ ls -al cafe
-rw-rw-r--.  1 badger badger 0 Jan 25 07:45 cafe
$ ls -al cafe?
-rw-rw-r--.  1 badger badger 0 Jan 25 07:46 cafe

Now in this case, the decomposed form of the filename is being displayed
incorrectly and the shell treats the decomposed character as two characters
instead of one.  However, when you view these files in dolphin (the KDE file
manager) you properly see café repeated twice.
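
The same point without touching the filesystem, as a small Python 3 sketch: the two normal forms of one visible name are different code point sequences and therefore different UTF-8 byte strings, which is all a bytes-based filesystem sees.

```python
import unicodedata

name = "caf\u00e9"
nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

# Same text to a reader, different code points and different UTF-8 bytes,
# so a bytes-based filesystem stores them as two separate names.
print(len(nfc), nfc.encode("utf-8"))
print(len(nfd), nfd.encode("utf-8"))
print(nfc == nfd)
```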

-Toshio




Re: [Python-Dev] [Python-checkins] r88155 - python/branches/py3k/Doc/whatsnew/3.2.rst

2011-01-25 Thread Nick Coghlan
On Mon, Jan 24, 2011 at 11:51 AM, raymond.hettinger
 wrote:
> Author: raymond.hettinger
> Date: Mon Jan 24 02:51:49 2011
> New Revision: 88155
>
> Log:
> Add entries for dis, dbm, and ctypes.
>
>
> Modified:
>   python/branches/py3k/Doc/whatsnew/3.2.rst
>
> Modified: python/branches/py3k/Doc/whatsnew/3.2.rst
> ==
> --- python/branches/py3k/Doc/whatsnew/3.2.rst   (original)
> +++ python/branches/py3k/Doc/whatsnew/3.2.rst   Mon Jan 24 02:51:49 2011
> @@ -1599,6 +1599,51 @@
>
>  (Contributed by Ron Adam; :issue:`2001`.)
>
> +dis
> +---

For the dis module there is also the change to dis.dis() itself from
issue 6507 - you can now pass source strings directly to dis without
needing to compile them first:

>>> dis.dis("1 + 2")
  1           0 LOAD_CONST               2 (3)
              3 RETURN_VALUE

> +The :mod:`dis` module gained two new functions for inspecting code,
> +:func:`~dis.code_info` and :func:`~dis.show_code`.  Both provide detailed code
> +object information for the supplied function, method, source code string or code
> +object.  The former returns a string and the latter prints it::
> +object.  The former returns a string and the latter prints it::
> +
> +    >>> import dis, random
> +    >>> show_code(random.choice)

Typo here - missing a "dis." at the start of the line.
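
For reference, a short sketch of how the two new functions are called with the module prefix in place (Python 3.2 or later; the exact report text varies between versions, so none is shown here):

```python
import dis
import random

# code_info() returns the report as a string ...
info = dis.code_info(random.choice)
print(info.splitlines()[0])

# ... while show_code() prints the same kind of report itself, and both
# accept a source string as well as a function or method.
dis.show_code("1 + 2")
```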

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread exarkun

On 09:22 am, catch-...@masklinn.net wrote:
> On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> >
> > * If you can pick a set of encodings that are valid (utf-8 for Linux and
> >  MacOS
>
> HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD).
> Right here you've already broken Python modules on OSX.

Are you sure about the UTF-16 part?  Evidence strongly points towards UTF-8:

 $ python
 Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
 [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import unicodedata, os
 >>> file(u'\N{SNOWMAN}', 'w').close()
 >>> os.listdir('.')
 ['\xe2\x98\x83']
 >>> unicodedata.name('\xe2\x98\x83'.decode('utf-8'))
 'SNOWMAN'
 >>>
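
A Python 3 sketch of the same check, using only the bytes shown in the session above. Note that it demonstrates what the OS hands back through the byte-oriented API, which is not necessarily how the name is stored on disk:

```python
import unicodedata

# The three bytes that os.listdir() returned in the session above.
raw = b"\xe2\x98\x83"

decoded = raw.decode("utf-8")
print(len(decoded), unicodedata.name(decoded))

# Encoded as UTF-16 (little-endian, no BOM) the same character is two bytes,
# which is not what the listing shows.
print("\N{SNOWMAN}".encode("utf-16-le"))
```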
Jean-Paul


Re: [Python-Dev] r88178 - python/branches/py3k/Lib/test/crashers/underlying_dict.py

2011-01-25 Thread Antoine Pitrou
On Tue, 25 Jan 2011 01:00:28 +0100 (CET)
benjamin.peterson  wrote:
> Author: benjamin.peterson
> Date: Tue Jan 25 01:00:28 2011
> New Revision: 88178
> 
> Log:
> another pretty crasher served up by pypy

Some comments would be nice. Right now it looks pretty close to
deliberately obfuscated code (especially with the call to
gc.get_referrers()).

Regards

Antoine.




Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread Nick Coghlan
On Tue, Jan 25, 2011 at 6:17 AM, "Martin v. Löwis"  wrote:
> A new function PyUnicode_AsUTF8 is provided to access the UTF-8
> representation. It is thus identical to the existing
> _PyUnicode_AsString, which is removed. The function will compute the
> utf8 representation when first called. Since this representation will
> consume memory until the string object is released, applications
> should use the existing PyUnicode_AsUTF8String where possible
> (which generates a new string object every time). API that implicitly
> converts a string to a char* (such as the ParseTuple functions) will
> use this function to compute a conversion.

I'm not entirely clear as to what "this function" is referring to here.

I'm also dubious of the "PyUnicode_Finalize" name - "PyUnicode_Ready"
might be a better option (PyType_Ready seems a better analogy for a
"I've filled everything in, please calculate the derived fields now"
than Py_Finalize).

More generally, let me see if I understand the proposed structure correctly:

str: Always set once PyUnicode_Ready() has been called.
  Always points to the canonical representation of the string (as
indicated by PyUnicode_Kind)
length: Always set once PyUnicode_Ready() has been called. Specifies
the number of code points in the string.

wstr: Set only if PyUnicode_AsUnicode has been called on the string.
If (sizeof(wchar_t) == 2 && PyUnicode_Kind() == PyUnicode_2BYTE)
or (sizeof(wchar_t) == 4 && PyUnicode_Kind() == PyUnicode_4BYTE), wstr
= str, otherwise wstr points to dedicated memory
wstr_length: Valid only if wstr != NULL
If wstr_length != length, indicates presence of surrogate pairs in
a UCS-2 string (i.e. sizeof(wchar_t) == 2, PyUnicode_Kind() ==
PyUnicode_4BYTE).

utf8: Set only if PyUnicode_AsUTF8 has been called on the string.
If string contents are pure ASCII, utf8 = str, otherwise utf8
points to dedicated memory.
utf8_length: Valid only if utf8 != NULL

One change I would propose is that rather than hiding flags in the low
order bits of the str pointer, we expand the use of the existing
"state" field to cover the representation information in addition to
the interning information. I would also suggest explicitly flagging
internally whether or not a 1 byte string is ASCII or Latin-1 along
the lines of:

/* Already existing string state constants */
#define SSTATE_NOT_INTERNED 0x00
#define SSTATE_INTERNED_MORTAL 0x01
#define SSTATE_INTERNED_IMMORTAL 0x02
/* New string state constants */
#define SSTATE_INTERN_MASK 0x03
#define SSTATE_KIND_ASCII 0x00
#define SSTATE_KIND_LATIN1 0x04
#define SSTATE_KIND_2BYTE 0x08
#define SSTATE_KIND_4BYTE 0x0C
#define SSTATE_KIND_MASK 0x0C


PyUnicode_Kind would then return PyUnicode_1BYTE for strings that were
flagged internally as either ASCII or LATIN1.
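
To make the bit layout concrete, here is a small Python model of the masking (a sketch only; the real thing would be C macros). The constant values are copied from the list above, while KIND_1BYTE, KIND_2BYTE, KIND_4BYTE and the two helper functions are hypothetical names for illustration:

```python
# Constants as proposed above.
SSTATE_NOT_INTERNED = 0x00
SSTATE_INTERNED_MORTAL = 0x01
SSTATE_INTERNED_IMMORTAL = 0x02
SSTATE_INTERN_MASK = 0x03
SSTATE_KIND_ASCII = 0x00
SSTATE_KIND_LATIN1 = 0x04
SSTATE_KIND_2BYTE = 0x08
SSTATE_KIND_4BYTE = 0x0C
SSTATE_KIND_MASK = 0x0C

# Hypothetical public kind values, standing in for PyUnicode_1BYTE etc.
KIND_1BYTE, KIND_2BYTE, KIND_4BYTE = 1, 2, 4


def intern_state(state):
    """Extract the interning bits (the low two bits)."""
    return state & SSTATE_INTERN_MASK


def unicode_kind(state):
    """Map the internal kind bits to the public 1/2/4-byte kind."""
    kind_bits = state & SSTATE_KIND_MASK
    if kind_bits in (SSTATE_KIND_ASCII, SSTATE_KIND_LATIN1):
        return KIND_1BYTE
    if kind_bits == SSTATE_KIND_2BYTE:
        return KIND_2BYTE
    return KIND_4BYTE


state = SSTATE_INTERNED_MORTAL | SSTATE_KIND_LATIN1
print(intern_state(state), unicode_kind(state))
```

The two masks do not overlap, so both pieces of information fit in the low four bits of the field.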

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] tahoe-lafs

2011-01-25 Thread Nick Coghlan
On Tue, Jan 25, 2011 at 2:18 AM, Earney, Billy C. wrote:

> I want to make it clear that I am in no way associated with the tahoe-lafs
> project.  I do not want my email to make that project look bad.  That was
> not my intention.
>

Good to know. I was also in a somewhat grumpy mood when I wrote my last
post, so take it with a grain of salt :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Xavier Morel
On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> 
> * If you can pick a set of encodings that are valid (utf-8 for Linux and
>  MacOS

HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right 
here you've already broken Python modules on OSX.

And as far as I know, Linux software/FS generally use NFC (I've already seen 
this issue cause trouble)



Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Stephen J. Turnbull
As Nick points out, nobody really seems to think this is an
argument against your patch.  I'm going to bow out of this thread
after this post, as I'm clearly out of my technical depth.

Victor Stinner writes:

 > Le lundi 24 janvier 2011 11:35:22, Stephen J. Turnbull a écrit :
 > > ... VFAT-formatted file systems and Shift JIS file names ...
 > 
 > I missed something: VFAT stores filenames as unicode (whereas FAT only 
 > supports byte filenames). Well, VFAT stores filenames twice: as an 8+3 byte
 > string and as a 255 unicode (UTF-16-LE) string.

I don't know what it is; I didn't have char-device-level access to the
file system, nor did I have the specs (it was a proprietary phone by a
Japanese OEM).  It *presented* filenames in Shift JIS when mounted on
Linux with the vfat filesystem (either "mount -t vfat /dev/sde1
/mnt/gadget" or "mount -t auto /dev/sde1 /mnt/gadget").  Maybe there
is some unusual layer to translate from Unicode there, I'm not
familiar with Linux kernel drivers and libc facilities (such
special-casing is a common pattern in programming for Japanese;
remember, the Japanese had to deal with these issues before there was
any standard for them).

 > On which OS do you access this VFAT file system? On Windows, you have two
 > APIs: bytes (*A) and wide character (*W). If you use the wide character,
 > there is no explicit encoding at all. Linux has two mount options to
 > control unicode on a VFAT filesystem: "codepage" for the byte filenames
 > (use Shift JIS here) and "iocharset" for the unicode filenames (I don't
 > understand this option).

I didn't either, in fact this is the first I've heard of it, so I've
never tried it.

 > I suppose that Shift JIS is used to encode the filename in the 8+3 byte
 > string form.

Could be, but I'm pretty sure these were long filenames, although
maybe they were just short enough (that is, I don't recall noticing
any truncation when mounted compared to the way they were presented on
the phone itself).  I don't use that phone anymore, it's in a box of
junk equipment somewhere