Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 26, 2011 at 11:24:54AM +0900, Stephen J. Turnbull wrote:
> Toshio Kuratomi writes:
>
> > On Linux there's no defined encoding that will work; file names are just
> > bytes to the Linux kernel so based on people's argument that the convention
> > is and should be that filenames are utf-8 and anything else is
> > a misconfigured system -- python should mandate that its module filenames on
> > Linux are utf-8 rather than using the user's locale settings.
>
> This isn't going to work where I live (Tsukuba). At the national
> university alone there are hundreds of pre-existing *nix systems whose
> filesystems were often configured a decade or more ago. Even if the
> hardware and OS have been upgraded, the filesystems are usually
> migrated as-is, with OS configuration tweaks to accommodate them. Many
> of them use EUC-JP (and servers often Shift JIS). That means that you
> won't be able to read module names with ls, and that will make Python
> unacceptable for this purpose. I imagine that in Russia the same is
> true for the various Cyrillic encodings.
>
Sure ... but on these systems, neither read-modules-as-locale nor
read-modules-as-utf-8 is a workable solution, correct? Especially if
the OS does get upgraded but the filesystems with user data (and user
created modules) are migrated as-is, you'll run into situations where
system installed modules are in utf-8 and user created modules are in
shift-jis, so something will always be broken.

The only way to make sure that modules work is to restrict them to
ASCII-only on the filesystem. But because unicode module names are seen
as a necessary feature, the question is which way forward is going to
lead to the least brokenness. That could be locale... but from the
python2 locale-related bugs that I get to look at, I doubt it.

> I really don't think there is anything that can be done here except to
> warn people that "Kids, these stunts are performed by highly-trained
> professionals. Don't try this at home!" Of course they will anyway,
> but at least they will have been warned in sufficiently strong terms
> that they might pay attention and be able to recover when they run
> into bizarre import exceptions.
>
So on the subject of warnings... I think one reason it's better to pick
an encoding for the platform/filesystem rather than to use the locale is
that people will get an error or a warning at the appropriate time: the
first time they attempt to create and import a module with a filename
that's not encoded in the correct encoding for the platform.

It's all very well to say: "We wrote in the documentation on
http://docs.python.org/distutils/introduction.html#Choosing-a-name
that only ASCII names should be used when distributing python modules"
but if the interpreter doesn't complain when people use a non-ASCII
filename we all know that they aren't going to look in the
documentation; they'll try it and if it works they'll learn that habit.

-Toshio

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
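A minimal sketch of the mixed-encoding problem described above, assuming Python 3 and a made-up Japanese module name; the helper decodes_as is hypothetical and is not part of any patch discussed in this thread. It shows that the same name is stored as different bytes under UTF-8, EUC-JP, and Shift JIS, and that the legacy encodings are rejected by a strict UTF-8 decode.

```python
# Hypothetical module name used only for illustration (Japanese for
# "Japanese language"), written with escapes to keep the source ASCII.
MODULE_NAME = "\u65e5\u672c\u8a9e"

# The bytes a filesystem would store under three common encodings.
ENCODED = {
    "utf-8": MODULE_NAME.encode("utf-8"),
    "euc_jp": MODULE_NAME.encode("euc_jp"),
    "shift_jis": MODULE_NAME.encode("shift_jis"),
}


def decodes_as(raw, encoding):
    """Return True if the raw filename bytes are valid in `encoding`."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


for stored_as, raw in ENCODED.items():
    print(stored_as, repr(raw), "valid UTF-8:", decodes_as(raw, "utf-8"))
```

A strict decode like this is one way an importer could fail early on a mis-encoded name; whether that is the right policy is exactly what the thread is debating.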
Re: [Python-Dev] Import and unicode: part two
Toshio Kuratomi writes:

> On Linux there's no defined encoding that will work; file names are just
> bytes to the Linux kernel so based on people's argument that the convention
> is and should be that filenames are utf-8 and anything else is
> a misconfigured system -- python should mandate that its module filenames on
> Linux are utf-8 rather than using the user's locale settings.

This isn't going to work where I live (Tsukuba). At the national
university alone there are hundreds of pre-existing *nix systems whose
filesystems were often configured a decade or more ago. Even if the
hardware and OS have been upgraded, the filesystems are usually
migrated as-is, with OS configuration tweaks to accommodate them. Many
of them use EUC-JP (and servers often Shift JIS). That means that you
won't be able to read module names with ls, and that will make Python
unacceptable for this purpose. I imagine that in Russia the same is
true for the various Cyrillic encodings.

I really don't think there is anything that can be done here except to
warn people that "Kids, these stunts are performed by highly-trained
professionals. Don't try this at home!" Of course they will anyway,
but at least they will have been warned in sufficiently strong terms
that they might pay attention and be able to recover when they run
into bizarre import exceptions.

Oh, yeah, don't forget to apply Victor's patch, which allows Python to
keep the promises it can make about consistency.
Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py
This broke the buildbots (R. David Murray thinks you may have
forgotten to call super() in the 'payload is None' branch). Are you
getting code reviews and fully running the test suite before
committing? We are in RC.

On Tue, Jan 25, 2011 at 16:39, victor.stinner wrote:
> Author: victor.stinner
> Date: Wed Jan 26 01:39:19 2011
> New Revision: 88197
>
> Log:
> Fix BytesGenerator._handle_text() if the message has no payload (None)
>
> Modified:
>    python/branches/py3k/Lib/email/generator.py
>
> Modified: python/branches/py3k/Lib/email/generator.py
> ==============================================================================
> --- python/branches/py3k/Lib/email/generator.py (original)
> +++ python/branches/py3k/Lib/email/generator.py Wed Jan 26 01:39:19 2011
> @@ -377,8 +377,11 @@
>      def _handle_text(self, msg):
>          # If the string has surrogates the original source was bytes, so
>          # just write it back out.
> -        if _has_surrogates(msg._payload):
> -            self.write(msg._payload)
> +        payload = msg.get_payload()
> +        if payload is None:
> +            return
> +        if _has_surrogates(payload):
> +            self.write(payload)
>          else:
>              super(BytesGenerator,self)._handle_text(msg)
>
Re: [Python-Dev] PEP 393: Flexible String Representation
On Tue, Jan 25, 2011 at 5:43 PM, M.-A. Lemburg wrote:
> I also don't see how this could save a lot of memory. As an example
> take a French text with say 10mio code points. This would end up
> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
> on how many accents are used). That's a saving of -10MB compared to
> today's implementation :-)

If I am reading the PEP right, which I may not be as I am no expert on
Unicode, the new implementation would actually give a 10MB saving since
the wchar field is optional, so only the str (Latin-1) and utf8 fields
would need to be stored. How it decides not to store one field or
another would need to be clarified in the PEP, if I am right.
Re: [Python-Dev] [Python-checkins] r88197 - python/branches/py3k/Lib/email/generator.py
On Wed, Jan 26, 2011 at 10:39 AM, victor.stinner wrote:
> Author: victor.stinner
> Date: Wed Jan 26 01:39:19 2011
> New Revision: 88197
>
> Log:
> Fix BytesGenerator._handle_text() if the message has no payload (None)

Folks, for the peace of mind of python-checkins watchers, please
remember to mention the reviewer's name when checking in fixes during
the RC period (the last one I checked had been reviewed by Georg on
the issue tracker, but it's hard to check without even an issue number
to look up).

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Location of tests for packages
On Wed, Jan 26, 2011 at 4:16 AM, Alexander Belopolsky wrote:
> FWIW, I am +0 on consolidating tests under Lib/test. One of the
> reasons that I have not seen mentioned is that it is well-known that
> the test package is not part of the official stdlib API and can be
> changed/restructured in backward incompatible ways. It is not obvious
> whether the same applies to say lib2to3.tests or ctypes.test.

I am +0 for the same reason as Alexander. The test subpackages should
either be moved under the test package, or, for packages with PyPI
distributed backports for previous versions, they should be prefixed
with a leading underscore to make it clear that they're private
implementation details and backwards compatibility guarantees don't
apply.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] PEP 393: Flexible String Representation
On Tue, 25 Jan 2011 21:08:01 +1000 Nick Coghlan wrote:
>
> One change I would propose is that rather than hiding flags in the low
> order bits of the str pointer, we expand the use of the existing
> "state" field to cover the representation information in addition to
> the interning information.

+1, by the way. The "state" field has many bits available (even if we
decide to make it a char rather than an int).

Regards

Antoine.
Re: [Python-Dev] r88178 - python/branches/py3k/Lib/test/crashers/underlying_dict.py
> Some comments would be nice. Right now it looks pretty close to
> deliberately obfuscated code (especially with the call to
> gc.get_referrers()).

That call tries to get at the class dictionary, rather than just the
dict_proxy that you get from A.__dict__. There should be two referrers
to thingy: the class dict, and the module dict. The class dict will
have a __module__ key.

I agree the program should print 2, though.

Regards,
Martin
Re: [Python-Dev] PEP 393: Flexible String Representation
For the record:

> I also don't see how this could save a lot of memory. As an example
> take a French text with say 10mio code points. This would end up
> appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
> one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
> on how many accents are used).

Typical French text seems to have 5% non-ASCII characters. So the
number of UTF-8 bytes needed to represent a French text would only be
5% higher than the number of code points.

Anyway, it's quite obvious that Martin's goal is that only one
representation gets created most of the time. To quote the draft:

“All three representations are optional, although the str form is
considered the canonical representation which can be absent only
while the string is being created.”

Regards

Antoine.
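The arithmetic behind the 5% estimate can be checked with a short sketch. It assumes Python 3 and a synthetic sample with exactly 5 accented characters per 100; real French text will differ in detail, and the variable names are illustrative only.

```python
# Synthetic sample: 5 accented characters per 100, matching the 5%
# figure quoted above. This is an assumption, not measured French text.
sample = "\u00e9" * 5 + "e" * 95

code_points = len(sample)
latin1_bytes = len(sample.encode("latin-1"))   # 1 byte per character
utf8_bytes = len(sample.encode("utf-8"))       # 2 bytes per accented char
ucs2_bytes = len(sample.encode("utf-16-le"))   # 2 bytes per BMP character

print(code_points, latin1_bytes, utf8_bytes, ucs2_bytes)

# Scale the per-100 figures up to the 10 million code points of the example.
scale = 10_000_000 // code_points
for label, size in [("Latin-1", latin1_bytes),
                    ("UTF-8", utf8_bytes),
                    ("UCS2", ucs2_bytes)]:
    print("%-8s %.1f MB" % (label, size * scale / 1e6))
```

Under that assumption the UTF-8 form is 5% larger than the Latin-1 form, well below the 15MB guess in the quoted message.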
Re: [Python-Dev] PEP 393: Flexible String Representation
I'll comment more on this later this week...

From my first impression, I'm not too thrilled by the prospect of
making the Unicode implementation more complicated by having three
different representations on each object.

I also don't see how this could save a lot of memory. As an example
take a French text with say 10mio code points. This would end up
appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB),
one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
on how many accents are used). That's a saving of -10MB compared to
today's implementation :-)

"Martin v. Löwis" wrote:
> I have been thinking about Unicode representation for some time now.
> This was triggered, on the one hand, by discussions with Glyph Lefkowitz
> (who complained that his server app consumes too much memory), and Carl
> Friedrich Bolz (who profiled Python applications to determine that
> Unicode strings are among the top consumers of memory in Python).
> On the other hand, this was triggered by the discussion on supporting
> surrogates in the library better.
>
> I'd like to propose PEP 393, which takes a different approach,
> addressing both problems simultaneously: by getting a flexible
> representation (one that can be either 1, 2, or 4 bytes), we can
> support the full range of Unicode on all systems, but still use
> only one byte per character for strings that are pure ASCII (which
> will be the majority of strings for the majority of users).
>
> You'll find the PEP at
>
> http://www.python.org/dev/peps/pep-0393/
>
> For convenience, I include it below.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source
eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/
Re: [Python-Dev] r88178 - python/branches/py3k/Lib/test/crashers/underlying_dict.py
On Tuesday, 25 January 2011 at 20:11 +0200, Maciej Fijalkowski wrote:
> On Tue, Jan 25, 2011 at 1:26 PM, Antoine Pitrou wrote:
> > On Tue, 25 Jan 2011 01:00:28 +0100 (CET)
> > benjamin.peterson wrote:
> >> Author: benjamin.peterson
> >> Date: Tue Jan 25 01:00:28 2011
> >> New Revision: 88178
> >>
> >> Log:
> >> another pretty crasher served up by pypy
> >
> > Some comments would be nice. Right now it looks pretty close to
> > deliberately obfuscated code (especially with the call to
> > gc.get_referrers()).
> >
> > Regards
> >
> > Antoine.
> >
>
> It gets to a dict of class circumventing dictproxy. It's yet unclear
> why it segfaults.

Perhaps the method cache?
But why the comment "# should print 1"? Shouldn't it print 2 instead?
Re: [Python-Dev] Location of tests for packages
On Tue, Jan 25, 2011 at 12:38 PM, Brett Cannon wrote:
>.. If we move some modules and not others purely because some
> distros choose not to ship e.g., ctypes and sqlite3

I don't see why this is a problem. Regrtest already has a mechanism
that allows skipping tests based on various criteria. This mechanism
works for both packages and flat modules that can be optionally
installed.

FWIW, I am +0 on consolidating tests under Lib/test. One of the
reasons that I have not seen mentioned is that it is well-known that
the test package is not part of the official stdlib API and can be
changed/restructured in backward incompatible ways. It is not obvious
whether the same applies to say lib2to3.tests or ctypes.test.

If you are interested in seeing what it takes to move tests out of a
package, I moved the json tests to Lib/test/json_tests in r86875. It is
not hard, but does require some changes to imports.
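As a rough illustration of skipping tests for optionally installed modules, here is a sketch using only unittest and importlib from Python 3. It is not regrtest's actual mechanism; the decorator requires_module, the test class, and the missing module name are all hypothetical.

```python
import importlib.util
import unittest


def requires_module(name):
    """Skip the decorated test when top-level module `name` is not found."""
    available = importlib.util.find_spec(name) is not None
    return unittest.skipUnless(available, f"{name} is not installed")


class OptionalDependencyTests(unittest.TestCase):
    @requires_module("json")
    def test_with_present_module(self):
        import json
        self.assertEqual(json.loads("[1, 2]"), [1, 2])

    # Made-up module name that is assumed not to exist anywhere.
    @requires_module("no_such_module_for_this_sketch")
    def test_with_missing_module(self):
        self.fail("should have been skipped")


def run_demo():
    """Run the two tests and return the collected TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        OptionalDependencyTests)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_demo()
print("skipped:", len(result.skipped), "failures:", len(result.failures))
```

The same decorator works whether the tests live inside the package or under a central test directory, which is the point being made about location not mattering for skipping.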
Re: [Python-Dev] r88178 - python/branches/py3k/Lib/test/crashers/underlying_dict.py
On Tue, Jan 25, 2011 at 1:26 PM, Antoine Pitrou wrote:
> On Tue, 25 Jan 2011 01:00:28 +0100 (CET)
> benjamin.peterson wrote:
>> Author: benjamin.peterson
>> Date: Tue Jan 25 01:00:28 2011
>> New Revision: 88178
>>
>> Log:
>> another pretty crasher served up by pypy
>
> Some comments would be nice. Right now it looks pretty close to
> deliberately obfuscated code (especially with the call to
> gc.get_referrers()).
>
> Regards
>
> Antoine.
>

It gets to the dict of the class, circumventing dictproxy. It's not yet
clear why it segfaults.
Re: [Python-Dev] Location of tests for packages
On Mon, Jan 24, 2011 at 17:19, Raymond Hettinger wrote:
>
> On Jan 24, 2011, at 3:40 PM, Michael Foord wrote:
>> It isn't just unittest, it seems that all *test packages* are in their
>> respective package and not Lib/test except for the json module where Raymond
>> already moved the tests:
>>
>>    distutils/tests
>>    email/test
>>    ctypes/test
>>    importlib/test
>>    lib2to3/tests
>>    sqlite3/test
>>    tkinter/test
>>
>> So I'm a little confused as to why the focus on the *unittest* test suite.
>
>
> There's not a focus on unittest. Importlib should also move under Lib/test
> and when email is ready, it too should fully join the organization of
> the overall project (Doc, Lib, Lib/test, Modules, Objects, Tools).

Just to clarify my position since importlib keeps getting brought up
as an example, I'm fine with a move but I won't be putting the work in
to do the move if there is actually consensus to make this a
stdlib-wide policy. And I am assuming that the directory will be moved
wholesale to Lib/test/importlib (with proper fixes for any relative
imports) along with verification that importlib.test.__main__
continues to work (naming it test.importlib_tests seems rather
redundant compared to test.importlib).

While I'm for consistency, obviously a trend was started by ctypes and
sqlite3 that the rest of us who created full packages followed up to
this point. If we move some modules and not others purely because some
distros choose not to ship e.g., ctypes and sqlite3, that will get
annoying w/o some very clear explanation/delineation as to why some
packages have a special rule to follow (I'm guessing "packages that
have external dependencies" would be it).
Re: [Python-Dev] Import and unicode: part two
On Tue, Jan 25, 2011 at 10:22:41AM +0100, Xavier Morel wrote:
> On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> >
> > * If you can pick a set of encodings that are valid (utf-8 for Linux and
> > MacOS
>
> HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right
> here you've already broken Python modules on OSX.
>
Others have been saying that Mac OSX's HFS+ uses UTF-8. But the
question is not whether UTF-16 or UTF-8 is used by HFS+. It's whether
you can sensibly decide on an encoding from the type of system that is
being run on. This could be querying the filesystem or a check on
sys.platform or some other method. I don't know what detection the
current code does.

On Linux there's no defined encoding that will work; file names are just
bytes to the Linux kernel so based on people's argument that the convention
is and should be that filenames are utf-8 and anything else is
a misconfigured system -- python should mandate that its module filenames on
Linux are utf-8 rather than using the user's locale settings.

>
> And as far as I know, Linux software/FS generally use NFC (I've already seen
> this issue cause trouble)
>
Linux filesystems store names as bytes with a small blacklist (so you
can't use the NUL byte in a filename, for instance). Linux software is
free to use any normal form it wants. If one program used NFC and
another used NFD, the FS would record two separate files with two
separate filenames. Other programs might or might not display this
correctly.

Example:

$ touch cafe
$ python
Python 2.7 (r27:82500, Sep 16 2010, 18:02:00)
>>> import os
>>> import unicodedata
>>> a=u'café'
>>> b=unicodedata.normalize('NFC', a)
>>> c=unicodedata.normalize('NFD', a)
>>> open(b.encode('utf8'), 'w').close()
>>> open(c.encode('utf8'), 'w').close()
>>> os.listdir(u'.')
[u'people-etc-changes.txt', u'cafe\u0301', u'cafe',
 u'people-etc-changes.sha256sum', u'caf\xe9']
>>> os.listdir('.')
['people-etc-changes.txt', 'cafe\xcc\x81', 'cafe',
 'people-etc-changes.sha256sum', 'caf\xc3\xa9']
>>> ^D
$ ls -al .
drwxrwxr-x.  2 badger badger 4096 Jan 25 07:46 .
drwxr-xr-x. 17 badger badger 4096 Jan 24 18:27 ..
-rw-rw-r--.  1 badger badger    0 Jan 25 07:45 cafe
-rw-rw-r--.  1 badger badger    0 Jan 25 07:46 cafe
-rw-rw-r--.  1 badger badger    0 Jan 25 07:46 café
$ ls -al cafe
-rw-rw-r--.  1 badger badger    0 Jan 25 07:45 cafe
$ ls -al cafe?
-rw-rw-r--.  1 badger badger    0 Jan 25 07:46 cafe

Now in this case, the decomposed form of the filename is being displayed
incorrectly and the shell treats the decomposed character as two
characters instead of one. However, when you view these files in dolphin
(the KDE file manager) you properly see café repeated twice.

-Toshio
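The byte sequences in the session above can be reproduced without touching the filesystem. This is a minimal Python 3 sketch using unicodedata from the standard library; the helper same_filename is hypothetical and only illustrates comparing names after normalizing both sides.

```python
import unicodedata

name = "caf\u00e9"  # 'cafe' ending in a precomposed e-acute

nfc = unicodedata.normalize("NFC", name)  # one code point for the e-acute
nfd = unicodedata.normalize("NFD", name)  # 'e' followed by U+0301


def same_filename(a, b, form="NFC"):
    """Compare two names after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(repr(nfc.encode("utf-8")))  # bytes a Linux filesystem would store
print(repr(nfd.encode("utf-8")))
print("equal as-is:", nfc == nfd)
print("equal after normalizing:", same_filename(nfc, nfd))
```

A byte-oriented filesystem sees two different names here, which is why both files show up in the listing.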
Re: [Python-Dev] [Python-checkins] r88155 - python/branches/py3k/Doc/whatsnew/3.2.rst
On Mon, Jan 24, 2011 at 11:51 AM, raymond.hettinger wrote:
> Author: raymond.hettinger
> Date: Mon Jan 24 02:51:49 2011
> New Revision: 88155
>
> Log:
> Add entries for dis, dbm, and ctypes.
>
>
> Modified:
>    python/branches/py3k/Doc/whatsnew/3.2.rst
>
> Modified: python/branches/py3k/Doc/whatsnew/3.2.rst
> ==============================================================================
> --- python/branches/py3k/Doc/whatsnew/3.2.rst (original)
> +++ python/branches/py3k/Doc/whatsnew/3.2.rst Mon Jan 24 02:51:49 2011
> @@ -1599,6 +1599,51 @@
>
>  (Contributed by Ron Adam; :issue:`2001`.)
>
> +dis
> +---

For the dis module there is also the change to dis.dis() itself from
issue 6507 - you can now pass source strings directly to dis without
needing to compile them first:

>>> dis.dis("1 + 2")
  1           0 LOAD_CONST               2 (3)
              3 RETURN_VALUE

> +The :mod:`dis` module gained two new functions for inspecting code,
> +:func:`~dis.code_info` and :func:`~dis.show_code`. Both provide detailed code
> +object information for the supplied function, method, source code string or code
> +object. The former returns a string and the latter prints it::
> +
> +    >>> import dis, random
> +    >>> show_code(random.choice)

Typo here - missing a "dis." at the start of the line.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
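A short sketch of the three entry points mentioned above, assuming Python 3.2 or later. The exact bytecode and the fields printed vary between interpreter versions, so no output is shown.

```python
import dis

source = "1 + 2"

# code_info() returns the code object details as a multi-line string.
info = dis.code_info(source)
print(info)

# show_code() prints the same details instead of returning them.
dis.show_code(source)

# dis.dis() accepts the source string directly and compiles it itself.
dis.dis(source)
```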
Re: [Python-Dev] Import and unicode: part two
On 09:22 am, catch-...@masklinn.net wrote:
> On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> >
> > * If you can pick a set of encodings that are valid (utf-8 for Linux and
> > MacOS
>
> HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right
> here you've already broken Python modules on OSX.

Are you sure about the UTF-16 part? Evidence strongly points towards
UTF-8:

$ python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata, os
>>> file(u'\N{SNOWMAN}', 'w').close()
>>> os.listdir('.')
['\xe2\x98\x83']
>>> unicodedata.name('\xe2\x98\x83'.decode('utf-8'))
'SNOWMAN'
>>>

Jean-Paul
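For comparison, a small Python 3 sketch of what the two candidate encodings would produce for the same character. It only encodes in memory and makes no claim about what HFS+ stores on disk; the variable names are illustrative.

```python
import unicodedata

snowman = "\N{SNOWMAN}"  # U+2603

as_utf8 = snowman.encode("utf-8")       # three bytes
as_utf16 = snowman.encode("utf-16-le")  # two bytes, no byte order mark

print(repr(as_utf8))
print(repr(as_utf16))

# Round trip: the UTF-8 bytes seen in the listing decode back to the
# original character.
print(unicodedata.name(as_utf8.decode("utf-8")))
```

The bytes returned by os.listdir in the session match the UTF-8 form, which is the evidence the message is pointing at for the byte-level API.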
Re: [Python-Dev] r88178 - python/branches/py3k/Lib/test/crashers/underlying_dict.py
On Tue, 25 Jan 2011 01:00:28 +0100 (CET)
benjamin.peterson wrote:
> Author: benjamin.peterson
> Date: Tue Jan 25 01:00:28 2011
> New Revision: 88178
>
> Log:
> another pretty crasher served up by pypy

Some comments would be nice. Right now it looks pretty close to
deliberately obfuscated code (especially with the call to
gc.get_referrers()).

Regards

Antoine.
Re: [Python-Dev] PEP 393: Flexible String Representation
On Tue, Jan 25, 2011 at 6:17 AM, "Martin v. Löwis" wrote:
> A new function PyUnicode_AsUTF8 is provided to access the UTF-8
> representation. It is thus identical to the existing
> _PyUnicode_AsString, which is removed. The function will compute the
> utf8 representation when first called. Since this representation will
> consume memory until the string object is released, applications
> should use the existing PyUnicode_AsUTF8String where possible
> (which generates a new string object every time). API that implicitly
> converts a string to a char* (such as the ParseTuple functions) will
> use this function to compute a conversion.

I'm not entirely clear as to what "this function" is referring to here.

I'm also dubious of the "PyUnicode_Finalize" name - "PyUnicode_Ready"
might be a better option (PyType_Ready seems a better analogy for a
"I've filled everything in, please calculate the derived fields now"
than Py_Finalize).

More generally, let me see if I understand the proposed structure correctly:

str: Always set once PyUnicode_Ready() has been called.
  Always points to the canonical representation of the string (as
  indicated by PyUnicode_Kind)
length: Always set once PyUnicode_Ready() has been called.
  Specifies the number of code points in the string.

wstr: Set only if PyUnicode_AsUnicode has been called on the string.
  If (sizeof(wchar_t) == 2 && PyUnicode_Kind() == PyUnicode_2BYTE)
  or (sizeof(wchar_t) == 4 && PyUnicode_Kind() == PyUnicode_4BYTE),
  wstr = str, otherwise wstr points to dedicated memory
wstr_length: Valid only if wstr != NULL
  If wstr_length != length, indicates presence of surrogate pairs in
  a UCS-2 string (i.e. sizeof(wchar_t) == 2, PyUnicode_Kind() ==
  PyUnicode_4BYTE).

utf8: Set only if PyUnicode_AsUTF8 has been called on the string.
  If string contents are pure ASCII, utf8 = str, otherwise utf8
  points to dedicated memory.
utf8_length: Valid only if utf8 != NULL

One change I would propose is that rather than hiding flags in the low
order bits of the str pointer, we expand the use of the existing
"state" field to cover the representation information in addition to
the interning information.

I would also suggest explicitly flagging internally whether or not a 1
byte string is ASCII or Latin-1 along the lines of:

/* Already existing string state constants */
#define SSTATE_NOT_INTERNED 0x00
#define SSTATE_INTERNED_MORTAL 0x01
#define SSTATE_INTERNED_IMMORTAL 0x02
/* New string state constants */
#define SSTATE_INTERN_MASK 0x03
#define SSTATE_KIND_ASCII 0x00
#define SSTATE_KIND_LATIN1 0x04
#define SSTATE_KIND_2BYTE 0x08
#define SSTATE_KIND_4BYTE 0x0C
#define SSTATE_KIND_MASK 0x0C

PyUnicode_Kind would then return PyUnicode_1BYTE for strings that were
flagged internally as either ASCII or LATIN1.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
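The bit layout suggested above can be modelled in a few lines of Python. This is only a sketch of the masking arithmetic: the constant values are taken from the message, while the helper functions pack_state, interned_part, kind_part, and public_kind are hypothetical and do not correspond to any C API.

```python
# Constant values copied from the proposal above.
SSTATE_NOT_INTERNED = 0x00
SSTATE_INTERNED_MORTAL = 0x01
SSTATE_INTERNED_IMMORTAL = 0x02
SSTATE_INTERN_MASK = 0x03

SSTATE_KIND_ASCII = 0x00
SSTATE_KIND_LATIN1 = 0x04
SSTATE_KIND_2BYTE = 0x08
SSTATE_KIND_4BYTE = 0x0C
SSTATE_KIND_MASK = 0x0C


def pack_state(interned, kind):
    """Combine interning bits (low two) and kind bits (next two)."""
    return interned | kind


def interned_part(state):
    """Extract the interning information from a packed state value."""
    return state & SSTATE_INTERN_MASK


def kind_part(state):
    """Extract the internal representation kind from a packed state value."""
    return state & SSTATE_KIND_MASK


def public_kind(state):
    """Map the internal kind to the name PyUnicode_Kind would report."""
    kind = kind_part(state)
    if kind in (SSTATE_KIND_ASCII, SSTATE_KIND_LATIN1):
        return "PyUnicode_1BYTE"
    if kind == SSTATE_KIND_2BYTE:
        return "PyUnicode_2BYTE"
    return "PyUnicode_4BYTE"


state = pack_state(SSTATE_INTERNED_MORTAL, SSTATE_KIND_LATIN1)
print(hex(state), interned_part(state), public_kind(state))
```

Because the two masks do not overlap, both pieces of information fit in four bits of the existing field.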
Re: [Python-Dev] tahoe-lafs
On Tue, Jan 25, 2011 at 2:18 AM, Earney, Billy C. wrote:
> I want to make it clear that I am in no way associated with the tahoe-lafs
> project. I do not want my email to make that project look bad. That was
> not my intention.
>

Good to know. I was also in a somewhat grumpy mood when I wrote my
last post, so take it with a grain of salt :)

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Import and unicode: part two
On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
>
> * If you can pick a set of encodings that are valid (utf-8 for Linux and
> MacOS

HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD).
Right here you've already broken Python modules on OSX.

And as far as I know, Linux software/FS generally use NFC (I've already
seen this issue cause trouble)
Re: [Python-Dev] Import and unicode: part two
As Nick points out, nobody really seems to think this is an argument
against your patch. I'm going to bow out of this thread after this
post, as I'm clearly out of my technical depth.

Victor Stinner writes:
> On Monday, 24 January 2011 11:35:22, Stephen J. Turnbull wrote:
> > ... VFAT-formatted file systems and Shift JIS file names ...
>
> I missed something: VFAT stores filenames as unicode (whereas FAT only
> supports byte filenames). Well, VFAT stores filenames twice: as an 8+3 byte
> string and as a 255-character unicode (UTF-16-LE) string.

I don't know what it is; I didn't have char-device-level access to the
file system, nor did I have the specs (it was a proprietary phone by a
Japanese OEM). It *presented* filenames in Shift JIS when mounted on
Linux with the vfat filesystem (either "mount -t vfat /dev/sde1
/mnt/gadget" or "mount -t auto /dev/sde1 /mnt/gadget"). Maybe there is
some unusual layer to translate from Unicode there; I'm not familiar
with Linux kernel drivers and libc facilities (such special-casing is
a common pattern in programming for Japanese; remember, the Japanese
had to deal with these issues before there was any standard for them).

> On which OS do you access this VFAT file system? On Windows, you have two
> APIs: bytes (*A) and wide character (*W). If you use the wide character, there
> is no explicit encoding at all. Linux has two mount options to control unicode on
> a VFAT filesystem: "codepage" for the byte filenames (use Shift JIS here) and
> "iocharset" for the unicode filenames (I don't understand this
> option).

I didn't either; in fact this is the first I've heard of it, so I've
never tried it.

> I suppose that Shift JIS is used to encode the filename in the 8+3 byte string
> form.

Could be, but I'm pretty sure these were long filenames, although
maybe they were just short enough (that is, I don't recall noticing
any truncation when mounted compared to the way they were presented on
the phone itself). I don't use that phone anymore; it's in a box of
junk equipment somewhere.