Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Raymond Hettinger


[Antoine Pitrou]

Besides, Bob doesn't really seem to care about
porting to py3k (he hasn't said anything about it until now, other than that he
didn't feel competent to do it).


His actual words were: "I will need some help with 3.0 since I am not well
versed in the changes to the C API or Python code for that, but merging for
2.6.1 should be no big deal."



[MvL]

That is quite unfortunate, and suggests that perhaps the module
shouldn't have been added to Python in the first place.


Bob participated actively in http://bugs.python.org/issue4136 and was responsive to detailed patch review.  He gave a popular talk 
at PyCon less than two weeks ago.  He's not derelict.




I can understand that you don't want to spend much time on it. How
about removing it from 3.1? We could re-add it when long-term support
becomes more likely.


I'm speechless.


Raymond 


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Dirkjan Ochtman
On Thu, Apr 9, 2009 at 07:15, Antoine Pitrou solip...@pitrou.net wrote:
 The RFC also specifies a discrimination algorithm for non-supersets of ASCII
 (“Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.”), but it is not
 implemented in the json module:

Well, your example is bad in the context of the RFC. The RFC states
that JSON-text = object / array, meaning loads for '"hi"' isn't
strictly valid. The discrimination algorithm obviously only works in
the context of that grammar, where the first character of a document
must be { or [ and the next character can only be {, [, ", f, n, t, -,
a number, or insignificant whitespace (space, \t, \r, \n).

 >>> json.loads('"hi"')
 'hi'
 >>> json.loads(u'"hi"'.encode('utf16'))
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
     return _default_decoder.decode(s)
   File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
   File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
     raise ValueError("No JSON object could be decoded")
 ValueError: No JSON object could be decoded

Cheers,

Dirkjan


Re: [Python-Dev] Deprecating PyOS_ascii_formatd

2009-04-09 Thread Nick Coghlan
Eric Smith wrote:
 And as a reminder, the py3k-short-float-repr changes are on Rietveld at
 http://codereview.appspot.com/33084/show. So far, no comments.

I skipped over the actual number crunching parts (the test suite will do
a better job than I will of telling you whether or not you have those
parts correct), but I had a look at the various other changes to make
use of the new API.

Looks like you were able to delete some fairly respectable chunks of
redundant code!

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Barry Warsaw


On Apr 9, 2009, at 1:15 AM, Antoine Pitrou wrote:


Guido van Rossum guido at python.org writes:


I'm kind of surprised that a serialization protocol like JSON wouldn't
support reading/writing bytes (as the serialized format -- I don't
care about having bytes as values, since JavaScript doesn't have
something equivalent AFAIK, and hence JSON doesn't allow it IIRC).
Marshal and Pickle, for example, *always* treat the serialized format
as bytes. And since in most cases it will be sent over a socket, at
some point the serialized representation *will* be bytes, I presume.
What makes supporting this hard?

It's not hard, it just means a lot of duplicated code if the library wants
to support both str and bytes in an optimized way as Martin alluded to.
This duplicated code already exists in the C parts to support the 2.x
semantics of accepting unicode objects as well as str, but not in the
Python parts, which explains why the bytes support is broken in py3k - in
2.x, the same Python code can be used for str and unicode.


This is an interesting question, and something I'm struggling with for  
the email package for 3.x.  It turns out to be pretty convenient to  
have both a bytes and a string API, both for input and output, but I  
think email really wants to be represented internally as bytes.   
Maybe.  Or maybe just for content bodies and not headers, or maybe  
both.  Anyway, aside from that decision, I haven't come up with an  
elegant way to allow /output/ in both bytes and strings (input is I  
think theoretically easier by sniffing the arguments).
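For the input direction, the sniffing Barry alludes to can be as simple as decoding bytes up front and reusing the existing str-based parser (a minimal sketch; `sniffing_loads` is a hypothetical name, not a proposed API):

```python
import json

def sniffing_loads(s, encoding='utf-8'):
    """Accept either str or bytes JSON input: decode bytes first,
    then hand the text to the unchanged str-based parser."""
    if isinstance(s, bytes):
        s = s.decode(encoding)
    return json.loads(s)
```

The output direction is harder precisely because there is nothing to sniff: the caller's desired result type has to be expressed in the API itself, e.g. as separate str-returning and bytes-returning entry points.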


Barry



Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Antoine Pitrou
Dirkjan Ochtman dirkjan at ochtman.nl writes:
 
 The RFC states
 that JSON-text = object / array, meaning loads for 'hi' isn't
 strictly valid.

Sure, but then:

>>> json.loads('[]')
[]
>>> json.loads(u'[]'.encode('utf16'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
    return _default_decoder.decode(s)
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded


Cheers

Antoine.




Re: [Python-Dev] Adding new features to Python 2.x (PEP 382: Namespace Packages)

2009-04-09 Thread Nick Coghlan
Martin v. Löwis wrote:
 Such a policy would then translate to a dead end for Python 2.x
 based applications.
 
 2.x based applications *are* in a dead end, with the only exit
 being portage to 3.x.

The actual end of the dead end just happens to be in 2013 or so :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Dirkjan Ochtman
On Thu, Apr 9, 2009 at 13:10, Antoine Pitrou solip...@pitrou.net wrote:
 Sure, but then:

 >>> json.loads('[]')
 []
 >>> json.loads(u'[]'.encode('utf16'))
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/home/antoine/cpython/__svn__/Lib/json/__init__.py", line 310, in loads
     return _default_decoder.decode(s)
   File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 344, in decode
     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
   File "/home/antoine/cpython/__svn__/Lib/json/decoder.py", line 362, in raw_decode
     raise ValueError("No JSON object could be decoded")
 ValueError: No JSON object could be decoded

Right. :) Just wanted to point out that your test might not be testing
what you want to test.

Cheers,

Dirkjan


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Steve Holden
Barry Warsaw wrote:
 On Apr 9, 2009, at 1:15 AM, Antoine Pitrou wrote:
 
 Guido van Rossum guido at python.org writes:

 I'm kind of surprised that a serialization protocol like JSON wouldn't
 support reading/writing bytes (as the serialized format -- I don't
 care about having bytes as values, since JavaScript doesn't have
 something equivalent AFAIK, and hence JSON doesn't allow it IIRC).
 Marshal and Pickle, for example, *always* treat the serialized format
 as bytes. And since in most cases it will be sent over a socket, at
 some point the serialized representation *will* be bytes, I presume.
 What makes supporting this hard?
 
 It's not hard, it just means a lot of duplicated code if the library
 wants to support both str and bytes in an optimized way as Martin
 alluded to. This duplicated code already exists in the C parts to
 support the 2.x semantics of accepting unicode objects as well as str,
 but not in the Python parts, which explains why the bytes support is
 broken in py3k - in 2.x, the same Python code can be used for str and
 unicode.
 
 This is an interesting question, and something I'm struggling with for
 the email package for 3.x.  It turns out to be pretty convenient to have
 both a bytes and a string API, both for input and output, but I think
 email really wants to be represented internally as bytes.  Maybe.  Or
 maybe just for content bodies and not headers, or maybe both.  Anyway,
 aside from that decision, I haven't come up with an elegant way to allow
 /output/ in both bytes and strings (input is I think theoretically
 easier by sniffing the arguments).
 
The real problem I came across in storing email in a relational database
was the inability to store messages as Unicode. Some messages have a
body in one encoding and an attachment in another, so the only ways to
store the messages are either as a monolithic bytes string that gets
parsed when the individual components are required or as a sequence of
components in the database's preferred encoding (if you want to keep the
original encoding most relational databases won't be able to help unless
you store the components as bytes).

All in all, as you might expect from a system that's been growing up
since 1970 or so, it can be quite intractable.

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Watch PyCon on video now!  http://pycon.blip.tv/



Re: [Python-Dev] decorator module in stdlib?

2009-04-09 Thread Nick Coghlan
Michele Simionato wrote:
 On Wed, Apr 8, 2009 at 7:51 PM, Guido van Rossum gu...@python.org wrote:
 There was a remark (though perhaps meant humorously) in Michele's page
 about decorators that worried me too: "For instance, typical
 implementations of decorators involve nested functions, and we all
 know that flat is better than nested." I find the nested-function
 pattern very clear and easy to grasp, whereas I find using another
 decorator (a meta-decorator?) to hide this pattern unnecessarily
 obscuring what's going on.
 
 I understand your point and I will freely admit that I have always had mixed
 feelings about the advantages of a meta decorator with
 respect to plain simple nested functions. I see pros and contras.
 If functools.update_wrapper could preserve the signature I
 would probably use it over the decorator module.

Yep, update_wrapper was a compromise along the lines of "well, at least
we can make sure the relevant metadata refers to the original function
rather than the relatively uninteresting wrapper, even if the signature
itself is lost". The idea being that you can often figure out the
signature from the doc string even when introspection has been broken by
an intervening wrapper.
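That compromise is easy to demonstrate (a sketch: the metadata copying is what update_wrapper guarantees, while the wrapper's own code object keeps its generic parameter list):

```python
import functools

def original(a, b=2):
    "Add two numbers."
    return a + b

def wrapper(*args, **kwargs):
    return original(*args, **kwargs)

functools.update_wrapper(wrapper, original)

# The metadata now refers to the original function...
assert wrapper.__name__ == 'original'
assert wrapper.__doc__ == 'Add two numbers.'
# ...but the wrapper's own code object still declares zero named
# parameters, so tools that read the code object directly see the
# uninformative (*args, **kwargs) signature.
assert wrapper.__code__.co_argcount == 0
```

(Later Python versions grew a `__wrapped__` convention that lets signature introspection tunnel through such wrappers, but at the time of this thread the signature was simply lost.)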

One of my hopes for PEP 362 was that I would be able to just add
__signature__ to the list of copied attributes, but that PEP is
currently short a champion to work through the process of resolving the
open issues and creating an up to date patch (Brett ended up with too
many things on his plate so he wasn't able to do it, and nobody else has
offered to take it over).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Mercurial?

2009-04-09 Thread Nick Coghlan
Martin v. Löwis wrote:
 Nick Coghlan wrote:
 Dirkjan Ochtman wrote:
 I have a stab at an author map at http://dirkjan.ochtman.nl/author-map.
 Could use some review, but it seems like a good start.
 Martin may be able to provide a better list of names based on the
 checkin name-SSH public key mapping in the SVN setup.
 
 I think the identification in the SSH keys is useless. It contains
 strings like loe...@mira or ncogh...@uberwald, or even multiple
 of them (ba...@wooz, ba...@resist, ...).

Ah, I forgot our SVN accounts weren't linked up to our email addresses.
I guess that means the existing list won't be as useful as I thought it
might be.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] decorator module in stdlib?

2009-04-09 Thread Michele Simionato
On Thu, Apr 9, 2009 at 2:11 PM, Nick Coghlan ncogh...@gmail.com wrote:
 One of my hopes for PEP 362 was that I would be able to just add
 __signature__ to the list of copied attributes, but that PEP is
 currently short a champion to work through the process of resolving the
 open issues and creating an up to date patch (Brett ended up with too
 many things on his plate so he wasn't able to do it, and nobody else has
 offered to take it over).

I am totally ignorant about the internals of Python and certainly cannot
take that role. But I would like to hear from Guido if he wants to support
a __signature__ object or if he does not care. In the first case
I think somebody will take the job, in the second case it is better to
reject the PEP and be done with it.


Re: [Python-Dev] Deprecating PyOS_ascii_formatd

2009-04-09 Thread Eric Smith

Nick Coghlan wrote:

Eric Smith wrote:

And as a reminder, the py3k-short-float-repr changes are on Rietveld at
http://codereview.appspot.com/33084/show. So far, no comments.



Looks like you were able to delete some fairly respectable chunks of
redundant code!


Wait until you see how much nasty code gets deleted when I can actually 
remove PyOS_ascii_formatd!


And thanks for your comments on Rietveld, especially catching the memory 
leak.


Eric.



Re: [Python-Dev] Adding new features to Python 2.x (PEP 382: Namespace Packages)

2009-04-09 Thread Aahz
On Thu, Apr 09, 2009, Nick Coghlan wrote:

 Martin v. L?wis wrote:
 Such a policy would then translate to a dead end for Python 2.x
 based applications.
 
 2.x based applications *are* in a dead end, with the only exit
 being portage to 3.x.
 
 The actual end of the dead end just happens to be in 2013 or so :)

More like 2016 or 2020 -- as of January, my former employer was still
using Python 2.3, and I wouldn't be surprised if 1.5.2 was still out in
the wilds.  The transition to 3.x is more extreme, and lots of people
will continue making do for years after any formal support is dropped.

Whether this warrants including PEP 382 in 2.x, I don't know; I still
don't really understand this proposal.
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

Why is this newsgroup different from all other newsgroups?


Re: [Python-Dev] decorator module in stdlib?

2009-04-09 Thread Nick Coghlan
Michele Simionato wrote:
 On Thu, Apr 9, 2009 at 2:11 PM, Nick Coghlan ncogh...@gmail.com wrote:
 One of my hopes for PEP 362 was that I would be able to just add
 __signature__ to the list of copied attributes, but that PEP is
 currently short a champion to work through the process of resolving the
 open issues and creating an up to date patch (Brett ended up with too
 many things on his plate so he wasn't able to do it, and nobody else has
 offered to take it over).
 
 I am totally ignorant about the internals of Python and certainly cannot
 take that role. But I would like to hear from Guido if he wants to support
 a __signature__ object or if he does not care. In the first case
 I think somebody will take the job, in the second case it is better to
 reject the PEP and be done with it.

I don't recall Guido being opposed when PEP 362 was first being
discussed (keeping in mind that was more than 2 years ago, so he's quite
entitled to have changed his mind in the meantime!).

That said, it's a sensible, largely straightforward idea, and by
creating the object lazily it doesn't even have to incur a runtime cost
in programs that don't do much introspection.

I think the main problem leading to the current lack of movement on the
PEP is that the existing inspect module is good enough for most
practical purposes (which are fairly rare in the first place), so this
isn't perceived as a huge gain even for the folks that are interested in
introspection.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Adding new features to Python 2.x (PEP 382: Namespace Packages)

2009-04-09 Thread Nick Coghlan
Aahz wrote:
 On Thu, Apr 09, 2009, Nick Coghlan wrote:
 Martin v. L?wis wrote:
 Such a policy would then translate to a dead end for Python 2.x
 based applications.
 2.x based applications *are* in a dead end, with the only exit
 being portage to 3.x.
 The actual end of the dead end just happens to be in 2013 or so :)
 
 More like 2016 or 2020 -- as of January, my former employer was still
 using Python 2.3, and I wouldn't be surprised if 1.5.2 was still out in
 the wilds.

Indeed - I know of a system that will finally be migrating from Python
2.2 to Python *2.4* later this year :)

  The transition to 3.x is more extreme, and lots of people
 will continue making do for years after any formal support is dropped.

Yeah, I was only referring to the likely minimum time frame that
python-dev would continue providing security releases. As you say, the
actual 2.x version of the language will live on long after the day we
close all remaining 2.x only bug reports and patches as out of date.

 Whether this warrants including PEP 382 in 2.x, I don't know; I still
 don't really understand this proposal.

I'd personally still prefer to keep the guideline that new features that
are easy to backport *should* be backported, but that's really a
decision for the authors of each new feature.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


[Python-Dev] py3k build erroring out on fileio?

2009-04-09 Thread Jeroen Ruigrok van der Werven
Just to make sure I am not doing something silly, with a configure line as
such: ./configure --prefix=/home/asmodai/local --with-wide-unicode
--with-pymalloc --with-threads --with-computed-gotos, would there be any
reason why I am getting the following error with both BSD make and gmake:

make: don't know how to make ./Modules/_fileio.c. Stop

[Will log an issue if it turns out to, indeed, be a problem with the tree
and not me.]

-- 
Jeroen Ruigrok van der Werven asmodai(-at-)in-nomine.org / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Forgive us our trespasses, as we forgive those that trespass against us...


Re: [Python-Dev] py3k build erroring out on fileio?

2009-04-09 Thread Benjamin Peterson
2009/4/9 Jeroen Ruigrok van der Werven asmo...@in-nomine.org:
 Just to make sure I am not doing something silly, with a configure line as
 such: ./configure --prefix=/home/asmodai/local --with-wide-unicode
 --with-pymalloc --with-threads --with-computed-gotos, would there be any
 reason why I am getting the following error with both BSD make and gmake:

 make: don't know how to make ./Modules/_fileio.c. Stop

 [Will log an issue if it turns out to, indeed, be a problem with the tree
 and not me.]

It seems your Makefile is outdated. We moved the _fileio.c module
around a few days ago, so maybe you just need a "make distclean".



-- 
Regards,
Benjamin


Re: [Python-Dev] py3k build erroring out on fileio?

2009-04-09 Thread Jeroen Ruigrok van der Werven
-On [20090409 15:41], Benjamin Peterson (benja...@python.org) wrote:
It seems your Makefile is outdated. We moved the _fileio.c module
around a few days ago, so maybe you just need a "make distclean".

Yes, that was the cause. Thanks Benjamin.

-- 
Jeroen Ruigrok van der Werven asmodai(-at-)in-nomine.org / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
You yourself, as much as anybody in the entire universe, deserve your love
and affection...


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Bill Janssen
Barry Warsaw ba...@python.org wrote:

 Anyway, aside from that decision, I haven't come up with an  
 elegant way to allow /output/ in both bytes and strings (input is I  
 think theoretically easier by sniffing the arguments).

Probably a good thing.  It just promotes more confusion to do things
that way, IMO.

Bill


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Aahz
On Thu, Apr 09, 2009, John Arbash Meinel wrote:

 PS I'm not yet subscribed to python-dev, so if you could make sure to
 CC me in replies, I would appreciate it.

Please do subscribe to python-dev ASAP; I also suggest that you subscribe
to python-ideas, because I suspect that this is sufficiently blue-sky to
start there.

As always, this is the kind of thing where code trumps gedanken, so you
shouldn't expect much activity unless either you are willing to make at
least initial attempts at trying out your ideas or someone else just
happens to find it interesting.  In general, the core Python
implementation strives for simplicity, so there's already some built-in
pushback.
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

Why is this newsgroup different from all other newsgroups?


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Dirkjan Ochtman
On Thu, Apr 9, 2009 at 17:31, Aahz a...@pythoncraft.com wrote:
 Please do subscribe to python-dev ASAP; I also suggest that you subscribe
 to python-ideas, because I suspect that this is sufficiently blue-sky to
 start there.

It might also be interesting to the unladen-swallow guys.

Cheers,

Dirkjan


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Daniel Stutzbach
On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw ba...@python.org wrote:

 Anyway, aside from that decision, I haven't come up with an elegant way to
 allow /output/ in both bytes and strings (input is I think theoretically
 easier by sniffing the arguments).


Won't this work? (assuming dumps() always returns a string)

def dumpb(obj, encoding='utf-8', *args, **kw):
    s = dumps(obj, *args, **kw)
    return s.encode(encoding)
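Wired up to the stdlib json module, the wrapper round-trips as expected (a runnable sketch; `dumpb` is Daniel's proposed spelling, not an existing API):

```python
import json

def dumpb(obj, encoding='utf-8', *args, **kw):
    # Encode the str result of the existing serializer into bytes.
    s = json.dumps(obj, *args, **kw)
    return s.encode(encoding)

payload = dumpb({'key': 'value'})
assert isinstance(payload, bytes)
assert json.loads(payload.decode('utf-8')) == {'key': 'value'}
```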

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC http://stutzbachenterprises.com


[Python-Dev] email package Bytes vs Unicode (was Re: Dropping bytes support in json)

2009-04-09 Thread Tony Nelson
(email-sig added)

At 08:07 -0400 04/09/2009, Steve Holden wrote:
Barry Warsaw wrote:
 ...
 This is an interesting question, and something I'm struggling with for
 the email package for 3.x.  It turns out to be pretty convenient to have
 both a bytes and a string API, both for input and output, but I think
 email really wants to be represented internally as bytes.  Maybe.  Or
 maybe just for content bodies and not headers, or maybe both.  Anyway,
 aside from that decision, I haven't come up with an elegant way to allow
 /output/ in both bytes and strings (input is I think theoretically
 easier by sniffing the arguments).

The real problem I came across in storing email in a relational database
was the inability to store messages as Unicode. Some messages have a
body in one encoding and an attachment in another, so the only ways to
store the messages are either as a monolithic bytes string that gets
parsed when the individual components are required or as a sequence of
components in the database's preferred encoding (if you want to keep the
original encoding most relational databases won't be able to help unless
you store the components as bytes).
 ...

I found it confusing myself, and did it wrong for a while.  Now, I
understand that messages come over the wire as bytes, either 7-bit US-ASCII
or 8-bit whatever, and are parsed at the receiver.  I think of the database
as a wire to the future, and store the data as bytes (a BLOB), letting the
future receiver parse them as it did the first time, when I cleaned the
message.  Data I care to query is extracted into fields (in UTF-8, what I
usually use for char fields).  I have no need to store messages as Unicode,
and they aren't Unicode anyway.  I have no need ever to flatten a message
to Unicode, only to US-ASCII or, for messages (spam) that are corrupt, raw
8-bit data.

If you need the data from the message, by all means extract it and store it
in whatever form is useful to the purpose of the database.  If you need the
entire message, store it intact in the database, as the bytes it is.  Email
isn't Unicode any more than a JPEG or other image types (often payloads in
a message) are Unicode.
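A minimal sketch of that scheme with sqlite3 (schema and column names are illustrative only): the raw wire bytes go into a BLOB column untouched, and anything worth querying is extracted into ordinary text columns.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE mail (id INTEGER PRIMARY KEY, subject TEXT, raw BLOB)')

# The message stays exactly as it arrived on the wire: bytes.
raw_message = b'Subject: hello\r\n\r\nbody in any encoding\r\n'
conn.execute('INSERT INTO mail (subject, raw) VALUES (?, ?)',
             ('hello', raw_message))

# The "future receiver" gets back the identical bytes to re-parse,
# while the extracted subject column is queryable text.
(stored,) = conn.execute('SELECT raw FROM mail').fetchone()
assert stored == raw_message
```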
-- 

TonyN.:'   mailto:tonynel...@georgeanelson.com
  '  http://www.georgeanelson.com/


Re: [Python-Dev] email package Bytes vs Unicode (was Re: Dropping bytes support in json)

2009-04-09 Thread Steve Holden
Tony Nelson wrote:
 (email-sig added)
 
 At 08:07 -0400 04/09/2009, Steve Holden wrote:
 Barry Warsaw wrote:
  ...
 This is an interesting question, and something I'm struggling with for
 the email package for 3.x.  It turns out to be pretty convenient to have
 both a bytes and a string API, both for input and output, but I think
 email really wants to be represented internally as bytes.  Maybe.  Or
 maybe just for content bodies and not headers, or maybe both.  Anyway,
 aside from that decision, I haven't come up with an elegant way to allow
 /output/ in both bytes and strings (input is I think theoretically
 easier by sniffing the arguments).

 The real problem I came across in storing email in a relational database
 was the inability to store messages as Unicode. Some messages have a
 body in one encoding and an attachment in another, so the only ways to
 store the messages are either as a monolithic bytes string that gets
 parsed when the individual components are required or as a sequence of
 components in the database's preferred encoding (if you want to keep the
 original encoding most relational databases won't be able to help unless
 you store the components as bytes).
  ...
 
 I found it confusing myself, and did it wrong for a while.  Now, I
 understand that messages come over the wire as bytes, either 7-bit US-ASCII
 or 8-bit whatever, and are parsed at the receiver.  I think of the database
 as a wire to the future, and store the data as bytes (a BLOB), letting the
 future receiver parse them as it did the first time, when I cleaned the
 message.  Data I care to query is extracted into fields (in UTF-8, what I
 usually use for char fields).  I have no need to store messages as Unicode,
 and they aren't Unicode anyway.  I have no need ever to flatten a message
 to Unicode, only to US-ASCII or, for messages (spam) that are corrupt, raw
 8-bit data.
 
 If you need the data from the message, by all means extract it and store it
 in whatever form is useful to the purpose of the database.  If you need the
 entire message, store it intact in the database, as the bytes it is.  Email
 isn't Unicode any more than a JPEG or other image types (often payloads in
 a message) are Unicode.

This is all great, and I did quite quickly realize that the best
approach was to store the mails in their network byte-stream format as
bytes. The approach was negated in my own case because of PostgreSQL's
execrable BLOB-handling capabilities. I took a look at the escaping they
required, snorted with derision and gave it up as a bad job.

PostgreSQL strongly encourages you to store text as encoded columns.
Because emails lack an encoding it turns out this is a most inconvenient
storage type for it. Sadly BLOBs are such a pain in PostgreSQL that it's
easier to store the messages in external files and just use the
relational database to index those files to retrieve content, so that's
what I ended up doing.

regards
 Steve


-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Watch PyCon on video now!  http://pycon.blip.tv/

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Collin Winter
Hi John,

On Thu, Apr 9, 2009 at 8:02 AM, John Arbash Meinel
j...@arbash-meinel.com wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 I've been doing some memory profiling of my application, and I've found
 some interesting results with how intern() works. I was pretty surprised
 to see that the interned dict was actually consuming a significant
 amount of total memory.
 To give the specific values, after doing:
  bzr branch A B
 of a small project, the total memory consumption is ~21MB

[snip]

 Anyway, I think the internals of intern() could be done a bit better. Here are
 some concrete things:

[snip]

Memory usage is definitely something we're interested in improving.
Since you've already looked at this in some detail, could you try
implementing one or two of your ideas and see if it makes a difference
in memory consumption? Changing from a dict to a set looks promising,
and should be a fairly self-contained way of starting on this. If it
works, please post the patch on http://bugs.python.org with your
results and assign it to me for review.

Thanks,
Collin Winter


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread John Arbash Meinel
...

 Anyway, I think the internals of intern() could be done a bit better. Here are
 some concrete things:
 

 [snip]

 Memory usage is definitely something we're interested in improving.
 Since you've already looked at this in some detail, could you try
 implementing one or two of your ideas and see if it makes a difference
 in memory consumption? Changing from a dict to a set looks promising,
 and should be a fairly self-contained way of starting on this. If it
 works, please post the patch on http://bugs.python.org with your
 results and assign it to me for review.

 Thanks,
 Collin Winter
   
(I did end up subscribing, just with a different email address :)

What is the best branch to start working from? trunk?

John
=:-



Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Collin Winter
On Thu, Apr 9, 2009 at 9:34 AM, John Arbash Meinel
john.arbash.mei...@gmail.com wrote:
 ...

 Anyway, I think the internals of intern() could be done a bit better. Here are
 some concrete things:


 [snip]

 Memory usage is definitely something we're interested in improving.
 Since you've already looked at this in some detail, could you try
 implementing one or two of your ideas and see if it makes a difference
 in memory consumption? Changing from a dict to a set looks promising,
 and should be a fairly self-contained way of starting on this. If it
 works, please post the patch on http://bugs.python.org with your
 results and assign it to me for review.

 Thanks,
 Collin Winter

 (I did end up subscribing, just with a different email address :)

 What is the best branch to start working from? trunk?

That's a good place to start, yes. If the idea works well, we'll want
to port it to the py3k branch, too, but that can wait.

Collin


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Christian Heimes
John Arbash Meinel wrote:
 When I looked at the actual references from interned, I saw mostly
 variable names. Considering that every variable goes through the python
 intern dict. And when you look at the intern function, it doesn't use
 setdefault logic, it actually does a get() followed by a set(), which
 means the cost of interning is 1-2 lookups depending on likelihood, etc.
 (I saw a whole lot of strings as the error codes in win32all /
 winerror.py, and windows error codes tend to be longer-than-average
 variable length.)

I've read your posting twice but I'm still not sure if you are aware of
the most important feature of interned strings. In the first place,
interning is not about saving some bytes of memory but a speed
optimization. Interned strings can be compared with a simple and fast
pointer comparison. With interned strings you can simply write:

char *a, *b;
if (a == b) {
...
}

Instead of:

char *a, *b;
if (strcmp(a, b) == 0) {
...
}

A compiler can optimize the pointer comparison much better than a
function call.
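
To see the same effect at the Python level, here is a small sketch using
sys.intern (the Python 3 spelling of the intern() builtin discussed in
this thread); the identity behavior shown is CPython-specific:

```python
import sys

a = "interned_example"                   # identifier-like literal
b = "".join(["interned_", "example"])    # equal value, built at runtime

# Plain equality must scan characters; in CPython the two are distinct objects.
assert a == b and a is not b

a = sys.intern(a)
b = sys.intern(b)
assert a is b                            # interned: a single pointer comparison
```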

 Anyway, I think the internals of intern() could be done a bit better. Here are
 some concrete things:
 
   a) Don't keep a double reference to both key and value to the same
  object (1 pointer per entry), this could be as simple as using a
  Set() instead of a dict()
 
   b) Don't cache the hash key in the set, as strings already cache them.
  (1 long per entry). This is a big win for space, but would need to
  be balanced against lookup and collision resolving speed.
 
  My guess is that reducing the size of the set will actually improve
  speed more, because more items can fit in cache. It depends on how
  many times you need to resolve a collision. If the string hash is
  sufficiently spread out, and the load factor is reasonable, then
  likely when you actually find an item in the set, it will be the
  item you want, and you'll need to bring the string object into
  cache anyway, so that you can do a string comparison (rather than
  just a hash comparison.)
 
  c) Use the existing lookup function one time. (PySet->lookup())
  Sets already have a lookup which is optimized for strings, and
  returns a pointer to where the object would go if it exists. Which
  means the intern() function can do a single lookup resolving any
  collisions, and return the object or insert without doing a second
  lookup.
 
   d) Having a special structure might also allow for separate optimizing
  of things like 'default size', 'grow rate', 'load factor', etc. A
  lot of this could be tuned specifically knowing that we really only
  have 1 of these objects, and it is going to be pointing at a lot of
 strings that are < 50 bytes long.
 
  If hashes of variable name strings are well distributed, we could
  probably get away with a load factor of 2. If we know we are likely
  to have lots and lots that never go away (you rarely *unload*
  modules, and all variable names are in the intern dict), that would
  suggest having a large initial size, and probably a wide growth
  factor to avoid spending a lot of time resizing the set.

I agree that a dict is not the most memory efficient data structure for
interned strings. However dicts are extremely well tested and highly
optimized. Any specialized data structure needs to be designed and
tested very carefully. If you happen to break the interning system it's
going to lead to rather nasty and hard to debug problems.

   e) How tuned is String.hash() for the fact that most of these strings
  are going to be ascii text? (I know that python wants to support
  non-ascii variable names, but I still think there is going to be an
  overwhelming bias towards characters in the range 65-122 ('A'-'z').

Python 3.0 uses unicode for all names. You have to design something that
can be adapted to unicode, too. By the way, do you know that dicts have
an optimized lookup function for strings? It's called lookdict_unicode /
 lookdict_string.

 Also note that the performance of the interned dict gets even worse on
 64-bit platforms. Where the size of a 'dictentry' doubles, but the
 average length of a variable name wouldn't change.
 
 Anyway, I would be happy to implement something along the lines of a
 StringSet, or maybe the InternSet, etc. I just wanted to check if
 people would be interested or not.

Since interning is mostly used in the core and extension modules you
might want to experiment with a different growth rate. The interning
data structure could start with a larger initial size and have a slower,
non-progressive growth rate.

Christian


Re: [Python-Dev] email package Bytes vs Unicode (was Re: Dropping bytes support in json)

2009-04-09 Thread Tony Nelson
(email-sig dropped, as I didn't see Steve Holden's message there)

At 12:20 -0400 04/09/2009, Steve Holden wrote:
Tony Nelson wrote:
 ...
 If you need the data from the message, by all means extract it and store it
 in whatever form is useful to the purpose of the database.  If you need the
 entire message, store it intact in the database, as the bytes it is.  Email
 isn't Unicode any more than a JPEG or other image types (often payloads in
 a message) are Unicode.

This is all great, and I did quite quickly realize that the best
approach was to store the mails in their network byte-stream format as
bytes. The approach was negated in my own case because of PostgreSQL's
execrable BLOB-handling capabilities. I took a look at the escaping they
required, snorted with derision and gave it up as a bad job.
 ...

I use MySQL, but sort of intend to learn PostgreSQL.  I didn't know that
PostgreSQL has no real support for BLOBs.  I agree that having to import
them from a file is awful.  Also, there appears to be a severe limit on the
size of character data fields, so storing in Base64 is out.  About the only
thing to do then is to use external storage for the BLOBs.

Still, email seems to demand such binary storage, whether all databases
provide it or not.
-- 

TonyN.:'   mailto:tonynel...@georgeanelson.com
  '  http://www.georgeanelson.com/


Re: [Python-Dev] BLOBs in Pg (was: email package Bytes vs Unicode)

2009-04-09 Thread Oleg Broytmann
On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote:
 I use MySQL, but sort of intend to learn PostgreSQL.  I didn't know that
 PostgreSQL has no real support for BLOBs.

   I think it has - BYTEA data type.

Oleg.
-- 
 Oleg Broytmann   http://phd.pp.ru/   p...@phd.pp.ru
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread John Arbash Meinel
Christian Heimes wrote:
 John Arbash Meinel wrote:
 When I looked at the actual references from interned, I saw mostly
 variable names. Considering that every variable goes through the python
 intern dict. And when you look at the intern function, it doesn't use
 setdefault logic, it actually does a get() followed by a set(), which
 means the cost of interning is 1-2 lookups depending on likelihood, etc.
 (I saw a whole lot of strings as the error codes in win32all /
 winerror.py, and windows error codes tend to be longer-than-average
 variable length.)
 
 I've read your posting twice but I'm still not sure if you are aware of
 the most important feature of interned strings. In the first place,
 interning is not about saving some bytes of memory but a speed
 optimization. Interned strings can be compared with a simple and fast
 pointer comparison. With interned strings you can simply write:
 
 char *a, *b;
 if (a == b) {
 ...
 }
 
 Instead of:
 
 char *a, *b;
 if (strcmp(a, b) == 0) {
 ...
 }
 
 A compiler can optimize the pointer comparison much better than a
 function call.
 

Certainly. But there is a cost associated with calling intern() in the
first place. You created a string, and you are now trying to de-dup it.
That cost is both in the memory to track all strings interned so far,
and the cost to do a dict lookup. And the way intern is currently
written, there is a third cost when the item doesn't exist yet, which is
another lookup to insert the object.
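
As an illustration only (the real intern() lives in C), the two access
patterns can be modeled with a plain dict standing in for the intern
table; dict.setdefault shows the single-pass alternative:

```python
# Two-lookup interning, mirroring the get()-then-set() pattern described above:
def intern_two_lookups(table, s):
    cached = table.get(s)        # first hash + probe
    if cached is not None:
        return cached
    table[s] = s                 # on a miss: a second hash + probe to insert
    return s

# Single-pass variant: setdefault probes once and inserts on a miss.
def intern_one_lookup(table, s):
    return table.setdefault(s, s)

table = {}
a = intern_one_lookup(table, "".join(["na", "me"]))
b = intern_one_lookup(table, "".join(["na", "me"]))
assert a is b and len(table) == 1   # second call returned the cached object
```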

I'll also note that increasing memory does have a semi-direct effect on
performance, because more memory requires more time to bring memory back
and forth from main memory to CPU caches.

...

 I agree that a dict is not the most memory efficient data structure for
 interned strings. However dicts are extremely well tested and highly
 optimized. Any specialized data structure needs to be designed and
 tested very carefully. If you happen to break the interning system it's
 going to lead to rather nasty and hard to debug problems.

Sure. My plan was to basically take the existing Set/Dict design, and
just tweak it slightly for the expected operations of interned.

 
   e) How tuned is String.hash() for the fact that most of these strings
  are going to be ascii text? (I know that python wants to support
  non-ascii variable names, but I still think there is going to be an
  overwhelming bias towards characters in the range 65-122 ('A'-'z').
 
 Python 3.0 uses unicode for all names. You have to design something that
 can be adapted to unicode, too. By the way, do you know that dicts have
 an optimized lookup function for strings? It's called lookdict_unicode /
  lookdict_string.

Sure, but so does PySet. I'm not sure about lookset_unicode, but I would
guess that exists or should exist for py3k.

 
 Also note that the performance of the interned dict gets even worse on
 64-bit platforms. Where the size of a 'dictentry' doubles, but the
 average length of a variable name wouldn't change.

 Anyway, I would be happy to implement something along the lines of a
 StringSet, or maybe the InternSet, etc. I just wanted to check if
 people would be interested or not.
 
 Since interning is mostly used in the core and extension modules you
 might want to experiment with a different growth rate. The interning
 data structure could start with a larger initial size and have a slower,
 non-progressive growth rate.
 
 Christian

I'll also mention that there are other uses for intern() where it is
uniquely suitable. Namely, if you are parsing lots of text with
redundant strings, it is a way to decrease total memory consumption.
(And potentially speed up future comparisons, etc.)

The main reason why intern() is useful for this is that it doesn't make
strings immortal, as would happen with some other structure, because the
string objects themselves know whether they have been interned.

The options for a 3rd-party structure fall down into something like:

1) A cache that makes the strings immortal. (IIRC this is what older
versions of Python did.)

2) A cache that is periodically walked to see if any of the objects are
no longer externally referenced. The main problem here is that walking
is O(all-objects), whereas doing the checking at refcount=0 time means
you only check objects when you think the last reference has gone away.

3) Hijacking PyStringType->dealloc, so that when the refcount goes to 0
and Python wants to destroy the string, you then trigger your own cache
to look and see if it should remove the object.

Even further, you either have to check on every string dealloc, or
re-use PyStringObject->ob_sstate to track that you have placed this
string into your custom structure. Which would preclude ever calling
intern() on this string, because intern() doesn't just check a couple
bits, it looks at the entire ob_sstate value.
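
A pure-Python sketch of the goal behind options 2 and 3, a cache that
does not make its members immortal, can be built on
weakref.WeakValueDictionary. The wrapper class is an artifact of this
sketch (CPython strings cannot be weakly referenced); the real proposal
hooks dealloc instead:

```python
import gc
import weakref

class WrappedName:
    # str objects can't carry weak references in CPython, so wrap them;
    # a C implementation would hook the string type's dealloc instead.
    __slots__ = ("value", "__weakref__")
    def __init__(self, value):
        self.value = value

class InternCache:
    """Entries disappear when the last outside reference dies, so the
    cache does not keep its members alive artificially."""
    def __init__(self):
        self._table = weakref.WeakValueDictionary()

    def intern(self, s):
        obj = self._table.get(s)
        if obj is None:
            obj = WrappedName(s)
            self._table[s] = obj
        return obj

cache = InternCache()
a = cache.intern("zap")
assert cache.intern("zap") is a     # de-duplicated while still referenced
del a
gc.collect()
assert len(cache._table) == 0       # entry collected once unreferenced
```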

I think you could make it work, such that if your custom cache had set
some values, then intern() would just 

Re: [Python-Dev] BLOBs in Pg

2009-04-09 Thread Steve Holden
Oleg Broytmann wrote:
 On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote:
 I use MySQL, but sort of intend to learn PostgreSQL.  I didn't know that
 PostgreSQL has no real support for BLOBs.
 
I think it has - BYTEA data type.
 
But the Python DB adapters appear to require some fairly hairy escaping
of the data to make it usable with the cursor execute() method. IMHO you
shouldn't have to escape data that is passed for insertion via a
parameterized query.

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Watch PyCon on video now!  http://pycon.blip.tv/



Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread John Arbash Meinel
Alexander Belopolsky wrote:
 On Thu, Apr 9, 2009 at 11:02 AM, John Arbash Meinel
 j...@arbash-meinel.com wrote:
 ...
  a) Don't keep a double reference to both key and value to the same
 object (1 pointer per entry), this could be as simple as using a
 Set() instead of a dict()

 
 There is a rejected patch implementing just that:
 http://bugs.python.org/issue1507011 .
 

Thanks for the heads up.


So reading that thread, the final reason it was rejected was 2 part:

  Without reviewing the patch again, I also doubt it is capable of
  getting rid of the reference count cheating: essentially, this
  cheating enables the interning dictionary to have weak references to
  strings, this is important to allow automatic collection of certain
  interned strings. This feature needs to be preserved, so the cheating
  in the reference count must continue.

That specific argument was invalid, because the patch just changed the
refcount trickery to use +/- 1. And I'm pretty sure Alexander's argument
was just that +/- 2 was weird, not that the weakref behavior was bad.

The other argument against the patch was based on the idea that:
  The operation "give me the member equal but not identical to E" is
  conceptually a lookup operation; the mathematical set construct has no
  such operation, and the Python set models it closely. IOW, set is
  *not* a dict with key==value.


I don't know if there was any consensus reached on this, since only
Martin responded this way.


I can say that, for doing some work with a medium-size code base, the
overhead of interned as a dictionary was 1.5MB out of 20MB total memory.

Simply changing it to a Set would drop this to 1.0MB. I have no proof
about the impact on performance, since I haven't benchmarked it yet.

Changing it to a StringSet could further drop it to 0.5MB. I would guess
that any performance impact would depend on whether the total size of
'interned' would fit inside L2 cache or not.
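
The 1.5MB / 1.0MB / 0.5MB figures follow directly from the per-entry
layouts; a back-of-the-envelope sketch, assuming the 32-bit build under
discussion (4-byte words) and a hypothetical slot count chosen only to
reproduce the ratio:

```python
WORD = 4           # bytes per pointer/long on a 32-bit build
SLOTS = 131072     # hypothetical table size; only the 3:2:1 ratio matters

layouts = {
    "dict entry (hash, key, value)": 3 * WORD,
    "set entry (hash, key)":         2 * WORD,
    "StringSet entry (key only)":    1 * WORD,
}
sizes_kib = {name: SLOTS * entry // 1024 for name, entry in layouts.items()}
print(sizes_kib)   # 1536, 1024, 512 KiB -- the 1.5 / 1.0 / 0.5 MB ratio
```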


There is a small bug in the original patch: if adding the string to the
set failed, it would return with t == NULL, which satisfies t != s, and
the intern-in-place would end up setting your pointer to NULL rather
than doing nothing and clearing the error code.


So I guess some of it comes down to whether loewis would also reject
this change on the basis that mathematically a set is not a dict.
Though his claim that nobody else is speaking in favor of the patch
no longer holds, since at least Collin Winter has expressed some
interest at this point.

John
=:-


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Martin v. Löwis
 This is an interesting question, and something I'm struggling with for
 the email package for 3.x.  It turns out to be pretty convenient to have
 both a bytes and a string API, both for input and output, but I think
 email really wants to be represented internally as bytes.  Maybe.  Or
 maybe just for content bodies and not headers, or maybe both.  Anyway,
 aside from that decision, I haven't come up with an elegant way to allow
 /output/ in both bytes and strings (input is I think theoretically
 easier by sniffing the arguments).

If you allow for content-transfer-encoding: 8bit, I think there is just
no way to represent email as text. You have to accept conversion to,
say, base64 (or quoted-unreadable) when converting an email message to
text.
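
A minimal sketch of the problem, with made-up Latin-1 body bytes
standing in for an 8bit-encoded part:

```python
import base64

# Hypothetical raw body of a part declaring Content-Transfer-Encoding: 8bit;
# the Latin-1 byte 0xE9 ("é") is not valid UTF-8.
body = b"caf\xe9 au lait\n"

try:
    text = body.decode("utf-8")
except UnicodeDecodeError:
    # No faithful text form exists without re-encoding; switching the
    # transfer encoding to base64 is one way to get an ASCII-safe text form.
    text = base64.b64encode(body).decode("ascii")

# The round trip recovers the original bytes exactly.
assert base64.b64decode(text) == body
```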

Regards,
Martin


Re: [Python-Dev] BLOBs in Pg (was: email package Bytes vs Unicode)

2009-04-09 Thread Tony Nelson
At 21:24 +0400 04/09/2009, Oleg Broytmann wrote:
On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote:
 I use MySQL, but sort of intend to learn PostgreSQL.  I didn't know that
 PostgreSQL has no real support for BLOBs.

   I think it has - BYTEA data type.

So it does; I see that now that I've opened up the PostgreSQL docs.  I
don't find escaping data to be a problem -- I do it for all untrusted data.

So, after all, there isn't an example of a database that makes storing
email and other such byte-oriented data onerous, and Python's email
package has no need for workarounds in that area.
-- 

TonyN.:'   mailto:tonynel...@georgeanelson.com
  '  http://www.georgeanelson.com/


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Martin v. Löwis
 So I guess some of it comes down to whether loweis would also reject
 this change on the basis that mathematically a set is not a dict.

I'd like to point out that this was not the reason to reject it.
Instead, this (or, the opposite of it) was given as a reason why this
patch should be accepted (in msg50482). I found that a weak rationale
for making that change, in particular because I think the rationale
is incorrect.

I like your rationale (save memory) much more, and was asking in the
tracker for specific numbers, which weren't forthcoming.

 Though his claim that nobody else is speaking in favor of the patch
 no longer holds, since at least Collin Winter has expressed some
 interest at this point.

Again, at that point in the tracker, none of the other committers had
spoken in favor of the patch. Since I wasn't convinced of its
correctness, and nobody else (whom I trust) had reviewed it as correct,
I rejected it.

Now that you brought up specific numbers, I tried to verify them,
and found them correct (although a bit unfortunate), please see my
test script below. Up to 21800 interned strings, the dict takes (only)
384kiB. It then grows, requiring 1536kiB. Whether or not having 22k
interned strings is typical, I still don't know.
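
These numbers are consistent with a 32-bit dict whose 12-byte entries
(hash, key, value) sit in a table that quadruples from 2**15 to 2**17
slots; a sketch of the arithmetic (the slot counts are inferred here,
not stated in the original measurement):

```python
ENTRY = 12                  # bytes: hash + key + value, three 4-byte words
before = 2 ** 15 * ENTRY    # 32768 slots
after = 2 ** 17 * ENTRY     # 131072 slots after a 4x resize
print(before // 1024, after // 1024)   # 384 1536, matching the KiB figures

# The resize triggers near 2/3 load: 32768 * 2 // 3 == 21845, just past
# the ~21800 interned strings observed.
assert 21800 < 32768 * 2 // 3
```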

Wrt. your proposed change, I would be worried about maintainability,
in particular if it would copy parts of the set implementation.

Regards,
Martin

import gc, sys

def find_interned_dict():
    cand = None
    for o in gc.get_objects():
        if not isinstance(o, dict):
            continue
        if "find_interned_dict" not in o:
            continue
        for k, v in o.iteritems():
            if k is not v:
                break
        else:
            assert not cand
            cand = o
    return cand

d = find_interned_dict()
print len(d), sys.getsizeof(d)

l = []
for i in range(2):
    if i % 100 == 0:
        print len(d), sys.getsizeof(d)
    l.append(intern(repr(i)))


Re: [Python-Dev] calling dictresize outside dictobject.c

2009-04-09 Thread Benjamin Peterson
Hi Dan,
Thanks for your interest.

2009/4/6 Dan Schult dsch...@colgate.edu:
 Hi,
 I'm trying to write a C extension which is a subclass of dict.
 I want to do something like a setdefault() but with a single lookup.

 Looking through the dictobject code, the three workhorse
 routines lookdict, insertdict and dictresize are not available
 directly for functions outside dictobject.c,
 but I can get at lookdict through dict->ma_lookup().

 So I use lookdict to get the PyDictEntry (call it ep) I'm looking for.
 The comments for lookdict say ep is ready to be set... so I do that.
 Then I check whether the dict needs to be resized--following the
 nice example of PyDict_SetItem.  But I can't call dictresize to finish
 off the process.

 Should I be using PyDict_SetItem directly?  No... it does its own lookup.
 I don't want a second lookup!   I already know which entry will be filled.

 So then I look at the code for setdefault and it also does
 a double lookup for checking and setting an entry.

 What subtle issue am I missing?
 Why does setdefault do a double lookup?
 More globally, why isn't dictresize available through the C-API?

Because it's not useful outside the intimate implementation details of
dictobject.c


 If there isn't a reason to do a double lookup I have a patch for setdefault,
 but I thought I should ask here first.

Raymond tells me the cost of the second lookup is negligible because
of caching, but PyObject_Hash needn't be called twice. He's
working on a patch later today.


-- 
Regards,
Benjamin


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Alexandre Vassalotti
On Thu, Apr 9, 2009 at 1:15 AM, Antoine Pitrou solip...@pitrou.net wrote:
 As for reading/writing bytes over the wire, JSON is often used in the same
 context as HTML: you are supposed to know the charset and decode/encode the
 payload using that charset. However, the RFC specifies a default encoding of
 utf-8. (*)


 (*) http://www.ietf.org/rfc/rfc4627.txt


That is one short and sweet RFC. :-)

 The RFC also specifies a discrimination algorithm for non-supersets of ASCII
 (“Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.”), but it is not
 implemented in the json module:


Given the RFC specifies that the encoding used should be one of the
encodings defined by Unicode, wouldn't it be a better idea to remove the
unicode support, instead? To me, it would make sense to use the
detection algorithms for Unicode to sniff the encoding of the JSON
stream and then use the detected encoding to decode the strings embedded
in the JSON stream.
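
The detection rule quoted from RFC 4627 can be sketched directly from
the pattern of NUL bytes in the first four octets (an illustration of
the algorithm, not the json module's code):

```python
def detect_json_encoding(data):
    """Guess the Unicode encoding of a JSON text per RFC 4627 section 3.

    The first two characters of a JSON text are always ASCII, so the
    positions of NUL bytes in the first four octets identify UTF-16 and
    UTF-32 variants; anything else defaults to UTF-8.
    """
    if len(data) < 4:
        return "utf-8"
    nulls = tuple(b == 0 for b in data[:4])
    if nulls == (True, True, True, False):
        return "utf-32-be"   # 00 00 00 xx
    if nulls == (True, False, True, False):
        return "utf-16-be"   # 00 xx 00 xx
    if nulls == (False, True, True, True):
        return "utf-32-le"   # xx 00 00 00
    if nulls == (False, True, False, True):
        return "utf-16-le"   # xx 00 xx 00
    return "utf-8"

doc = '["hi"]'
for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    raw = doc.encode(enc)
    assert raw.decode(detect_json_encoding(raw)) == doc
```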

Cheers,
-- Alexandre


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread John Arbash Meinel

...

 I like your rationale (save memory) much more, and was asking in the
 tracker for specific numbers, which weren't forthcoming.
 

...

 Now that you brought up a specific numbers, I tried to verify them,
 and found them correct (although a bit unfortunate), please see my
 test script below. Up to 21800 interned strings, the dict takes (only)
 384kiB. It then grows, requiring 1536kiB. Whether or not having 22k
 interned strings is typical, I still don't know.

Given that every variable name in any file is interned, it can grow
pretty rapidly. As an extreme case, consider the file
win32/lib/winerror.py which tracks all possible win32 errors.

>>> import winerror
>>> print len(winerror.__dict__)
1872

So a single error file has 1.9k strings.

My python version (2.5.2) doesn't have 'sys.getsizeof()', but otherwise
your code looks correct.

If all I do is find the interned dict, I see:
>>> print len(d)
5037

So stock python, without importing much extra (just os, sys, gc, etc.)
has almost 5k strings already.

I don't have a great regex yet for just extracting how many unique
strings there are in a given bit of source code.

However, if I do:

import gc, sys

def find_interned_dict():
    cand = None
    for o in gc.get_objects():
        if not isinstance(o, dict):
            continue
        if "find_interned_dict" not in o:
            continue
        for k, v in o.iteritems():
            if k is not v:
                break
        else:
            assert not cand
            cand = o
    return cand

d = find_interned_dict()
print len(d)

# Just import a few of the core structures
from bzrlib import branch, repository, workingtree, builtins
print len(d)

I start at 5k strings, and after just importing the important bits of
bzrlib, I'm at:
19,316

Now, the bzrlib source code isn't particularly huge. It is about 3.7MB /
91k lines of .py files (that is, without importing the test suite).

Memory consumption with just importing bzrlib shows up at 15MB, with
300kB taken up by the intern dict.

If I then import some extra bits of bzrlib, like http support, ftp
support, and sftp support (which brings in python's httplib, and
paramiko, and ssh/sftp implementation), I'm up to:
>>> print len(d)
25186

Memory has jumped to 23MB (interned is now 1.57MB), and I haven't
actually done anything but import python code yet. If I sum the size of
the PyString objects held in intern(), it amounts to 940KB, though they
refer to only 335KB of char data (an average of 13 bytes per string).

 
 Wrt. your proposed change, I would be worried about maintainability,
 in particular if it would copy parts of the set implementation.

Right, so in the first part, I would just use Set(), as it could then
save 1/3rd of the memory it uses today. (Dropping down to 1MB from 1.5MB.)

I don't have numbers on how much that would improve CPU times, I would
imagine improving 'intern()' would impact import times more than run
times, simply because import time is interning a *lot* of strings.

Though honestly, Bazaar would really like this, because startup overhead
for us is almost 400ms to 'do nothing', which is a lot for a command
line app.

John
=:-



Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Martin v. Löwis
 I can understand that you don't want to spend much time on it. How
 about removing it from 3.1? We could re-add it when long-term support
 becomes more likely.
 
 I'm speechless.

It seems that my statement has surprised you, so let me explain:

I think we should refrain from making design decisions (such as
API decisions) without Bob's explicit consent, unless we assign
a new maintainer for the simplejson module (perhaps just for the
3k branch, which perhaps would be a fork from Bob's code).

Antoine suggests that Bob did not comment on the issues at hand;
therefore, we should not proceed with the proposed design. Since
the 3.1 release is only a few weeks ahead, we have the choice of
either shipping with the broken version that is currently in the
3k branch, or drop the module from the 3k branch. I believe our
users are better served by not having to waste time with a module
that doesn't quite work, or may change.

Regards,
Martin


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Martin v. Löwis
 I don't have numbers on how much that would improve CPU times, I would
 imagine improving 'intern()' would impact import times more than run
 times, simply because import time is interning a *lot* of strings.
 
 Though honestly, Bazaar would really like this, because startup overhead
 for us is almost 400ms to 'do nothing', which is a lot for a command
 line app.

Maybe I misunderstand your proposed change: how could the representation
of the interning dict possibly change the runtime of interning? (let
alone significantly)

Regards,
Martin


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Martin v. Löwis
Alexandre Vassalotti wrote:
 On Thu, Apr 9, 2009 at 1:15 AM, Antoine Pitrou solip...@pitrou.net wrote:
 As for reading/writing bytes over the wire, JSON is often used in the same
 context as HTML: you are supposed to know the charset and decode/encode the
 payload using that charset. However, the RFC specifies a default encoding of
 utf-8. (*)


 (*) http://www.ietf.org/rfc/rfc4627.txt

 
 That is one short and sweet RFC. :-)

It is indeed well-specified. Unfortunately, it only talks about the
application/json type; the pre-existing other versions of json in MIME
types vary widely, such as text/plain (possibly with a charset=
parameter), text/json, or text/javascript. For these, the RFC doesn't
apply.

 Given the RFC specifies that the encoding used should be one of the
 encodings defined by Unicode, wouldn't be a better idea to remove the
 unicode support, instead? To me, it would make sense to use the
 detection algorithms for Unicode to sniff the encoding of the JSON
 stream and then use the detected encoding to decode the strings embed
 in the JSON stream.

That might be reasonable. (but then, I also stand by my view that we
shouldn't proceed without Bob's approval).

Regards,
Martin
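The null-pattern rule from RFC 4627 that Alexandre alludes to is small enough to sketch. This is a hypothetical helper written against modern Python 3, not anything the json module actually ships:

```python
def detect_json_encoding(head):
    # Guess the Unicode encoding of a JSON text from its first four
    # octets (RFC 4627, section 3).  The first two characters of a
    # strict JSON text (object / array) are always ASCII, so the
    # pattern of NUL bytes tells the UTF flavours apart.
    if len(head) < 4:
        return "utf-8"
    nul = [byte == 0 for byte in head[:4]]
    if nul[0] and nul[1] and nul[2]:
        return "utf-32-be"   # 00 00 00 xx
    if nul[0] and nul[2]:
        return "utf-16-be"   # 00 xx 00 xx
    if nul[1] and nul[2] and nul[3]:
        return "utf-32-le"   # xx 00 00 00
    if nul[1] and nul[3]:
        return "utf-16-le"   # xx 00 xx 00
    return "utf-8"

print(detect_json_encoding('["hi"]'.encode("utf-16-le")))  # utf-16-le
```

Note the rule only holds for texts that follow the strict object/array grammar; a bare string literal defeats it.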


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread John Arbash Meinel
Martin v. Löwis wrote:
 I don't have numbers on how much that would improve CPU times, I would
 imagine improving 'intern()' would impact import times more than run
 times, simply because import time is interning a *lot* of strings.

 Though honestly, Bazaar would really like this, because startup overhead
 for us is almost 400ms to 'do nothing', which is a lot for a command
 line app.
 
 Maybe I misunderstand your proposed change: how could the representation
 of the interning dict possibly change the runtime of interning? (let
 alone significantly)
 
 Regards,
 Martin
 

Decreasing memory consumption lets more things fit in cache. Once the
size of 'interned' is greater than fits into L2 cache, you start paying
the cost of a full memory fetch, which is usually measured in 100s of
cpu cycles.

Avoiding double lookups in the dictionary would be less overhead, though
the second lookup is probably pretty fast if there are no collisions,
since everything would already be in the local CPU cache.

If we were dealing in objects that were KB in size, it wouldn't matter.
But as the intern dict quickly gets into MB, it starts to make a bigger
difference.

How big of a difference would be very CPU and dataset size specific. But
certainly caches make certain things much faster, and once you overflow
a cache, performance can take a surprising turn.

So my primary goal is certainly a decrease of memory consumption. I
think it will have a small knock-on effect of improving performance, I
don't have anything to give concrete numbers.

Also, consider that resizing has to evaluate every object, thus paging
in all X bytes, and assigning to another 2X bytes. Cutting X by
(potentially 3), would probably have a small but measurable effect.

John
=:-
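The space John is describing can be eyeballed from Python itself, with the caveat that these are rough, version-dependent numbers and the proposal concerns the C-level table layout; the names below are illustrative only:

```python
import sys

# Model the two candidate interning tables: the current dict maps each
# string to itself; the proposed structure is essentially a set.
strings = ["sym_%d" % i for i in range(25000)]
as_dict = {s: s for s in strings}
as_set = set(strings)

# getsizeof reports only the table itself, not the strings it holds;
# the actual sizes depend on each CPython version's dict/set growth
# policy, so compare on your own interpreter rather than trusting a
# quoted figure.
print("dict table:", sys.getsizeof(as_dict), "bytes")
print("set table: ", sys.getsizeof(as_set), "bytes")
```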


Re: [Python-Dev] BLOBs in Pg

2009-04-09 Thread Steve Holden
Tony Nelson wrote:
 At 21:24 +0400 04/09/2009, Oleg Broytmann wrote:
 On Thu, Apr 09, 2009 at 01:14:21PM -0400, Tony Nelson wrote:
 I use MySQL, but sort of intend to learn PostgreSQL.  I didn't know that
 PostgreSQL has no real support for BLOBs.
   I think it has - BYTEA data type.
 
 So it does; I see that now that I've opened up the PostgreSQL docs.  I
 don't find escaping data to be a problem -- I do it for all untrusted data.
 
You shouldn't have to when you are using parameterized queries.

 So, after all, there isn't an example of a database that makes onerous the
 storing of email and other such byte-oriented data, and Python's email
 package has no need for workarounds in that area.

Create a table:

CREATE TABLE tst
(
   id serial,
   byt bytea,
PRIMARY KEY (id)
) WITH (OIDS=FALSE)
;
ALTER TABLE tst OWNER TO steve;

The following program prints 0:

import psycopg2 as db
conn = db.connect(database="maildb", user="@@@", password="@@@",
host="localhost", port=5432)
curs = conn.cursor()
curs.execute("DELETE FROM tst")
curs.execute("INSERT INTO tst (byt) VALUES (%s)",
 ("".join(chr(i) for i in range(256)), ))
conn.commit()
curs.execute("SELECT byt FROM tst")
for st, in curs.fetchall():
    print len(st)

If I change the data to use range(1, 256) I get a ProgrammingError from
PostgreSQL: invalid input syntax for type "bytea".

If I can't pass a 256-byte string into a BLOB and get it back without
anything like this happening then there's *something* in the chain that
makes the database useless. My current belief is that this something is
fairly deeply embedded in the PostgreSQL engine. No syntax should be
necessary.

I suppose if we have to go round again on this we should take it to
email as we have gotten pretty far off-topic for python-dev.

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Watch PyCon on video now!  http://pycon.blip.tv/



Re: [Python-Dev] BLOBs in Pg

2009-04-09 Thread Aahz
On Thu, Apr 09, 2009, Steve Holden wrote:

 import psycopg2 as db
 conn = db.connect(database="maildb", user="@@@", password="@@@",
 host="localhost", port=5432)
 curs = conn.cursor()
 curs.execute("DELETE FROM tst")
 curs.execute("INSERT INTO tst (byt) VALUES (%s)",
  ("".join(chr(i) for i in range(256)), ))
 conn.commit()
 curs.execute("SELECT byt FROM tst")
 for st, in curs.fetchall():
     print len(st)
 
 If I change the data to use range(1, 256) I get a ProgrammingError from
 PostgreSQL: invalid input syntax for type "bytea".
 
 If I can't pass a 256-byte string into a BLOB and get it back without
 anything like this happening then there's *something* in the chain that
 makes the database useless. My current belief is that this something is
 fairly deeply embedded in the PostgreSQL engine. No syntax should be
 necessary.

You're not using a parameterized query.  I suggest you post to c.l.py for
more information.  ;-)
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

Why is this newsgroup different from all other newsgroups?


Re: [Python-Dev] BLOBs in Pg

2009-04-09 Thread Oleg Broytmann
On Thu, Apr 09, 2009 at 04:42:21PM -0400, Steve Holden wrote:
 If I can't pass a 256-byte string into a BLOB and get it back without
 anything like this happening then there's *something* in the chain that
 makes the database useless.

import psycopg2

con = psycopg2.connect(database="test")
cur = con.cursor()
cur.execute("CREATE TABLE test (id serial, data BYTEA)")
cur.execute('INSERT INTO test (data) VALUES (%s)', 
(psycopg2.Binary(''.join([chr(i) for i in range(256)])),))
cur.execute('SELECT * FROM test ORDER BY id')
for rec in cur.fetchall():
   print rec[0], type(rec[1]), repr(str(rec[1]))

Result:

1 <type 'buffer'> '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

   What am I doing wrong?

Oleg.
-- 
 Oleg Broytmannhttp://phd.pp.ru/p...@phd.pp.ru
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Bob Ippolito
On Thu, Apr 9, 2009 at 1:05 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 I can understand that you don't want to spend much time on it. How
 about removing it from 3.1? We could re-add it when long-term support
 becomes more likely.

 I'm speechless.

 It seems that my statement has surprised you, so let me explain:

 I think we should refrain from making design decisions (such as
 API decisions) without Bob's explicit consent, unless we assign
 a new maintainer for the simplejson module (perhaps just for the
 3k branch, which perhaps would be a fork from Bob's code).

 Antoine suggests that Bob did not comment on the issues at hand,
 therefore, we should not proceed with the proposed design. Since
 the 3.1 release is only a few weeks ahead, we have the choice of
 either shipping with the broken version that is currently in the
 3k branch, or drop the module from the 3k branch. I believe our
 users are better served by not having to waste time with a module
 that doesn't quite work, or may change.

Most of my time to spend on json/simplejson and these mailing list
discussions is on weekends, I try not to bother with it when I'm busy
doing Actual Work unless there is a bug or some other issue that needs
more immediate attention. I also wasn't aware that I was expected to
comment on those issues. I'm CC'ed on the discussion for issue4136 but
I don't see any unanswered questions directed at me.

I have the issues (issue5723, issue4136) starred in my gmail and I
planned to look at it more closely later, hopefully on Friday or
Saturday.

As far as Python 3 goes, I honestly have not yet familiarized myself
with the changes to the IO infrastructure and what the new idioms are.
At this time, I can't make any educated decisions with regard to how
it should be done because I don't know exactly how bytes are supposed
to work and what the common idioms are for other libraries in the
stdlib that do similar things. Until I figure that out, someone else
is better off making decisions about the Python 3 version. My guess is
that it should work the same way as it does in Python 2.x: take bytes
or unicode input in loads (which means encoding is still relevant). I
also think the output of dumps should also be bytes, since it is a
serialization, but I am not sure how other libraries do this in Python
3 because one could argue that it is also text. If other libraries
that do text/text encodings (e.g. binascii, mimelib, ...) use str for
input and output instead of bytes then maybe Antoine's changes are the
right solution and I just don't know better because I'm not up to
speed with how people write Python 3 code.

I'll do my best to find some time to look into Python 3 more closely
soon, but thus far I have not been very motivated to do so because
Python 3 isn't useful for us at work and twiddling syntax isn't a very
interesting problem for me to solve.

-bob


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Martin v. Löwis
 As far as Python 3 goes, I honestly have not yet familiarized myself
 with the changes to the IO infrastructure and what the new idioms are.
 At this time, I can't make any educated decisions with regard to how
 it should be done because I don't know exactly how bytes are supposed
 to work and what the common idioms are for other libraries in the
 stdlib that do similar things.

It's really very similar to 2.x: the bytes type is to be used in all
interfaces that operate on byte sequences that may or may not represent
characters; in particular, for interfaces where the operating system
deliberately uses bytes - i.e. low-level file IO and socket IO; also
for cases where the encoding is embedded in the stream that still
needs to be processed (e.g. XML parsing).

(Unicode) strings should be used where the data is truly text by
nature, i.e. where no encoding information is necessary to find out
what characters are intended. It's used on interfaces where the
encoding is known (e.g. text IO, where the encoding is specified
on opening, XML parser results, with the declared encoding, and
GUI libraries, which naturally expect text).

 Until I figure that out, someone else
 is better off making decisions about the Python 3 version.

Some of us can certainly explain to you how this is supposed to
work. However, we need you to check any assumption against the
known use cases - would the users of the module be happy if it
worked one way or the other?

 My guess is
 that it should work the same way as it does in Python 2.x: take bytes
 or unicode input in loads (which means encoding is still relevant). I
 also think the output of dumps should also be bytes, since it is a
 serialization, but I am not sure how other libraries do this in Python
 3 because one could argue that it is also text.

This, indeed, had been an endless debate, and, in the end, the decision
was somewhat arbitrary. Here are some examples:

- base64.encodestring expects bytes (naturally, since it is supposed to
  encode arbitrary binary data), and produces bytes (debatably)
- binascii.b2a_hex likewise (expect and produce bytes)
- pickle.dumps produces bytes (uniformly, both for binary and text
  pickles)
- marshal.dumps likewise
- email.message.Message().as_string produces a (unicode) string
  (see Barry's recent thread on whether that's a good thing; the
  email package hasn't been fully ported to 3k, either)
- the XML libraries (continue to) parse bytes, and produce
  Unicode strings
- for the IO libraries, see above
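That inventory can be spot-checked in a modern Python 3; note that base64.encodestring was later renamed encodebytes, so b64encode stands in for it here, and json.dumps shows the str choice under discussion:

```python
import base64
import binascii
import json
import marshal
import pickle

# Which serializers hand back bytes and which hand back str?
samples = {
    "base64.b64encode": base64.b64encode(b"data"),
    "binascii.b2a_hex": binascii.b2a_hex(b"data"),
    "pickle.dumps": pickle.dumps({"a": 1}),
    "marshal.dumps": marshal.dumps(42),
    "json.dumps": json.dumps({"a": 1}),
}
for name, value in samples.items():
    print(name, "->", type(value).__name__)
```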

 If other libraries
 that do text/text encodings (e.g. binascii, mimelib, ...) use str for
 input and output

See above - most of them don't; mimetools is gone entirely (replaced by
the email package)

 instead of bytes then maybe Antoine's changes are the
 right solution and I just don't know better because I'm not up to
 speed with how people write Python 3 code.

There isn't too much fresh end-user code out there, so we can't really
tell, either. As for standard library users - users will do whatever
the library forces them to do.

This is why I'm so concerned about this issue: we should get it right,
or not done at all. I still think you would be the best person to
determine what is right.

 I'll do my best to find some time to look into Python 3 more closely
 soon, but thus far I have not been very motivated to do so because
 Python 3 isn't useful for us at work and twiddling syntax isn't a very
 interesting problem for me to solve.

And I didn't expect you to - it seems people are quite willing to do
the actual work, as long as there is some guidance.

Regards,
Martin


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Martin v. Löwis
 Also, consider that resizing has to evaluate every object, thus paging
 in all X bytes, and assigning to another 2X bytes. Cutting X by
 (potentially 3), would probably have a small but measurable effect.

I'm *very* skeptical about claims on performance in the absence of
actual measurements. Too many effects come together, so the actual
performance is difficult to predict (and, for that prediction, you
would need *at least* a work load that you want to measure - starting
bzr would be such a workload, of course).

Regards,
Martin
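A minimal measurement of the kind Martin asks for might start like this sketch (modern Python 3, where intern() lives in sys; the string set is synthetic, so the numbers are illustrative rather than a real workload like starting bzr):

```python
import sys
import timeit

words = ["name_%d" % i for i in range(25000)]

# The first pass inserts 25k new strings into the intern table;
# the second pass only has to find them again.
first = timeit.timeit(lambda: [sys.intern(w) for w in words], number=1)
again = timeit.timeit(lambda: [sys.intern(w) for w in words], number=1)
print("insert: %.6fs  lookup: %.6fs" % (first, again))
```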


Re: [Python-Dev] BLOBs in Pg

2009-04-09 Thread Steve Holden
Oleg Broytmann wrote:
 On Thu, Apr 09, 2009 at 04:42:21PM -0400, Steve Holden wrote:
 If I can't pass a 256-byte string into a BLOB and get it back without
 anything like this happening then there's *something* in the chain that
 makes the database useless.
 
 import psycopg2
 
 con = psycopg2.connect(database="test")
 cur = con.cursor()
 cur.execute("CREATE TABLE test (id serial, data BYTEA)")
 cur.execute('INSERT INTO test (data) VALUES (%s)', 
 (psycopg2.Binary(''.join([chr(i) for i in range(256)])),))
 cur.execute('SELECT * FROM test ORDER BY id')
 for rec in cur.fetchall():
print rec[0], type(rec[1]), repr(str(rec[1]))
 
 Result:
 
 1 <type 'buffer'> '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
 
What am I doing wrong?
 
 Oleg.

Corresponding with me, probably. Thank you Oleg. I feel suddenly saner
again.

regards
 Steve
-- 
Steve Holden   +1 571 484 6266   +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Watch PyCon on video now!  http://pycon.blip.tv/



Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Jake McGuire

On Apr 9, 2009, at 12:06 PM, Martin v. Löwis wrote:

Now that you brought up a specific numbers, I tried to verify them,
and found them correct (although a bit unfortunate), please see my
test script below. Up to 21800 interned strings, the dict takes (only)
384kiB. It then grows, requiring 1536kiB. Whether or not having 22k
interned strings is typical, I still don't know.

Wrt. your proposed change, I would be worried about maintainability,
in particular if it would copy parts of the set implementation.



I connected to a random one of our processes, which has been running  
for a typical amount of time and is currently at ~300MB RSS.


(gdb) p *(PyDictObject*)interned
$2 = {ob_refcnt = 1,
  ob_type = 0x8121240,
  ma_fill = 97239,
  ma_used = 95959,
  ma_mask = 262143,
  ma_table = 0xa493c008,
  }

Going from 3MB to 2.25MB isn't much, but it's not nothing, either.

I'd be skeptical of cache performance arguments given that the strings  
used in any particular bit of code should be spread pretty much evenly  
throughout the hash table, and 3MB seems solidly bigger than any L2  
cache I know of.  You should be able to get meaningful numbers out of  
a C profiler, but I'd be surprised to see the act of interning taking  
a noticeable amount of time.


-jake


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Greg Ewing

John Arbash Meinel wrote:

And when you look at the intern function, it doesn't use
setdefault logic, it actually does a get() followed by a set(), which
means the cost of interning is 1-2 lookups depending on likelihood, etc.


Keep in mind that intern() is called fairly rarely, mostly
only at module load time. It may not be worth attempting
to speed it up.

--
Greg
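For what it's worth, the single-lookup variant John wants is easy to model in pure Python with dict.setdefault; this is a sketch of the idea only, not the real intern(), which is C code:

```python
# A hypothetical pure-Python model of intern() using dict.setdefault(),
# which probes and inserts in a single hash-table traversal instead of
# the get()-then-set() pair being discussed.
_interned = {}

def intern_str(s):
    # Returns the canonical copy if one exists, otherwise stores s
    # and returns it - one lookup either way.
    return _interned.setdefault(s, s)

a = intern_str("".join(["he", "llo"]))   # a freshly built "hello"
b = intern_str("hello")
print(a is b)  # True: both callers now share one object
```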


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Benjamin Peterson
2009/4/9 Greg Ewing greg.ew...@canterbury.ac.nz:
 John Arbash Meinel wrote:

 And when you look at the intern function, it doesn't use
 setdefault logic, it actually does a get() followed by a set(), which
 means the cost of interning is 1-2 lookups depending on likelihood, etc.

 Keep in mind that intern() is called fairly rarely, mostly
 only at module load time. It may not be worth attempting
 to speed it up.

That's very important, though, for a command-line tool like Bazaar.
Even a few fractions of a second can make a difference in user
perception of speed.



-- 
Regards,
Benjamin


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread John Arbash Meinel
Greg Ewing wrote:
 John Arbash Meinel wrote:
 And the way intern is currently
 written, there is a third cost when the item doesn't exist yet, which is
 another lookup to insert the object.
 
 That's even rarer still, since it only happens the first
 time you load a piece of code that uses a given variable
 name anywhere in any module.
 

Somewhat true, though I know it happens 25k times during startup of
bzr... And I would be a *lot* happier if startup time was 100ms instead
of 400ms.

John
=:-



Re: [Python-Dev] Evaluated cmake as an autoconf replacement

2009-04-09 Thread Neil Hodgson
   cmake does not produce relative paths in its generated make and
project files. There is an option CMAKE_USE_RELATIVE_PATHS which
appears to do this but the documentation says:

This option does not work for more complicated projects, and
relative paths are used when possible. In general, it is not possible
to move CMake generated makefiles to a different location regardless
of the value of this variable.

   This means that generated Visual Studio project files will not work
for other people unless a particular absolute build location is
specified for everyone which will not suit most. Each person that
wants to build Python will have to run cmake before starting Visual
Studio thus increasing the prerequisites.

   Neil


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 8:07 AM, Steve Holden wrote:

The real problem I came across in storing email in a relational database
was the inability to store messages as Unicode. Some messages have a
body in one encoding and an attachment in another, so the only ways to
store the messages are either as a monolithic bytes string that gets
parsed when the individual components are required or as a sequence of
components in the database's preferred encoding (if you want to keep the
original encoding most relational databases won't be able to help unless
you store the components as bytes).

All in all, as you might expect from a system that's been growing up
since 1970 or so, it can be quite intractable.


There are really two ways to look at an email message.  It's either an
unstructured blob of bytes, or it's a structured tree of objects.
Those objects have headers and payload.  The payload can be of any
type, though I think it generally breaks down into strings for text/*
types and bytes for anything else (not counting multiparts).


The email package isn't a perfect mapping to this, which is something  
I want to improve.  That aside, I think storing a message in a  
database means storing some or all of the headers separately from the  
byte stream (or text?) of its payload.  That's for non-multipart  
types.  It would be more complicated to represent a message tree of  
course.


It does seem to make sense to think about headers as text header names  
and text header values.  Of course, header values can contain almost  
anything and there's an encoding to bring it back to 7-bit ASCII, but  
again, you really have two views of a header value.  Which you want  
really depends on your application.


Maybe you just care about the text of both the header name and value.   
In that case, I think you want the values as unicodes, and probably  
the headers as unicodes containing only ASCII.  So your table would be  
strings in both cases.  OTOH, maybe your application cares about the  
raw underlying encoded data, in which case the header names are  
probably still strings of ASCII-ish unicodes and the values are  
bytes.  It's this distinction (and I think the competing use cases)  
that make a true Python 3.x API for email more complicated.


Thinking about this stuff makes me nostalgic for the sloppy happy days  
of Python 2.x


-Barry





Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 11:08 AM, Bill Janssen wrote:


Barry Warsaw ba...@python.org wrote:


Anyway, aside from that decision, I haven't come up with an
elegant way to allow /output/ in both bytes and strings (input is I
think theoretically easier by sniffing the arguments).


Probably a good thing.  It just promotes more confusion to do things
that way, IMO.


Very possibly so.  But applications will definitely want stuff like  
the text/plain payload as a unicode, or the image/gif payload as a  
bytes (or even as a PIL image or whatever).


Not that I think the email package needs to know about every content  
type under the sun, but I do think that it should be pluggable so as  
to allow applications to more conveniently access the data that way.   
Possibly the defaults should be unicodes for any text/* type and bytes  
for everything else.


-Barry





Re: [Python-Dev] the email module, text, and bytes (was Re: Dropping bytes support in json)

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 11:11 PM, gl...@divmod.com wrote:

I think this is a problematic way to model bytes vs. text; it gives  
text a special relationship to bytes which should be avoided.


IMHO the right way to think about domains like this is a multi-level  
representation.  The low level representation is always bytes,  
whether your MIME type is text/whatever or application/x-i-dont-know.


This is a really good point, and I really should be clearer when  
describing my current thinking (sleep would help :).


The thing that's special about text is that it's a high level  
representation that the standard library can know about.  But the  
'email' package ought to support being extended to support other  
types just as well.  For example, I want to ask for image/png  
content as PIL.Image objects, not bags of bytes.  Of course this  
presupposes some way for PIL itself to get at some bytes, but then  
you need the email module itself to get at the bytes to convert to  
text in much the same way.  There also needs to be layering at the  
level of bytes-base64-some different bytes-PIL-Image.  There are  
mail clients that will base64-encode unusual encodings so you have  
to do that same layering for text sometimes.


I'm also being somewhat handwavy with talk of low and high level  
representations; of course there are actually multiple levels beyond  
that.  I might want text/x-python content to show up as an AST, but  
the intermediate DOM-parsing representation really wants to operate  
on characters.  Similarly for a DOM and text/html content.  (Modulo  
the usual encoding-detection weirdness present in parsers.)


When I was talking about supporting text/* content types as strings, I  
was definitely thinking about using basically the same plug-in or  
higher level or whatever API to do that as you might use to get PIL  
images from an image/gif.


So, as long as there's a crisp definition of what layer of the MIME  
stack one is operating on, I don't think that there's really any  
ambiguity at all about what type you should be getting.


In that case, we really need the bytes-in-bytes-out-bytes-in-the-chewy-center
API first, and build things on top of that.


-Barry





Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 10:52 PM, Aahz wrote:


On Thu, Apr 09, 2009, Barry Warsaw wrote:


So, what I'm really asking is this.  Let's say you agree that there are
use cases for accessing a header value as either the raw encoded bytes
or the decoded unicode.  What should this return:

message['Subject']

The raw bytes or the decoded unicode?


Let's make that the raw bytes by default -- we can add a parameter to
Message() to specify that the default where possible is unicode for
returned values, if that isn't too painful.


I don't know whether the parameter thing will work or not, but you're  
probably right that we need to get the bytes-everywhere API first.


-Barry
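The two views live side by side in today's email package's RFC 2047 helpers, which makes a handy illustration; the Subject value below is made up:

```python
from email.header import decode_header, make_header

# One header, two views: the RFC 2047 wire form versus decoded text.
wire = "=?iso-8859-1?q?p=F6stal?="   # hypothetical Subject value
parts = decode_header(wire)          # the raw bytes plus their charset
print(parts)                         # [(b'p\xf6stal', 'iso-8859-1')]
print(str(make_header(parts)))       # the decoded unicode view
```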





Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Nick Coghlan
Barry Warsaw wrote:
 I don't know whether the parameter thing will work or not, but you're
 probably right that we need to get the bytes-everywhere API first.

Given that json is a wire protocol, that sounds like the right approach
for json as well. Once bytes-everywhere works, then a text API can be
built on top of it, but it is difficult to build a bytes API on top of a
text one.

So I guess the IO library *is* the right model: bytes at the bottom of
the stack, with text as a wrapper around it (mediated by codecs).
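
The io module already embodies that model; as a minimal illustration of the layering (plain io, not json or email):

```python
import io

# Bytes at the bottom of the stack...
raw = io.BytesIO()

# ...with a text wrapper around it, mediated by a codec.
text = io.TextIOWrapper(raw, encoding="utf-8")

text.write("caf\u00e9")
text.flush()

# The text layer wrote encoded bytes into the bottom layer.
print(raw.getvalue())  # b'caf\xc3\xa9'
```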

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 2:25 PM, Martin v. Löwis wrote:

This is an interesting question, and something I'm struggling with  
for
the email package for 3.x.  It turns out to be pretty convenient to  
have

both a bytes and a string API, both for input and output, but I think
email really wants to be represented internally as bytes.  Maybe.  Or
maybe just for content bodies and not headers, or maybe both.   
Anyway,
aside from that decision, I haven't come up with an elegant way to  
allow

/output/ in both bytes and strings (input is I think theoretically
easier by sniffing the arguments).


If you allow for content-transfer-encoding: 8bit, I think there is  
just

no way to represent email as text. You have to accept conversion to,
say, base64 (or quoted-unreadable) when converting an email message to
text.


Agreed.  But applications will want to deal with some parts of the  
message as text on the boundaries.  Internally, it should be all bytes  
(although even that is a pain to write ;).


-Barry





[Python-Dev] the email module, text, and bytes (was Re: Dropping bytes support in json)

2009-04-09 Thread glyph


On 02:26 am, ba...@python.org wrote:
There are really two ways to look at an email message.  It's either an 
unstructured blob of bytes, or it's a structured tree of objects. 
Those objects have headers and payload.  The payload can be of any 
type, though I think it generally breaks down into strings for text/ 
* types and bytes for anything else (not counting multiparts).


I think this is a problematic way to model bytes vs. text; it gives text 
a special relationship to bytes which should be avoided.


IMHO the right way to think about domains like this is a multi-level 
representation.  The low level representation is always bytes, whether 
your MIME type is text/whatever or application/x-i-dont-know.


The thing that's special about text is that it's a high level 
representation that the standard library can know about.  But the 
'email' package ought to support being extended to support other types 
just as well.  For example, I want to ask for image/png content as 
PIL.Image objects, not bags of bytes.  Of course this presupposes some 
way for PIL itself to get at some bytes, but then you need the email 
module itself to get at the bytes to convert to text in much the same 
way.  There also needs to be layering at the level of 
bytes -> base64 -> some different bytes -> PIL.Image.  There are mail clients 
that will base64-encode unusual encodings so you have to do that same 
layering for text sometimes.


I'm also being somewhat handwavy with talk of low and high level 
representations; of course there are actually multiple levels beyond 
that.  I might want text/x-python content to show up as an AST, but the 
intermediate DOM-parsing representation really wants to operate on 
characters.  Similarly for a DOM and text/html content.  (Modulo the 
usual encoding-detection weirdness present in parsers.)


So, as long as there's a crisp definition of what layer of the MIME 
stack one is operating on, I don't think that there's really any 
ambiguity at all about what type you should be getting.



Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Aahz
On Thu, Apr 09, 2009, Barry Warsaw wrote:

 So, what I'm really asking is this.  Let's say you agree that there are 
 use cases for accessing a header value as either the raw encoded bytes or 
 the decoded unicode.  What should this return:

  message['Subject']

 The raw bytes or the decoded unicode?

Let's make that the raw bytes by default -- we can add a parameter to
Message() to specify that the default where possible is unicode for
returned values, if that isn't too painful.

Here's my reasoning: ultimately, everyone NEEDS to understand that the
underlying transport for e-mail is bytes (similar to sockets).  We do
people no favors by pasting over this too much.  We can overlay
convenience at various points, but except for text payloads, everything
should be bytes by default.  

Even for text payloads, I'm not entirely certain the default shouldn't be
bytes: consider an HTML attachment that you want to compare against the
output from a webserver.  Still, as long as it's easy to get bytes for
text payloads, I think overall I'm still leaning toward unicode for them.
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

Why is this newsgroup different from all other newsgroups?


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 11:55 AM, Daniel Stutzbach wrote:


On Thu, Apr 9, 2009 at 6:01 AM, Barry Warsaw ba...@python.org wrote:
Anyway, aside from that decision, I haven't come up with an elegant  
way to allow /output/ in both bytes and strings (input is I think  
theoretically easier by sniffing the arguments).


Won't this work? (assuming dumps() always returns a string)

def dumpb(obj, encoding='utf-8', *args, **kw):
    s = dumps(obj, *args, **kw)
    return s.encode(encoding)
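
For illustration, the same wrapper run against the stdlib json module (assuming a 3.x-style dumps() that returns str):

```python
import json

def dumpb(obj, encoding='utf-8', *args, **kw):
    # Encode the str produced by dumps() into bytes explicitly.
    s = json.dumps(obj, *args, **kw)
    return s.encode(encoding)

print(dumpb({'key': 'value'}))  # b'{"key": "value"}'
```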


So, what I'm really asking is this.  Let's say you agree that there  
are use cases for accessing a header value as either the raw encoded  
bytes or the decoded unicode.  What should this return:


 message['Subject']

The raw bytes or the decoded unicode?

Okay, so you've picked one.  Now how do you spell the other way?

The Message class probably has these explicit methods:

 Message.get_header_bytes('Subject')
 Message.get_header_string('Subject')

(or better names... it's late and I'm tired ;).  One of those maps to  
message['Subject'] but which is the more obvious choice?


Now, setting headers.  Sometimes you have some unicode thing and  
sometimes you have some bytes.  You need to end up with bytes in the  
ASCII range and you'd like to leave the header value unencoded if so.   
But in both cases, you might have bytes or characters outside that  
range, so you need an explicit encoding, defaulting to utf-8 probably.


 Message.set_header('Subject', 'Some text', encoding='utf-8')
 Message.set_header('Subject', b'Some bytes')

One of those maps to

 message['Subject'] = ???

I'm open to any suggestions here!
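
One existing stdlib building block for the explicit-encoding case is email.header.Header, shown here purely as illustration (not a proposal for the new API):

```python
from email.header import Header

# ASCII-only values can pass through unencoded...
print(Header('Some text').encode())  # Some text

# ...while values outside the ASCII range get RFC 2047-encoded,
# with an explicit charset (utf-8 here).
print(Header('S\u00f6me text', 'utf-8').encode())
```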
-Barry





Re: [Python-Dev] email package Bytes vs Unicode (was Re: Dropping bytes support in json)

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 12:20 PM, Steve Holden wrote:


PostgreSQL strongly encourages you to store text as encoded columns.
Because emails lack an encoding it turns out this is a most  
inconvenient
storage type for it. Sadly BLOBs are such a pain in PostgreSQL that  
it's

easier to store the messages in external files and just use the
relational database to index those files to retrieve content, so  
that's

what I ended up doing.


That's not insane for other reasons.  Do you really want to store 10MB  
of mp3 data in your database?


Which of course reminds me that I want to add an interface, probably  
to the parser and message class, to allow an application to store  
message payloads in other than memory.  Parsing and holding onto  
messages with huge payloads can kill some applications, when you might  
not care too much about the actual payload content.


Barry





Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread Barry Warsaw

On Apr 9, 2009, at 11:21 PM, Nick Coghlan wrote:


Barry Warsaw wrote:

I don't know whether the parameter thing will work or not, but you're
probably right that we need to get the bytes-everywhere API first.


Given that json is a wire protocol, that sounds like the right  
approach

for json as well. Once bytes-everywhere works, then a text API can be
built on top of it, but it is difficult to build a bytes API on top  
of a

text one.


Agreed!


So I guess the IO library *is* the right model: bytes at the bottom of
the stack, with text as a wrapper around it (mediated by codecs).


Yes, that's a very interesting (and proven?) model.  I don't quite see  
how we could apply that to email and json, but it seems like there's a  
good idea there. ;)


-Barry





Re: [Python-Dev] [Email-SIG] Dropping bytes support in json

2009-04-09 Thread Tony Nelson
At 22:38 -0400 04/09/2009, Barry Warsaw wrote:
 ...
So, what I'm really asking is this.  Let's say you agree that there
are use cases for accessing a header value as either the raw encoded
bytes or the decoded unicode.  What should this return:

  message['Subject']

The raw bytes or the decoded unicode?

That's an easy one:  Subject: is an unstructured header, so it must be
text, thus Unicode.  We're looking at a high-level representation of an
email message, with parsed header fields and a MIME message tree.


Okay, so you've picked one.  Now how do you spell the other way?

message.get_header_bytes('Subject')

Oh, I see that's what you picked.

The Message class probably has these explicit methods:

  Message.get_header_bytes('Subject')
  Message.get_header_string('Subject')

(or better names... it's late and I'm tired ;).  One of those maps to
message['Subject'] but which is the more obvious choice?

Structured header fields are more of a problem.  Any header with addresses
should return a list of addresses.  I think the default return type should
depend on the data type.  To get an explicit bytes or string or list of
addresses, be explicit; otherwise, for convenience, return the appropriate
type for the particular header field name.


Now, setting headers.  Sometimes you have some unicode thing and
sometimes you have some bytes.  You need to end up with bytes in the
ASCII range and you'd like to leave the header value unencoded if so.
But in both cases, you might have bytes or characters outside that
range, so you need an explicit encoding, defaulting to utf-8 probably.

Never for header fields.  The default is always RFC 2047, unless it isn't,
say for params.

The Message class should create an object of the appropriate subclass of
Header based on the name (or use the existing object, see other
discussion), and that should inspect its argument and DTRT or complain.


  Message.set_header('Subject', 'Some text', encoding='utf-8')
  Message.set_header('Subject', b'Some bytes')

One of those maps to

  message['Subject'] = ???

The expected data type should depend on the header field.  For Subject:, it
should be bytes to be parsed or verbatim text.  For To:, it should be a
list of addresses or bytes or text to be parsed.

The email package should be pythonic, and not require deep understanding of
dozens of RFCs to use properly.  Users don't need to know about the raw
bytes; that's the whole point of MIME and any email package.  It should be
easy to set header fields with their natural data types, and doing it with
bad data should produce an error.  This may require a bit more care in the
message parser, to always produce a parsed message with defects.
-- 

TonyN.:'   mailto:tonynel...@georgeanelson.com
  '  http://www.georgeanelson.com/


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Mike Klaas


On 9-Apr-09, at 6:24 PM, John Arbash Meinel wrote:


Greg Ewing wrote:

John Arbash Meinel wrote:

And the way intern is currently
written, there is a third cost when the item doesn't exist yet,  
which is

another lookup to insert the object.


That's even rarer still, since it only happens the first
time you load a piece of code that uses a given variable
name anywhere in any module.



Somewhat true, though I know it happens 25k times during startup of
bzr... And I would be a *lot* happier if startup time was 100ms  
instead

of 400ms.


I don't want to quash your idealism too severely, but it is extremely  
unlikely that you are going to get anywhere near that kind of speed up  
by tweaking string interning.  25k times doing anything (computation)  
just isn't all that much.


$ python -mtimeit -s 'd=dict.fromkeys(xrange(10**7))' 'for x in  
xrange(25000): d.get(x)'

100 loops, best of 3: 8.28 msec per loop

Perhaps this isn't representative (int hashing is ridiculously cheap,  
for instance), but the dict itself is far bigger than the dict you are  
dealing with and such would have similar cache-busting properties.   
And yet, 25k accesses (plus python-c dispatching costs which you are  
paying with interning) consume only ~10ms.  You could do more good by  
eliminating a handful of disk seeks by reducing the number of imported  
modules...


-Mike


Re: [Python-Dev] decorator module in stdlib?

2009-04-09 Thread Guido van Rossum
On Wed, Apr 8, 2009 at 9:31 PM, Michele Simionato
michele.simion...@gmail.com wrote:
 Then perhaps you misunderstand the goal of the decorator module.
 The raison d'etre of the module is to PRESERVE the signature:
 update_wrapper unfortunately *changes* it.

 When confronted with a library which I do not not know, I often run
 over it pydoc, or sphinx, or a custom made documentation tool, to extract the
 signature of functions.

Ah, I see. Personally I rarely trust automatically extracted
documentation -- too often in my experience it is out of date or
simply absent. Extracting the signatures in theory wouldn't lie, but
in practice I still wouldn't trust it -- not only because of what
decorators might or might not do, but because it might still be
misleading. Call me old-fashioned, but I prefer to read the source
code.

 For instance, if I see a method
 get_user(self, username) I have a good hint about what it is supposed
 to do. But if the library (say a web framework) uses non signature-preserving
 decorators, my documentation tool says to me that there is function
 get_user(*args, **kwargs) which frankly is not enough [this is the
 optimistic case, when the author of the decorator has taken care
 to preserve the name of the original function].

But seeing the decorator is often essential for understanding what
goes on! Even if the decorator preserves the signature (in truth or
according to inspect), many decorators *do* something, and it's important
to know how a function is decorated. For example, I work a lot with a
small internal framework at Google whose decorators can raise
exceptions and set instance variables; they also help me understand
under which conditions a method can be called.

  I *hate* losing information about the true signature of functions, since I 
 also
 use a lot IPython, Python help, etc.

I guess we just have different styles. That's fine.

 I must admit that while I still like decorators, I do like them as
 much as in the past.

 Of course there was a missing NOT in this sentence, but you all understood
 the intended meaning.

 (All this BTW is not to say that I don't trust you with commit
 privileges if you were to be interested in contributing. I just don't
 think that adding that particular decorator module to the stdlib would
 be wise. It can be debated though.)

 Fine. As I have repeated many time that particular module was never
 meant for inclusion in the standard library.

Then perhaps it shouldn't -- I haven't looked but if you don't plan
stdlib inclusion it is often the case that the API style and/or
implementation details make stdlib inclusion unrealistic. (Though
admittedly some older modules wouldn't be accepted by today's
standards either -- and I'm not just talking PEP-8 compliance! :-)

 But I feel strongly about
 the possibility of being able to preserve (not change!) the function
 signature.

That could be added to functools if enough people want it.

 I do not think everybody disagree with your point here. My point still
 stands, though: objects should not lie about their signature, especially
 during  debugging and when generating documentation from code.

Source code never lies. Debuggers should make access to the source
code a key point. And good documentation should be written by a human,
not automatically cobbled together from source code and a few doc
strings.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Email-SIG] Dropping bytes support in json

2009-04-09 Thread Tony Nelson
At 22:26 -0400 04/09/2009, Barry Warsaw wrote:

There are really two ways to look at an email message.  It's either an
unstructured blob of bytes, or it's a structured tree of objects.
Those objects have headers and payload.  The payload can be of any
type, though I think it generally breaks down into strings for text/
* types and bytes for anything else (not counting multiparts).

The email package isn't a perfect mapping to this, which is something
I want to improve.  That aside, I think storing a message in a
database means storing some or all of the headers separately from the
byte stream (or text?) of its payload.  That's for non-multipart
types.  It would be more complicated to represent a message tree of
course.

Storing an email message in a database does mean storing some of the header
fields as database fields, but the set of email header fields is open, so
any unused fields in a message must be stored elsewhere.  It isn't useful
to just have a bag of name/value pairs in a table.  General message MIME
payload trees don't map well to a database either, unless one wants to get
very relational.  Sometimes the database needs to represent the entire
email message, header fields and MIME tree, but only if it is an email
program and usually not even then.  Usually, the database has a specific
purpose, and can be designed for the data it cares about; it may choose to
keep the original message as bytes.


It does seem to make sense to think about headers as text header names
and text header values.  Of course, header values can contain almost
anything and there's an encoding to bring it back to 7-bit ASCII, but
again, you really have two views of a header value.  Which you want
really depends on your application.

I think of header fields as having text-like names (the set of allowed
characters is more than just text, though defined headers don't make use of
that), but the data is either bytes or it should be parsed into something
appropriate:  text for unstructured fields like Subject:, a list of
addresses for address fields like To:.  Many of the structured header
fields have a reasonable mapping to text; certainly this is true for address
header fields.  Content-Type header fields are barely text, they can be so
convolutedly structured, but I suppose one could flatten one of them to
text instead of bytes if the user wanted.  It's not very useful, though,
except for debugging (either by the programmer or the recipient who wants
to know what was cleaned from the message).


Maybe you just care about the text of both the header name and value.
In that case, I think you want the values as unicodes, and probably
the headers as unicodes containing only ASCII.  So your table would be
strings in both cases.  OTOH, maybe your application cares about the
raw underlying encoded data, in which case the header names are
probably still strings of ASCII-ish unicodes and the values are
bytes.  It's this distinction (and I think the competing use cases)
that make a true Python 3.x API for email more complicated.

If a database stores the Subject: header field, it would be as text.  The
various recipient address fields are a one message to many names and
addresses mapping, and need a related table of name/address fields, with
each field being text.  The original message (or whatever part of it one
preserves) should be bytes.  I don't think this complicates the email
package API; rather, it just shows where generality is needed.


Thinking about this stuff makes me nostalgic for the sloppy happy days
of Python 2.x

You now have the opportunity to finally unsnarl that mess.  It is not an
insurmountable opportunity.
-- 

TonyN.:'   mailto:tonynel...@georgeanelson.com
  '  http://www.georgeanelson.com/


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Jeffrey Yasskin
On Thu, Apr 9, 2009 at 6:24 PM, John Arbash Meinel
john.arbash.mei...@gmail.com wrote:
 Greg Ewing wrote:
 John Arbash Meinel wrote:
 And the way intern is currently
 written, there is a third cost when the item doesn't exist yet, which is
 another lookup to insert the object.

 That's even rarer still, since it only happens the first
 time you load a piece of code that uses a given variable
 name anywhere in any module.


 Somewhat true, though I know it happens 25k times during startup of
 bzr... And I would be a *lot* happier if startup time was 100ms instead
 of 400ms.

I think you have plenty of a case to try it out. If you code it up and
it doesn't speed anything up, well then we've learned something, and
maybe it'll be useful anyway for the memory savings. If it does speed
things up, well then Python's faster. I wouldn't waste time arguing
about it before you have the change written.

Good luck!
Jeffrey


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Collin Winter
On Thu, Apr 9, 2009 at 6:24 PM, John Arbash Meinel
john.arbash.mei...@gmail.com wrote:
 Greg Ewing wrote:
 John Arbash Meinel wrote:
 And the way intern is currently
 written, there is a third cost when the item doesn't exist yet, which is
 another lookup to insert the object.

 That's even rarer still, since it only happens the first
 time you load a piece of code that uses a given variable
 name anywhere in any module.


 Somewhat true, though I know it happens 25k times during startup of
 bzr... And I would be a *lot* happier if startup time was 100ms instead
 of 400ms.

Quite so. We have a number of internal tools, and they find that
frequently just starting up Python takes several times the duration of
the actual work unit itself. I'd be very interested to review any
patches you come up with to improve start-up time; so far on this
thread, there's been a lot of theory and not much practice. I'd
approach this iteratively: first replace the dict with a set, then if
that bears fruit, consider a customized data structure; if that bears
fruit, etc.

Good luck, and be sure to let us know what you find,
Collin Winter
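
For reference, the dict-based interning under discussion can be sketched in pure Python (a toy model of the idea, not CPython's actual C implementation):

```python
# One canonical object per distinct string value, so repeated names
# share storage and later comparisons can be identity checks.
_interned = {}

def intern_str(s):
    # setdefault covers both the hit and the miss-plus-insert case
    # with a single call at the Python level; the C code discussed
    # in this thread does a separate lookup to insert.
    return _interned.setdefault(s, s)

a = intern_str("".join(["var", "name"]))  # two equal but distinct strings
b = intern_str("".join(["var", "name"]))
print(a is b)  # True
```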


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread Guido van Rossum
On Thu, Apr 9, 2009 at 9:07 PM, Collin Winter coll...@gmail.com wrote:
 On Thu, Apr 9, 2009 at 6:24 PM, John Arbash Meinel 
 john.arbash.mei...@gmail.com wrote:

 And I would be a *lot* happier if startup time was 100ms instead
  of 400ms.

 Quite so. We have a number of internal tools, and they find that
 frequently just starting up Python takes several times the duration of
 the actual work unit itself. I'd be very interested to review any
 patches you come up with to improve start-up time; so far on this
 thread, there's been a lot of theory and not much practice. I'd
 approach this iteratively: first replace the dict with a set, then if
 that bears fruit, consider a customized data structure; if that bears
 fruit, etc.

 Good luck, and be sure to let us know what you find,

Just to add some skepticism, has anyone done any kind of
instrumentation of bzr start-up behavior?  IIRC every time I was asked
to reduce the start-up cost of some Python app, the cause was too many
imports, and the solution was either to speed up import itself (.pyc
files were the first thing ever that came out of that -- importing
from a single .zip file is one of the more recent tricks) or to reduce
the number of modules imported at start-up (or both :-). Heavy-weight
frameworks are usually the root cause, but usually there's nothing
that can be done about that by the time you've reached this point. So,
amen on the good luck, but please start with a bit of analysis.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Adding new features to Python 2.x (PEP 382: Namespace Packages)

2009-04-09 Thread Guido van Rossum
On Thu, Apr 9, 2009 at 5:53 AM, Aahz a...@pythoncraft.com wrote:
 On Thu, Apr 09, 2009, Nick Coghlan wrote:

 Martin v. Löwis wrote:
 Such a policy would then translate to a dead end for Python 2.x
 based applications.

 2.x based applications *are* in a dead end, with the only exit
 being porting to 3.x.

 The actual end of the dead end just happens to be in 2013 or so :)

 More like 2016 or 2020 -- as of January, my former employer was still
 using Python 2.3, and I wouldn't be surprised if 1.5.2 was still out in
 the wilds.  The transition to 3.x is more extreme, and lots of people
 will continue making do for years after any formal support is dropped.

There's nothing wrong with that. People using 1.5.2 today certainly
aren't asking for support, and people using 2.3 probably aren't
expecting much either. That's fine, those Python versions are as
stable as the rest of their environment. (I betcha they're still using
GCC 2.96 too, though they probably don't have any reason to build a
new Python binary from source. :-)

People *will* be using 2.6 well past 2013. But will they care about
the Python community actively supporting it? Of course not! Anything
we did would probably break something for them.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Rethinking intern() and its data structure

2009-04-09 Thread John Arbash Meinel

...
 Somewhat true, though I know it happens 25k times during startup of
 bzr... And I would be a *lot* happier if startup time was 100ms instead
 of 400ms.
 
 I don't want to quash your idealism too severely, but it is extremely
 unlikely that you are going to get anywhere near that kind of speed up
 by tweaking string interning.  25k times doing anything (computation)
 just isn't all that much.
 
 $ python -mtimeit -s 'd=dict.fromkeys(xrange(10**7))' 'for x in
 xrange(25000): d.get(x)'
 100 loops, best of 3: 8.28 msec per loop
 
 Perhaps this isn't representative (int hashing is ridiculously cheap,
 for instance), but the dict itself is far bigger than the dict you are
 dealing with and such would have similar cache-busting properties.  And
 yet, 25k accesses (plus python-c dispatching costs which you are paying
 with interning) consume only ~10ms.  You could do more good by
 eliminating a handful of disk seeks by reducing the number of imported
 modules...
 
 -Mike
 

You're also using timeit over the same set of 25k keys, which means it
only has to load that subset. And as you are using identical runs each
time, those keys are already loaded into your cache lines... And given
how hash(int) works, they are all sequential in memory, and all 10M in
your original set have 0 collisions. Actually, at 10M, you'll have a
dict of size 20M entries, and the first 10M entries will be full, and
the trailing 10M entries will all be empty.

That said, you're right, the benefits of a smaller structure are going
to be small. I'll just point that if I just do a small tweak to your
timing and do:

$ python -mtimeit -s 'd=dict.fromkeys(xrange(10**7))' 'for x in
  xrange(25000): d.get(x)'
100 loops, best of 3: 6.27 msec per loop

So slightly faster than yours, *but*, lets try a much smaller dict:

$ python -mtimeit -s 'd=dict.fromkeys(xrange(25000))' 'for x in
  xrange(25000): d.get(x)'
100 loops, best of 3: 6.35 msec per loop

Pretty much the same time. Well within the noise margin. But if I go
back to the big dict and actually select 25k keys across the whole set:

$ TIMEIT -s 'd=dict.fromkeys(xrange(10**7))' \
 -s 'keys=range(0, 10**7, 10**7/25000)' \
 'for x in keys: d.get(x)'
100 loops, best of 3: 13.1 msec per loop

Now I'm still accessing 25k keys, but I'm doing it across the whole
range, and suddenly the time *doubled*.

What about slightly more random access:
$ TIMEIT -s 'import random; d=dict.fromkeys(xrange(10**7))' \
 -s 'bits = range(0, 10**7, 400); random.shuffle(bits)' \
 'for x in bits: d.get(x)'
100 loops, best of 3: 15.5 msec per loop

Not as big of a difference as I thought it would be... But I bet if
there was a way to put the random shuffle in the inner loop, so you
weren't accessing the same identical 25k keys internally, you might get
more interesting results.

As for other bits about exercising caches:

$ shuffle(range(0, 10**7, 400))
100 loops, best of 3: 15.5 msec per loop

$ shuffle(range(0, 10**7, 40))
10 loops, best of 3: 175 msec per loop

10x more keys, costs 11.3x, pretty close to linear.

$ shuffle(range(0, 10**7, 10))
10 loops, best of 3: 739 msec per loop

4x the keys, 4.5x the time, starting to get more into nonlinear effects.

Anyway, you're absolutely right. intern() overhead is a tiny fraction of
'import bzrlib.*' time, so I don't expect to see amazing results. That
said, accessing 25k keys in a smaller structure is 2x faster than
accessing 25k keys spread across a larger structure.

John
=:->
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread glyph


On 02:38 am, ba...@python.org wrote:
So, what I'm really asking is this.  Let's say you agree that there 
are use cases for accessing a header value as either the raw encoded 
bytes or the decoded unicode.  What should this return:


 message['Subject']

The raw bytes or the decoded unicode?


My personal preference would be to just deprecate this API and get 
rid of it, replacing it with a slightly more explicit one.


   message.headers['Subject']
   message.bytes_headers['Subject']
Now, setting headers.  Sometimes you have some unicode thing and 
sometimes you have some bytes.  You need to end up with bytes in the 
ASCII range and you'd like to leave the header value unencoded if so. 
But in both cases, you might have bytes or characters outside that 
range, so you need an explicit encoding, defaulting to utf-8 probably.


   message.headers['Subject'] = 'Some text'

should be equivalent to

   message.headers['Subject'] = Header('Some text')

My preference would be that

   message.headers['Subject'] = b'Some Bytes'

would simply raise an exception.  If you've got some bytes, you should 
instead do


   message.bytes_headers['Subject'] = b'Some Bytes'

or

   message.headers['Subject'] = Header(bytes=b'Some Bytes', 
encoding='utf-8')


Explicit is better than implicit, right?
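A minimal sketch of what that split could look like. Everything here is hypothetical: the names `Header`, `headers`, and `bytes_headers` come from the proposal above, not from any real email API, and the storage model is invented for illustration:

```python
class Header:
    """Hypothetical header value that keeps both text and bytes forms."""
    def __init__(self, text=None, *, bytes=None, encoding='utf-8'):
        if text is not None:
            self.text = text
            self.bytes = text.encode(encoding)
        elif bytes is not None:
            self.bytes = bytes
            self.text = bytes.decode(encoding)
        else:
            raise TypeError("need text or bytes")

class Headers:
    """Text view: accepts str or Header, rejects raw bytes."""
    def __init__(self, message):
        self._message = message
    def __setitem__(self, name, value):
        if isinstance(value, bytes):
            raise TypeError("raw bytes not allowed here; use bytes_headers "
                            "or Header(bytes=..., encoding=...)")
        if isinstance(value, str):
            value = Header(value)
        self._message._store[name] = value
    def __getitem__(self, name):
        return self._message._store[name].text

class BytesHeaders:
    """Bytes view over the same underlying storage."""
    def __init__(self, message):
        self._message = message
    def __setitem__(self, name, value):
        self._message._store[name] = Header(bytes=value)
    def __getitem__(self, name):
        return self._message._store[name].bytes

class Message:
    def __init__(self):
        self._store = {}
        self.headers = Headers(self)
        self.bytes_headers = BytesHeaders(self)

msg = Message()
msg.headers['Subject'] = 'Some text'
print(msg.bytes_headers['Subject'])   # b'Some text'
```

Assigning `b'Some Bytes'` through `msg.headers` raises TypeError, as the proposal suggests, while the same bytes go through `msg.bytes_headers` untouched.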


Re: [Python-Dev] Dropping bytes support in json

2009-04-09 Thread glyph


On 03:21 am, ncogh...@gmail.com wrote:

Barry Warsaw wrote:



I don't know whether the parameter thing will work or not, but you're
probably right that we need to get the bytes-everywhere API first.



Given that json is a wire protocol, that sounds like the right approach
for json as well. Once bytes-everywhere works, then a text API can be
built on top of it, but it is difficult to build a bytes API on top of
a text one.


I wish I could agree, but JSON isn't really a wire protocol.  According 
to http://www.ietf.org/rfc/rfc4627.txt, JSON is "a text format for the 
serialization of structured data".  There are some notes about encoding, 
but it is very clearly described in terms of unicode code points.

So I guess the IO library *is* the right model: bytes at the bottom of
the stack, with text as a wrapper around it (mediated by codecs).


In email's case this is true, but in JSON's case it's not.  JSON is a 
format defined as a sequence of code points; MIME is defined as a 
sequence of octets.
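That layering, octets at the bottom and code points on top, is straightforward to express explicitly. A hedged sketch (the sample payload is invented; note that later Pythons also let json.loads accept UTF-encoded bytes directly, using the RFC's null-pattern detection):

```python
import json

# "Bytes at the bottom of the stack, with text as a wrapper":
# decode the octets explicitly, then hand code points to the JSON parser.
octets = b'{"name": "caf\xc3\xa9"}'   # UTF-8 encoded wire data

text = octets.decode('utf-8')          # bytes layer -> text layer
data = json.loads(text)                # JSON is defined over code points
print(data['name'])
```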



Re: [Python-Dev] [Email-SIG] Dropping bytes support in json

2009-04-09 Thread Stephen J. Turnbull
Barry Warsaw writes:

  There are really two ways to look at an email message.  It's either an  
  unstructured blob of bytes, or it's a structured tree of objects.

Indeed!

  Those objects have headers and payload.  The payload can be of any  
  type, though I think it generally breaks down into strings for text/ 
  * types and bytes for anything else (not counting multiparts).

*sigh*  Why are you back-tracking?

The payload should be of an appropriate *object* type.  Atomic object
types will have their content stored as string or bytes [nb I use
Python 3 terminology throughout].  Composite types (multipart/*) won't
need string or bytes attributes AFAICS.

Start by implementing the application/octet-stream and
text/plain;charset=utf-8 object types, of course.

  It does seem to make sense to think about headers as text header names  
  and text header values.

I disagree.  IMHO, structured header types should have object values,
and something like

message['to'] = "Barry 'da FLUFL' Warsaw <ba...@python.org>"

should be smart enough to detect that it's a string and attempt to
(flexibly) parse it into a fullname and a mailbox adding escapes, etc.
Whether these should be structured objects or they can be strings or
bytes, I'm not sure (probably bytes, not strings, though -- see next
example).  OTOH

message['to'] = b'''Barry 'da.FLUFL' Warsaw <ba...@python.org>'''

should assume that the client knows what they are doing, and should
parse it strictly (and I mean be a real bastard, eg, raise an
exception on any non-ASCII octet), merely dividing it into fullname
and mailbox, and caching the bytes for later insertion in a
wire-format message.

  In that case, I think you want the values as unicodes, and probably  
  the headers as unicodes containing only ASCII.  So your table would be  
  strings in both cases.  OTOH, maybe your application cares about the  
  raw underlying encoded data, in which case the header names are  
  probably still strings of ASCII-ish unicodes and the values are  
  bytes.  It's this distinction (and I think the competing use cases)  
  that make a true Python 3.x API for email more complicated.

I don't see why you can't have the email API be specific, with
message['to'] always returning a structured_header object (or maybe
even more specifically an address_header object), and methods like

message['to'].build_header_as_text()

which returns

To: "Barry 'da.FLUFL' Warsaw" <ba...@python.org>

and

message['to'].build_header_in_wire_format()

which returns

b'''To: "Barry 'da.FLUFL' Warsaw" <ba...@python.org>'''

Then have email.textview.Message and email.wireview.Message which
provide a simple interface where message['to'] would invoke
.build_header_as_text() and .build_header_in_wire_format()
respectively.
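A sketch of the structured-header idea. The class and method names (build_header_as_text, build_header_in_wire_format) are taken from the proposal above, not from the actual email package, and the mailbox address is a made-up placeholder since the real one is elided in the thread:

```python
class AddressHeader:
    """Hypothetical structured value for an address header like To:."""
    def __init__(self, fullname, mailbox):
        self.fullname = fullname
        self.mailbox = mailbox

    def build_header_as_text(self):
        # Text view: fullname quoted because it may contain specials
        return 'To: "%s" <%s>' % (self.fullname, self.mailbox)

    def build_header_in_wire_format(self):
        # Wire view: be a real bastard -- pure ASCII only, so any
        # non-ASCII character raises UnicodeEncodeError here
        return self.build_header_as_text().encode('ascii')

hdr = AddressHeader("Barry 'da.FLUFL' Warsaw", "barry@example.org")
print(hdr.build_header_as_text())
print(hdr.build_header_in_wire_format())
```

A textview/wireview Message would then just pick which of the two builders __getitem__ delegates to.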

  Thinking about this stuff makes me nostalgic for the sloppy happy days  
  of Python 2.x

Er, yeah.

Nostalgic-for-the-BITNET-days-where-everything-was-Just-EBCDIC-ly y'rs,


Re: [Python-Dev] decorator module in stdlib?

2009-04-09 Thread Daniel Fetchinson
 Then perhaps you misunderstand the goal of the decorator module.
 The raison d'etre of the module is to PRESERVE the signature:
 update_wrapper unfortunately *changes* it.

 When confronted with a library which I do not know, I often run
 pydoc, sphinx, or a custom-made documentation tool over it to extract
 the signatures of its functions.

 Ah, I see. Personally I rarely trust automatically extracted
 documentation -- too often in my experience it is out of date or
 simply absent. Extracting the signatures in theory wouldn't lie, but
 in practice I still wouldn't trust it -- not only because of what
 decorators might or might not do, but because it might still be
 misleading. Call me old-fashioned, but I prefer to read the source
 code.

  For instance, if I see a method
 get_user(self, username) I have a good hint about what it is supposed
 to do. But if the library (say a web framework) uses non
 signature-preserving
 decorators, my documentation tool says to me that there is function
 get_user(*args, **kwargs) which frankly is not enough [this is the
 optimistic case, when the author of the decorator has taken care
 to preserve the name of the original function].

 But seeing the decorator is often essential for understanding what
 goes on! Even if the decorator preserves the signature (in truth or
 according to inspect), many decorators *do* something, and it's important
 to know how a function is decorated. For example, I work a lot with a
 small internal framework at Google whose decorators can raise
 exceptions and set instance variables; they also help me understand
 under which conditions a method can be called.

  I *hate* losing information about the true signature of functions,
 since I also use IPython, Python help, etc. a lot.

 I guess we just have different styles. That's fine.

 I must admit that while I still like decorators, I do like them as
 much as in the past.

 Of course there was a missing NOT in this sentence, but you all understood
 the intended meaning.

 (All this BTW is not to say that I don't trust you with commit
 privileges if you were to be interested in contributing. I just don't
 think that adding that particular decorator module to the stdlib would
 be wise. It can be debated though.)

 Fine. As I have repeated many time that particular module was never
 meant for inclusion in the standard library.

 Then perhaps it shouldn't -- I haven't looked but if you don't plan
 stdlib inclusion it is often the case that the API style and/or
 implementation details make stdlib inclusion unrealistic. (Though
 admittedly some older modules wouldn't be accepted by today's
 standards either -- and I'm not just talking PEP-8 compliance! :-)

 But I feel strongly about
 the possibility of being able to preserve (not change!) the function
 signature.

 That could be added to functools if enough people want it.

My original suggestion for inclusion in stdlib was motivated by this
reason alone: I'd like to see an official one way of preserving
function signatures by decorators. If there are better ways of doing
it than the decorator module, that's totally fine, but there should be
one.
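For the record, this is roughly how it resolved in later Pythons: functools.wraps sets __wrapped__ on the wrapper (3.2+), and inspect.signature (3.3+) follows that attribute, so a wraps-decorated wrapper reports the original signature even though it is defined as (*args, **kwargs). At the time of this thread, update_wrapper copied only __name__, __doc__, etc. A hedged sketch with a made-up get_user example:

```python
import functools
import inspect

def logged(func):
    # functools.wraps copies metadata AND sets wrapper.__wrapped__ = func,
    # which inspect.signature follows when introspecting the wrapper.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print("calling", func.__name__)
        return func(*args, **kwargs)
    return wrapper

@logged
def get_user(self, username):
    return username

# Despite wrapper's (*args, **kwargs) definition, the original
# signature is reported:
print(inspect.signature(get_user))   # (self, username)
```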


Cheers,
Daniel

 I do not think anybody disagrees with your point here. My point still
 stands, though: objects should not lie about their signature, especially
 during debugging and when generating documentation from code.

 Source code never lies. Debuggers should make access to the source
 code a key point. And good documentation should be written by a human,
 not automatically cobbled together from source code and a few doc
 strings.




-- 
Psss, psss, put it down! - http://www.cafepress.com/putitdown