Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread "Martin v. Löwis"
Stephen J. Turnbull wrote:
> So there is a standard for the UTF-8 signature, and I know of
> applications which produce it.  While I agree with you that Python's
> codecs shouldn't produce it (by default), providing an option to strip
> is a good idea.

I would personally like to see an "utf-8-bom" codec (perhaps better
named "utf-8-sig"), which strips the BOM on reading (if present)
and generates it on writing.

> However, this option should be part of the initialization of an IO
> stream which produces Unicodes, _not_ an operation on arbitrary
> internal strings (whether raw or Unicode).

With the UTF-8-SIG codec, it would apply to all operation modes of
the codec, whether stream-based or from strings. Whether or not to
use the codec would be the application's choice.

Regards,
Martin
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread M.-A. Lemburg
Martin v. Löwis wrote:
> Stephen J. Turnbull wrote:
> 
>> So there is a standard for the UTF-8 signature, and I know of
>> applications which produce it.  While I agree with you that Python's
>> codecs shouldn't produce it (by default), providing an option to strip
>> is a good idea.
> 
> I would personally like to see an "utf-8-bom" codec (perhaps better
> named "utf-8-sig"), which strips the BOM on reading (if present)
> and generates it on writing.

+1.

>> However, this option should be part of the initialization of an IO
>> stream which produces Unicodes, _not_ an operation on arbitrary
>> internal strings (whether raw or Unicode).
> 
> 
> With the UTF-8-SIG codec, it would apply to all operation modes of
> the codec, whether stream-based or from strings. Whether or not to
> use the codec would be the application's choice.

I'd suggest to use the same mode of operation as we have in
the UTF-16 codec: it removes the BOM mark on the first call
to the StreamReader .decode() method and writes a BOM mark
on the first call to .encode() on a StreamWriter.

Note that the UTF-16 codec is strict w/r to the presence
of the BOM mark: you get a UnicodeError if a stream does
not start with a BOM mark. For the UTF-8-SIG codec, this
should probably be relaxed to not require the BOM.
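
A minimal sketch of the relaxed behaviour proposed here, assuming a Python 3 interpreter in which a codec is registered under the name "utf-8-sig" (at the time of this thread it was only a proposal). The variable names are illustrative only.

```python
import codecs

# The same text, once with and once without the UTF-8 signature.
data_with_sig = codecs.BOM_UTF8 + "Hallo".encode("utf-8")
data_plain = "Hallo".encode("utf-8")

# Decoding strips the signature if present and does not require it.
decoded_with_sig = data_with_sig.decode("utf-8-sig")
decoded_plain = data_plain.decode("utf-8-sig")

# Encoding writes the signature in front of the data.
encoded = "Hallo".encode("utf-8-sig")

# The plain utf-8 codec leaves the signature in place as U+FEFF.
plain_utf8_view = data_with_sig.decode("utf-8")
```

Whether to strip or keep the signature is then decided by which codec name the application passes.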

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
M.-A. Lemburg wrote:
>> [...]
>> With the UTF-8-SIG codec, it would apply to all operation modes of
>> the codec, whether stream-based or from strings. Whether or not to
>> use the codec would be the application's choice.
>
> I'd suggest to use the same mode of operation as we have in
> the UTF-16 codec: it removes the BOM mark on the first call
> to the StreamReader .decode() method and writes a BOM mark
> on the first call to .encode() on a StreamWriter.
>
> Note that the UTF-16 codec is strict w/r to the presence
> of the BOM mark: you get a UnicodeError if a stream does
> not start with a BOM mark. For the UTF-8-SIG codec, this
> should probably be relaxed to not require the BOM.

I've started writing such a codec. Making the BOM optional on decoding
definitely simplifies the implementation.

Bye,
   Walter Dörwald


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread M.-A. Lemburg
Stephen J. Turnbull wrote:
>>"MAL" == M.-A. Lemburg <[EMAIL PROTECTED]> writes:
> 
> 
> MAL> The BOM (byte order mark) was a non-standard Microsoft
> MAL> invention to detect Unicode text data as such (MS always uses
> MAL> UTF-16-LE for Unicode text files).
> 
> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds
> them to existing UTF-8 files lacking them.

Is that a MS application ? AFAIK, notepad, wordpad and MS Office
always use UTF-16-LE + BOM when saving text as "Unicode text".

> MAL> -1; there's no standard for UTF-8 BOMs - adding it to the
> MAL> codecs module was probably a mistake to begin with. You
> MAL> usually only get UTF-8 files with BOM marks as the result of
> MAL> recoding UTF-16 files into UTF-8.
> 
> There is a standard for UTF-8 _signatures_, however.  I don't have the
> most recent version of the ISO-10646 standard, but Amendment 2 (which
> defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to
> Annex F of that standard.  Evan quotes Version 4 of the Unicode
> standard, which explicitly defines the UTF-8 signature.

Ok, as signature the BOM does make some sense - whether to
strip signatures from a document is a good idea or not
is a different matter, though.

Here's the Unicode Cons. FAQ on the subject:

http://www.unicode.org/faq/utf_bom.html#22

They also explicitly warn about adding BOMs to UTF-8 data
since it can break applications and protocols that do not
expect such a signature.
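
As a small illustration of how a signature can break a consumer that inspects the first bytes, here is a sketch in Python 3, assuming the "utf-8-sig" codec name is available. The shell-script content is a made-up example, not taken from the FAQ.

```python
# A consumer that looks for "#!" at the very start of the data.
script = "#!/bin/sh\necho hi\n"

without_signature = script.encode("utf-8")
with_signature = script.encode("utf-8-sig")   # prepends EF BB BF

looks_like_script = without_signature.startswith(b"#!")
broken_by_signature = not with_signature.startswith(b"#!")
```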

> So there is a standard for the UTF-8 signature, and I know of
> applications which produce it.  While I agree with you that Python's
> codecs shouldn't produce it (by default), providing an option to strip
> is a good idea.
> 
> However, this option should be part of the initialization of an IO
> stream which produces Unicodes, _not_ an operation on arbitrary
> internal strings (whether raw or Unicode).

Right.

> MAL> BTW, how do you know that s came from the start of a file and
> MAL> not from slicing some already loaded file somewhere in the
> MAL> middle ?
> 
> The programmer or the application might, but Python's codecs don't.
> The point is that this is also true of rawstrings that happen to
> contain UTF-16 or UTF-32 data.  The UTF-16 ("auto-endian") codec
> shouldn't strip leading BOMs either, unless it has been told it has
> the beginning of the string.

The UTF-16 stream codecs implement this logic.

The UTF-16 encode and decode functions will however always strip
the BOM mark from the beginning of a string.

If the application doesn't want this stripping to happen,
it should use the UTF-16-LE or -BE codec resp.
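
The difference between the auto-detecting codec and the fixed-endian codecs can be sketched as follows, assuming Python 3 (the thread itself concerns Python 2 byte strings). Variable names are illustrative.

```python
import codecs

# Little-endian BOM followed by "ab" in UTF-16-LE.
raw = codecs.BOM_UTF16_LE + "ab".encode("utf-16-le")

# "utf-16" consumes the BOM to pick the byte order and drops it.
auto = raw.decode("utf-16")

# "utf-16-le" treats the BOM as ordinary data (U+FEFF) and keeps it.
explicit = raw.decode("utf-16-le")
```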

> MAL> Evan Jones wrote:
> 
> >> This is *not* a valid Unicode character. The Unicode
> >> specification (version 4, section 15.8) says the following
> >> about non-characters:
> >> 
> >>> Applications are free to use any of these noncharacter code
> >>> points internally but should never attempt to exchange
> >>> them. If a noncharacter is received in open interchange, an
> >>> application is not required to interpret it in any way. It is
> >>> good practice, however, to recognize it as a noncharacter and
> >>> to take appropriate action, such as removing it from the
> >>> text. Note that Unicode conformance freely allows the removal
> >>> of these characters. (See C10 in Section 3.2, Conformance
> >>> Requirements.)
> >> 
> >> My interpretation of the specification means that Python should
> 
> The specification _permits_ silent removal; it does not recommend.
> 
> >> silently remove the character, resulting in a zero length
> >> Unicode string.  Similarly, both of the following lines should
> >> also result in a zero length Unicode string:
> 
> >> >>> '\xff\xfe\xfe\xff'.decode( "utf16" )
> >> u'\ufffe'
> >> >>> '\xff\xfe\xff\xff'.decode( "utf16" )
> >> u'\uffff'
> 
> I strongly disagree; these decisions should be left to a higher layer.
> In the case of specified UTFs, the codecs should simply invert the UTF
> to Python's internal encoding.
> 
> MAL> Hmm, wouldn't it be better to raise an error ? After all, a
> MAL> reversed BOM mark in the stream looks a lot like you're
> MAL> trying to decode a UTF-16 stream assuming the wrong byte
> MAL> order ?!
> 
> +1 on (optionally) raising an error. 

The advantage of raising an error is that the application
can deal with the situation in whatever way seems fit (by
registering a special error handler or by simply using
"ignore" or "replace").
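
A hedged sketch of the three options mentioned above, written for Python 3: the default strict mode raises, the built-in "ignore" and "replace" handlers recover, and an application can register its own handler. The handler name "demo-mark" and the function mark_bad are hypothetical names chosen for this example.

```python
import codecs

bad = b"ab\xffcd"   # 0xFF can never start a UTF-8 sequence

# Strict decoding raises, so the application decides what happens next.
try:
    bad.decode("utf-8")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

ignored = bad.decode("utf-8", "ignore")
replaced = bad.decode("utf-8", "replace")

def mark_bad(exc):
    """Hypothetical handler: substitute '?' and resume after the bad bytes."""
    if isinstance(exc, UnicodeDecodeError):
        return ("?", exc.end)
    raise exc

codecs.register_error("demo-mark", mark_bad)
custom = bad.decode("utf-8", "demo-mark")
```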

I agree that much of this lies outside the scope of codecs
and should be handled at an application or protocol level.

> -1 on removing it or anything
> like that, unless under control of the application (ie, the program
> written in Python, not Python itself).  It's far too easy for software
> to generate broken Unicode streams[1], and the choice of how to deal
> with those should be with the application, not with the im

Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:

Martin> Stephen J. Turnbull wrote:

>> However, this option should be part of the initialization of an
>> IO stream which produces Unicodes, _not_ an operation on
>> arbitrary internal strings (whether raw or Unicode).

Martin> With the UTF-8-SIG codec, it would apply to all operation
Martin> modes of the codec, whether stream-based or from strings.

I had in mind the ability to treat a string as a stream.

Martin> Whether or not to use the codec would be the application's
Martin> choice.

What I think should be provided is a stateful object encapsulating the
codec.  Ie, to avoid the need to write

out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")
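
One way such a stateful object can look is sketched below, assuming Python 3 and the incremental encoder interface of the codecs module (which did not exist when this message was written). The encoder remembers whether the signature has already been emitted, so the caller does not have to special-case the first chunk.

```python
import codecs

chunks = ["Hal", "lo"]

# One encoder object carries the "signature already written" state.
encoder = codecs.getincrementalencoder("utf-8-sig")()
out = b"".join(encoder.encode(chunk) for chunk in chunks)
out += encoder.encode("", final=True)
```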



-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
>>"MAL" == M.-A. Lemburg <[EMAIL PROTECTED]> writes:

MAL> Stephen J. Turnbull wrote:

>> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it
>> even adds them to existing UTF-8 files lacking them.

MAL> Is that a MS application ? AFAIK, notepad, wordpad and MS
MAL> Office always use UTF-16-LE + BOM when saving text as "Unicode
MAL> text".

Yes, it is an MS application.  I'll have to borrow somebody's box to
check, but IIRC UTF-8 is the native "text" encoding for Japanese now.
(Japanized applications generally behave differently from everything
else, as there are so many "standards" for encoding Japanese.)

MAL> The UTF-16 stream codecs implement this logic.

MAL> The UTF-16 encode and decode functions will however always
MAL> strip the BOM mark from the beginning of a string.

MAL> If the application doesn't want this stripping to happen, it
MAL> should use the UTF-16-LE or -BE codec resp.

That sounds like it would work fine almost all the time.  If it
doesn't it's straightforward to work around, and certainly would be
more convenient for the non-standards-geek programmer.


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] Mail.python.org

2005-04-05 Thread Skip Montanaro

Grant> Not a big deal, but I noticed that https://mail.python.org/ is
Grant> live and shows a generic "Welcome to your new home in
Grant> cyberspace!" message.  One of the webmasters may want to
Grant> automatically redirect to http://mail.python.org.

Thanks, I forwarded this along to the folks who can deal with this.

Skip


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
> Martin> With the UTF-8-SIG codec, it would apply to all operation
> Martin> modes of the codec, whether stream-based or from strings.
>
> I had in mind the ability to treat a string as a stream.

Hmm. A string is not a stream, but it could be the contents of a stream.
A typical application of codecs goes like this:

data = stream.read()
[analyze data, e.g. by checking whether there is encoding= in 

So people do use the "decode-it-all" mode, where no sequential access
is necessary - yet the beginning of the string is still the beginning of
what once was a stream. This case must be supported.

> Martin> Whether or not to use the codec would be the application's
> Martin> choice.
>
> What I think should be provided is a stateful object encapsulating the
> codec.  Ie, to avoid the need to write
>
> out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")

No. People who want streaming should use cStringIO, i.e.

>>> s=cStringIO.StringIO()
>>> s1=codecs.getwriter("utf-8")(s)
>>> s1.write(u"Hallo")
>>> s.getvalue()
'Hallo'
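
For comparison, a sketch of the same pattern with a signature-writing codec, assuming Python 3 (io.BytesIO in place of cStringIO) and the "utf-8-sig" codec name. It is meant to show that the stream writer emits the signature once, however many writes follow.

```python
import codecs
import io

buf = io.BytesIO()
writer = codecs.getwriter("utf-8-sig")(buf)

# Two separate writes; only the first one is preceded by the signature.
writer.write("Hal")
writer.write("lo")

result = buf.getvalue()
```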
Regards,
Martin


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Walter Dörwald said:

> M.-A. Lemburg wrote:
>
>>> [...]
>>> With the UTF-8-SIG codec, it would apply to all operation
>>> modes of the codec, whether stream-based or from strings. Whether
>>> or not to use the codec would be the application's choice.
>>
>> I'd suggest to use the same mode of operation as we have in
>> the UTF-16 codec: it removes the BOM mark on the first call
>> to the StreamReader .decode() method and writes a BOM mark
>> on the first call to .encode() on a StreamWriter.
>>
>> Note that the UTF-16 codec is strict w/r to the presence
>> of the BOM mark: you get a UnicodeError if a stream does
>> not start with a BOM mark. For the UTF-8-SIG codec, this
>> should probably be relaxed to not require the BOM.
>
> I've started writing such a codec. Making the BOM optional
> on decoding definitely simplifies the implementation.

OK, here is the patch: http://www.python.org/sf/1177307

The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains "ab", the user will
never see these two characters.

A solution for this would be to add an argument named final to
the decode and read methods that tells the decoder that the
stream has ended and the remaining buffered bytes have to be
handled now.
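
The effect of such a flag can be sketched with the incremental decoder interface found in current Python 3 releases, which takes a final argument of this kind. This is an assumption about a later API, not the patch discussed here.

```python
import codecs

# An incomplete sequence is held back until more bytes arrive.
decoder = codecs.getincrementaldecoder("utf-8")()
held_back = decoder.decode(b"\xc3")                 # first half of "ö"
completed = decoder.decode(b"\xb6", final=True)     # second half, stream ends

# With final=True, leftover bytes go through normal error handling.
other = codecs.getincrementaldecoder("utf-8")()
try:
    other.decode(b"\xc3", final=True)
    truncated_error = False
except UnicodeDecodeError:
    truncated_error = True
```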

Bye,
   Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Evan Jones
On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
> The stateful decoder has a little problem: At least three bytes
> have to be available from the stream until the StreamReader
> decides whether these bytes are a BOM that has to be skipped.
> This means that if the file only contains "ab", the user will
> never see these two characters.

Shouldn't the decoder be capable of doing a partial match and quitting
early? After all, "ab" is encoded in UTF8 as <61> <62> but the BOM is
<EF> <BB> <BF>. If it did this type of partial matching, this issue
would be avoided except in rare situations.

> A solution for this would be to add an argument named final to
> the decode and read methods that tells the decoder that the
> stream has ended and the remaining buffered bytes have to be
> handled now.

This functionality is provided by a flush() method on similar objects,
such as the zlib compression objects.
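
For reference, the zlib pattern mentioned above looks roughly like this in Python 3. The payload is an arbitrary example.

```python
import zlib

payload = b"hello " * 100

compressor = zlib.compressobj()
part = compressor.compress(payload)   # may hold data back internally
tail = compressor.flush()             # forces out whatever is still buffered

restored = zlib.decompress(part + tail)
```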

Evan Jones


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Fred Drake
On Tuesday 05 April 2005 15:53, Evan Jones wrote:
 > This functionality is provided by a flush() method on similar objects,
 > such as the zlib compression objects.

Or by close() on other objects (htmllib, HTMLParser, the SAX incremental 
parser, etc.).

Too bad there's more than one way to do it.  :-(


  -Fred

-- 
Fred L. Drake, Jr.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Walter Dörwald wrote:
> The stateful decoder has a little problem: At least three bytes
> have to be available from the stream until the StreamReader
> decides whether these bytes are a BOM that has to be skipped.
> This means that if the file only contains "ab", the user will
> never see these two characters.

This can be improved, of course: If the first byte is "a", it most
definitely is *not* an UTF-8 signature.

So we only need a second byte for the characters between U+F000
and U+FFFF, and a third byte only for the characters
U+FEC0...U+FEFF. But with the first byte being \xef, we need
three bytes *anyway*, so we can always decide with the first
byte only whether we need to wait for three bytes.

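The byte ranges named above can be checked directly. The following Python 3 sketch encodes the boundary code points and inspects their leading bytes; the helper utf8() is a hypothetical convenience function.

```python
def utf8(code_point):
    """Hypothetical helper: UTF-8 bytes of a single code point."""
    return chr(code_point).encode("utf-8")

# U+F000..U+FFFF all start with 0xEF; U+EFFF just below does not.
starts_with_ef = [utf8(cp)[0] == 0xEF for cp in (0xF000, 0xFEFF, 0xFFFF)]
below_range = utf8(0xEFFF)[0]

# U+FEC0..U+FEFF share the two-byte prefix EF BB; U+FEBF does not.
prefix_fec0 = utf8(0xFEC0)[:2]
prefix_feff = utf8(0xFEFF)[:2]
prefix_febf = utf8(0xFEBF)[:2]

signature = utf8(0xFEFF)
```
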
> A solution for this would be to add an argument named final to
> the decode and read methods that tells the decoder that the
> stream has ended and the remaining buffered bytes have to be
> handled now.

Shouldn't an empty read from the underlying stream be taken
as an EOF?

Regards,
Martin


Re: [Python-Dev] longobject.c & ob_size

2005-04-05 Thread Tim Peters
[Michael Hudson]
> Asking mostly for curiosity, how hard would it be to have longs store
> their sign bit somewhere less aggravating?

Depends on where that is.

> It seems to me that the top bit of ob_digit[0] is always 0, for example,

Yes, the top bit of ob_digit[i], for all relevant i, is 0 on all platforms now.

> and while I'm sure this would result in no less convolution in
> longobject.c, it'd be considerably more localized convolution.

I'd much rather give struct _longobject a distinct sign member (say, 0
== zero, -1 == non-zero negative, 1 == non-zero positive).  That would
simplify code.  It would cost no extra bytes for some longs, and 8
extra bytes for others (since obmalloc rounds up to a multiple of 8);
I don't care about that (e.g., I never use millions of longs
simultaneously, but often use a few dozen very big longs
simultaneously; the memory difference is in the noise then).
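
The idea of a separate sign member can be modelled in a few lines of Python. This is only an illustrative model, not the C layout: the 15-bit digit width (SHIFT) and both function names are assumptions made for the sketch.

```python
SHIFT = 15                  # assumed digit width in bits
MASK = (1 << SHIFT) - 1

def to_sign_and_digits(n):
    """Split n into a sign in {-1, 0, 1} and little-endian magnitude digits."""
    sign = (n > 0) - (n < 0)
    magnitude = abs(n)
    digits = []
    while magnitude:
        digits.append(magnitude & MASK)
        magnitude >>= SHIFT
    return sign, digits

def from_sign_and_digits(sign, digits):
    """Inverse of to_sign_and_digits."""
    magnitude = 0
    for digit in reversed(digits):
        magnitude = (magnitude << SHIFT) | digit
    return sign * magnitude
```

With the sign kept apart, the digit list always describes a non-negative magnitude, which is the simplification argued for above.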

Note that longintrepr.h isn't included by Python.h.  Only longobject.h
is, and longobject.h doesn't reveal the internal structure of longs. 
IOW, changing the internal layout of longs shouldn't even hurt binary
compatibility.

The ob_size member of PyObject_VAR_HEAD would also be redeclared as
size_t in an ideal world.


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Martin v. Löwis said:
> Walter Dörwald wrote:
>> The stateful decoder has a little problem: At least three bytes
>> have to be available from the stream until the StreamReader
>> decides whether these bytes are a BOM that has to be skipped.
>> This means that if the file only contains "ab", the user will
>> never see these two characters.
>
> This can be improved, of course: If the first byte is "a",
> it most definitely is *not* an UTF-8 signature.
>
> So we only need a second byte for the characters between U+F000
> and U+FFFF, and a third byte only for the characters
> U+FEC0...U+FEFF. But with the first byte being  \xef, we need
> three bytes *anyway*, so we can always decide with the first
> byte only whether we need to wait for three bytes.

OK, I've updated the patch so that the first bytes will only be kept
in the buffer if they are a prefix of the BOM.
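
The prefix rule described above can be sketched as a small byte filter. This is not the code from the patch; the class SignatureSkipper and its feed() method are hypothetical names, and the final flag is modelled on the proposal earlier in the thread.

```python
import codecs

class SignatureSkipper:
    """Hypothetical helper: drops one leading UTF-8 signature from a byte stream."""

    def __init__(self):
        self.pending = b""
        self.decided = False

    def feed(self, chunk, final=False):
        if self.decided:
            return chunk
        data = self.pending + chunk
        # Keep bytes back only while they could still grow into the signature.
        if len(data) < 3 and codecs.BOM_UTF8.startswith(data) and not final:
            self.pending = data
            return b""
        self.decided = True
        self.pending = b""
        if data.startswith(codecs.BOM_UTF8):
            return data[3:]
        return data
```

With this rule a file containing only "ab" is passed through on the first call, because "a" is not a prefix of the signature.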

>> A solution for this would be to add an argument named final to
>> the decode and read methods that tells the decoder that the
>> stream has ended and the remaining buffered bytes have to be
>> handled now.
>
> Shouldn't an empty read from the underlying stream be taken
> as an EOF?

There are situations where the byte stream might be temporarily
exhausted, e.g. an XML parser that tries to support the
IncrementalParser interface, or when you want to decode
encoded data piecewise, because you want to give a progress
report.
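
A sketch of the piecewise case, assuming Python 3 and its incremental decoder interface: the data is decoded two bytes at a time, so one chunk boundary falls inside a multi-byte character, and a running byte count stands in for the progress report.

```python
import codecs

data = "D\u00f6rwald".encode("utf-8")   # 8 bytes; "ö" takes two of them
decoder = codecs.getincrementaldecoder("utf-8")()

pieces = []
bytes_seen = []
for start in range(0, len(data), 2):
    chunk = data[start:start + 2]
    pieces.append(decoder.decode(chunk))   # an empty chunk would not mean EOF here
    bytes_seen.append(start + len(chunk))  # progress so far

# Only the caller knows that the input has really ended.
pieces.append(decoder.decode(b"", final=True))
text = "".join(pieces)
```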

Bye,
   Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Walter Dörwald
Evan Jones said:
> On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
>> The stateful decoder has a little problem: At least three bytes
>> have to be available from the stream until the StreamReader
>> decides whether these bytes are a BOM that has to be skipped.
>> This means that if the file only contains "ab", the user will
>> never see these two characters.
>
> Shouldn't the decoder be capable of doing a partial match and quitting
> early? After all, "ab" is encoded in UTF8 as <61> <62> but the BOM is
> <EF> <BB> <BF>. If it did this type of partial matching, this issue
> would be avoided except in rare situations.
>
>> A solution for this would be to add an argument named final to
>> the decode and read methods that tells the decoder that the
>> stream has ended and the remaining buffered bytes have to be
>> handled now.
>
> This functionality is provided by a flush() method on similar objects,  such 
> as the zlib compression objects.

Theoretically the name is unimportant, but read(..., final=True) or flush()
or close() should subject the pending bytes to normal error handling and
must return the result of decoding these pending bytes just like the
other methods do. This would mean that we would have to implement
a decodeclose(), a readclose() and a readlineclose(). IMHO it would be
best to add this argument to decode, read and readline directly. But I'm
not sure, what this would mean for iterating through a StreamReader.

Bye,
Walter Dörwald





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread "Martin v. Löwis"
Walter Dörwald wrote:
> There are situations where the byte stream might be temporarily
> exhausted, e.g. an XML parser that tries to support the
> IncrementalParser interface, or when you want to decode
> encoded data piecewise, because you want to give a progress
> report.

Yes, but these are not file-like objects. In the IncrementalParser,
it is *not* the case that a read operation returns an empty
string. Instead, the application repeatedly feeds data explicitly.
For a file-like object, returning "" indicates EOF.
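
The file-like convention referred to here can be sketched with an in-memory stream in Python 3, where the empty result is b"" for a binary stream. The chunk size of 4 is arbitrary.

```python
import io

stream = io.BytesIO(b"abcdefghij")

chunks = []
while True:
    chunk = stream.read(4)
    if not chunk:          # empty result: the file-like object is at EOF
        break
    chunks.append(chunk)

after_eof = stream.read(4)  # further reads keep returning the empty value
```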
Regards,
Martin


[Python-Dev] Developer list update

2005-04-05 Thread Raymond Hettinger
FYI, I'm starting a project to see what has become of some of the
inactive developers.

Essentially, it involves sending them a note to see if they still have
use for their checkin permissions.  If not, then we can make the change
and improve security a bit.

Also, to help with institutional memory, I started a log of changes to
developer permissions.  The goal is to remember who was given access, by
whom, and why (some folks are given access for a one-shot project for
example).  The file is at Misc/developers.

The first entry is for Nick Coghlan who was just granted tracker
permissions so he can help manage outstanding bugs and patches.



Raymond Hettinger



Re: [Python-Dev] Developer list update

2005-04-05 Thread Fred Drake
On Tuesday 05 April 2005 06:47, Raymond Hettinger wrote:
 > Also, to help with institutional memory, I started a log of changes to
 > developer permissions.  The goal is to remember who was given access, by
 > whom, and why (some folks are given access for a one-shot project for
 > example).  The file is at Misc/developers.

Thanks, Raymond!

Would anyone here object to renaming the file to developers.txt, though?


  -Fred

-- 
Fred L. Drake, Jr.


Re: [Python-Dev] Developer list update

2005-04-05 Thread Barry Warsaw
On Tue, 2005-04-05 at 19:06, Fred Drake wrote:

> Would anyone here object to renaming the file to developers.txt, though?

+1, please!
-Barry





Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Stephen J. Turnbull
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:

Martin> So people do use the "decode-it-all" mode, where no
Martin> sequential access is necessary - yet the beginning of the
Martin> string is still the beginning of what once was a
Martin> stream. This case must be supported.

Of course it must be supported.  My point is that many strings (in my
applications, all but those strings that result from slurping in a
file or process output in one go -- example, not a statistically valid
sample!) are not the beginning of "what once was a stream".  It is
error-prone (not to mention unaesthetic) to not make that distinction.

"Explicit is better than implicit."

Martin> Whether or not to use the codec would be the application's
Martin> choice.

>> What I think should be provided is a stateful object
>> encapsulating the codec.  Ie, to avoid the need to write

>> out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")

Martin> No. People who want streaming should use cStringIO, i.e.

 >>> s=cStringIO.StringIO()
 >>> s1=codecs.getwriter("utf-8")(s)
 >>> s1.write(u"Hallo")
 >>> s.getvalue()
'Hallo'

Yes!  Exactly (except in reverse, we want to _read_ from the slurped
stream-as-string, not write to one)!  ... and there's no need for a
utf-8-sig codec for strings, since you can support the usage in
exactly this way.
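
The reading direction might look like this in Python 3, assuming the "utf-8-sig" codec name and io.BytesIO as the in-memory stream. The slurped bytes are wrapped in a stream and read through a StreamReader, which makes the "this is the start of a stream" decision explicit.

```python
import codecs
import io

# Bytes slurped from a file in one go, signature included.
slurped = codecs.BOM_UTF8 + "Hallo".encode("utf-8")

reader = codecs.getreader("utf-8-sig")(io.BytesIO(slurped))
text = reader.read()
```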

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] Developer list update

2005-04-05 Thread Tim Peters
[Fred Drake]
>> Would anyone here object to renaming the file to developers.txt, though?

[Barry Warsaw]
> +1, please!

I voted with my DOS box.


[Python-Dev] inconsistency when swapping obj.__dict__ with a dict-like object...

2005-04-05 Thread Alex A. Naanou
Hi!

here is a simple piece of code

---cut---
class Dict(dict):
    def __init__(self, dct={}):
        self._dict = dct
    def __getitem__(self, name):
        return self._dict[name]
    def __setitem__(self, name, value):
        self._dict[name] = value
    def __delitem__(self, name):
        del self._dict[name]
    def __contains__(self, name):
        return name in self._dict
    def __iter__(self):
        return iter(self._dict)

class A(object):
    def __new__(cls, *p, **n):
        o = object.__new__(cls)
        o.__dict__ = Dict()
        return o

a = A()
a.xxx = 123
print a.__dict__._dict
a.__dict__._dict['yyy'] = 321
print a.yyy

--uncut--


Here there are two problems, the first is minor, and it is that
anything assigned to the __dict__ attribute is checked to be a
descendant of the dict class (mixing this in does not seem to work)...
and the second problem is a real annoyance, it is that the mapping
protocol supported by the Dict object in the example above is not used
by the attribute access mechanics (the same thing that once happened
in exec)...
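
Both observations can be reproduced in a compact form. The sketch below assumes a current CPython 3 interpreter, where the behaviour appears to be the same as reported here for Python 2; the class names RecordingDict and Plain are hypothetical.

```python
import collections

class RecordingDict(dict):
    """Hypothetical dict subclass that records calls to its overridden methods."""
    def __init__(self):
        super().__init__()
        self.calls = []
    def __setitem__(self, key, value):
        self.calls.append(("set", key))
        super().__setitem__(key, value)
    def __getitem__(self, key):
        self.calls.append(("get", key))
        return super().__getitem__(key)

class Plain:
    pass

obj = Plain()
obj.__dict__ = RecordingDict()

# Attribute access stores into and reads from the underlying dict directly.
obj.xxx = 123
value = obj.xxx
calls_seen = list(obj.__dict__.calls)
stored_keys = list(dict.keys(obj.__dict__))

# A mapping that is not a dict subclass is rejected outright.
try:
    Plain().__dict__ = collections.UserDict()
    type_check_raised = False
except TypeError:
    type_check_raised = True
```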

P.S. (IMHO) the type check here is not that necessary (at least in its
current state), as what we need to assert is not the relation to the
dict class but the support of the mapping protocol

thanks.
-- 
Alex.


Re: [Python-Dev] inconsistency when swapping obj.__dict__ with a dict-like object...

2005-04-05 Thread Brett C.
Alex A. Naanou wrote:
> Hi!
> 
> here is a simple piece of code
> 
> ---cut---
> class Dict(dict):
>     def __init__(self, dct={}):
>         self._dict = dct
>     def __getitem__(self, name):
>         return self._dict[name]
>     def __setitem__(self, name, value):
>         self._dict[name] = value
>     def __delitem__(self, name):
>         del self._dict[name]
>     def __contains__(self, name):
>         return name in self._dict
>     def __iter__(self):
>         return iter(self._dict)
> 
> class A(object):
>     def __new__(cls, *p, **n):
>         o = object.__new__(cls)
>         o.__dict__ = Dict()
>         return o
> 
> a = A()
> a.xxx = 123
> print a.__dict__._dict
> a.__dict__._dict['yyy'] = 321
> print a.yyy
> 
> --uncut--
> 
> 
> Here there are two problems, the first is minor, and it is that
> anything assigned to the __dict__ attribute is checked to be a
> descendant of the dict class (mixing this in does not seem to work)...
> and the second problem is a real annoyance, it is that the mapping
> protocol supported by the Dict object in the example above is not used
> by the attribute access mechanics (the same thing that once happened
> in exec)...
> 

Actually, overriding __getattribute__() does work; __getattr__() and
__getitem__() don't.  This was brought up last month at some point without
any resolution (I think Steve Bethard pointed it out).

> P.S. (IMHO) the type check here is not that necessary (at least in its
> current state), as what we need to assert is not the relation to the
> dict class but the support of the mapping protocol
> 

Semantically necessary, no.  But simplicity- and performance-wise, maybe.  If
you grep around in Objects/classobject.c, for instance, you will see
PyClassObject.cl_dict is accessed using PyDict_GetItem() and I spotted at least
one use of PyDict_DelItem().  To use the mapping protocol would require
changing all of these to PyObject_GetItem() and such.

Which will be a performance penalty compared to PyDict_GetItem().  So the
question is whether the flexibility is worth it.

-Brett


Re: [Python-Dev] Unicode byte order mark decoding

2005-04-05 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
> Of course it must be supported.  My point is that many strings (in my
> applications, all but those strings that result from slurping in a
> file or process output in one go -- example, not a statistically valid
> sample!) are not the beginning of "what once was a stream".  It is
> error-prone (not to mention unaesthetic) to not make that distinction.
>
> "Explicit is better than implicit."

I can't put these two paragraphs together. If you think that explicit
is better than implicit, why do you not want to make different calls
for the first chunk of a stream, and the subsequent chunks?

>>  >>> s=cStringIO.StringIO()
>>  >>> s1=codecs.getwriter("utf-8")(s)
>>  >>> s1.write(u"Hallo")
>>  >>> s.getvalue()
>> 'Hallo'
>
> Yes!  Exactly (except in reverse, we want to _read_ from the slurped
> stream-as-string, not write to one)!  ... and there's no need for a
> utf-8-sig codec for strings, since you can support the usage in
> exactly this way.

However, if there is an utf-8-sig codec for streams, there is currently
no way of *preventing* this codec from also being available for strings.
The very same code is used for streams and for strings, and automatically
so.

Regards,
Martin