Re: [Python-Dev] Unicode byte order mark decoding
Stephen J. Turnbull wrote:
> So there is a standard for the UTF-8 signature, and I know of
> applications which produce it. While I agree with you that Python's
> codecs shouldn't produce it (by default), providing an option to strip
> is a good idea.

I would personally like to see an "utf-8-bom" codec (perhaps better
named "utf-8-sig"), which strips the BOM on reading (if present)
and generates it on writing.

> However, this option should be part of the initialization of an IO
> stream which produces Unicodes, _not_ an operation on arbitrary
> internal strings (whether raw or Unicode).

With the UTF-8-SIG codec, it would apply to all operation modes of
the codec, whether stream-based or from strings. Whether or not to
use the codec would be the application's choice.

Regards,
Martin
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Unicode byte order mark decoding
Martin v. Löwis wrote:
> Stephen J. Turnbull wrote:
>
>> So there is a standard for the UTF-8 signature, and I know of
>> applications which produce it. While I agree with you that Python's
>> codecs shouldn't produce it (by default), providing an option to strip
>> is a good idea.
>
> I would personally like to see an "utf-8-bom" codec (perhaps better
> named "utf-8-sig"), which strips the BOM on reading (if present)
> and generates it on writing.

+1.

>> However, this option should be part of the initialization of an IO
>> stream which produces Unicodes, _not_ an operation on arbitrary
>> internal strings (whether raw or Unicode).
>
> With the UTF-8-SIG codec, it would apply to all operation modes of
> the codec, whether stream-based or from strings. Whether or not to
> use the codec would be the application's choice.

I'd suggest to use the same mode of operation as we have in
the UTF-16 codec: it removes the BOM mark on the first call
to the StreamReader .decode() method and writes a BOM mark
on the first call to .encode() on a StreamWriter.

Note that the UTF-16 codec is strict w/r to the presence
of the BOM mark: you get a UnicodeError if a stream does
not start with a BOM mark. For the UTF-8-SIG codec, this
should probably be relaxed to not require the BOM.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Apr 05 2005)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
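
The behaviour proposed above (strip the signature on reading if it is present, do not require it, and write it on encoding) can be illustrated with a minimal sketch. It assumes a current Python 3 interpreter, where a codec named "utf-8-sig" is available in the standard library; the variable names are only illustrative and nothing here is taken from the patch discussed in this thread.

```python
import codecs

# Bytes as an application such as Notepad might write them: signature + text.
with_sig = codecs.BOM_UTF8 + "abc".encode("utf-8")

# The signature is stripped on decoding when present ...
stripped = with_sig.decode("utf-8-sig")

# ... and its absence is not an error (the "relaxed" behaviour).
no_sig = b"abc".decode("utf-8-sig")

# The plain utf-8 codec keeps the signature as U+FEFF.
kept = with_sig.decode("utf-8")

# Encoding with utf-8-sig writes the three signature bytes first.
encoded = "abc".encode("utf-8-sig")
```

Whether to use "utf-8-sig" or plain "utf-8" remains the application's choice, as argued above.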
Re: [Python-Dev] Unicode byte order mark decoding
M.-A. Lemburg wrote:
>> [...]
>> With the UTF-8-SIG codec, it would apply to all operation modes of
>> the codec, whether stream-based or from strings. Whether or not to
>> use the codec would be the application's choice.
>
> I'd suggest to use the same mode of operation as we have in
> the UTF-16 codec: it removes the BOM mark on the first call
> to the StreamReader .decode() method and writes a BOM mark
> on the first call to .encode() on a StreamWriter.
>
> Note that the UTF-16 codec is strict w/r to the presence
> of the BOM mark: you get a UnicodeError if a stream does
> not start with a BOM mark. For the UTF-8-SIG codec, this
> should probably be relaxed to not require the BOM.

I've started writing such a codec. Making the BOM optional
on decoding definitely simplifies the implementation.

Bye,
Walter Dörwald
Re: [Python-Dev] Unicode byte order mark decoding
Stephen J. Turnbull wrote:
>>"MAL" == M <[EMAIL PROTECTED]> writes:
>
>
> MAL> The BOM (byte order mark) was a non-standard Microsoft
> MAL> invention to detect Unicode text data as such (MS always uses
> MAL> UTF-16-LE for Unicode text files).
>
> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds
> them to existing UTF-8 files lacking them.
Is that a MS application ? AFAIK, notepad, wordpad and MS Office
always use UTF-16-LE + BOM when saving text as "Unicode text".
> MAL> -1; there's no standard for UTF-8 BOMs - adding it to the
> MAL> codecs module was probably a mistake to begin with. You
> MAL> usually only get UTF-8 files with BOM marks as the result of
> MAL> recoding UTF-16 files into UTF-8.
>
> There is a standard for UTF-8 _signatures_, however. I don't have the
> most recent version of the ISO-10646 standard, but Amendment 2 (which
> defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to
> Annex F of that standard. Evan quotes Version 4 of the Unicode
> standard, which explicitly defines the UTF-8 signature.
Ok, as signature the BOM does make some sense - whether to
strip signatures from a document is a good idea or not
is a different matter, though.
Here's the Unicode Cons. FAQ on the subject:
http://www.unicode.org/faq/utf_bom.html#22
They also explicitly warn about adding BOMs to UTF-8 data
since it can break applications and protocols that do not
expect such a signature.
> So there is a standard for the UTF-8 signature, and I know of
> applications which produce it. While I agree with you that Python's
> codecs shouldn't produce it (by default), providing an option to strip
> is a good idea.
>
> However, this option should be part of the initialization of an IO
> stream which produces Unicodes, _not_ an operation on arbitrary
> internal strings (whether raw or Unicode).
Right.
> MAL> BTW, how do you know that s came from the start of a file and
> MAL> not from slicing some already loaded file somewhere in the
> MAL> middle ?
>
> The programmer or the application might, but Python's codecs don't.
> The point is that this is also true of rawstrings that happen to
> contain UTF-16 or UTF-32 data. The UTF-16 ("auto-endian") codec
> shouldn't strip leading BOMs either, unless it has been told it has
> the beginning of the string.
The UTF-16 stream codecs implement this logic.
The UTF-16 encode and decode functions will however always strip
the BOM mark from the beginning of a string.
If the application doesn't want this stripping to happen,
it should use the UTF-16-LE or -BE codec resp.
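
A minimal sketch of the difference described here, assuming a current Python 3 interpreter (the thread itself predates it, so byte strings are written with a b prefix): the byte-order-detecting "utf-16" codec consumes a leading BOM, while the explicit "utf-16-le" codec leaves it in the result as U+FEFF.

```python
# UTF-16-LE BOM (ff fe) followed by "hi" in little-endian order.
data = b"\xff\xfeh\x00i\x00"

# The auto-detecting codec uses the BOM to pick the byte order and drops it.
auto = data.decode("utf-16")

# The explicit little-endian codec treats the BOM as an ordinary character.
explicit = data.decode("utf-16-le")

# The explicit codec also writes no BOM when encoding.
encoded_le = "hi".encode("utf-16-le")
```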
> MAL> Evan Jones wrote:
>
> >> This is *not* a valid Unicode character. The Unicode
> >> specification (version 4, section 15.8) says the following
> >> about non-characters:
> >>
> >>> Applications are free to use any of these noncharacter code
> >>> points internally but should never attempt to exchange
> >>> them. If a noncharacter is received in open interchange, an
> >>> application is not required to interpret it in any way. It is
> >>> good practice, however, to recognize it as a noncharacter and
> >>> to take appropriate action, such as removing it from the
> >>> text. Note that Unicode conformance freely allows the removal
> >>> of these characters. (See C10 in Section3.2, Conformance
> >>> Requirements.)
> >>
> >> My interpretation of the specification means that Python should
>
> The specification _permits_ silent removal; it does not recommend.
>
> >> silently remove the character, resulting in a zero length
> >> Unicode string. Similarly, both of the following lines should
> >> also result in a zero length Unicode string:
>
> >> >>> '\xff\xfe\xfe\xff'.decode( "utf16" )
> >> u'\ufffe'
> >> >>> '\xff\xfe\xff\xff'.decode( "utf16" )
> >> u'\uffff'
>
> I strongly disagree; these decisions should be left to a higher layer.
> In the case of specified UTFs, the codecs should simply invert the UTF
> to Python's internal encoding.
>
> MAL> Hmm, wouldn't it be better to raise an error ? After all, a
> MAL> reversed BOM mark in the stream looks a lot like you're
> MAL> trying to decode a UTF-16 stream assuming the wrong byte
> MAL> order ?!
>
> +1 on (optionally) raising an error.
The advantage of raising an error is that the application
can deal with the situation in whatever way seems fit (by
registering a special error handler or by simply using
"ignore" or "replace").
I agree that much of this lies outside the scope of codecs
and should be handled at an application or protocol level.
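
To make the point about error handlers concrete, here is a short sketch assuming a current Python 3 interpreter. It uses an invalid UTF-8 byte rather than a reversed BOM, because that is a case the standard codecs are known to reject. The handler name "mark-bad-bytes" and the function mark_bad_bytes are made up for this illustration.

```python
import codecs

data = b"ok\xffok"  # \xff can never occur in well-formed UTF-8

# Default ("strict") handling: the codec raises, the application decides.
try:
    data.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "error"

# Built-in handlers the application can opt into.
ignored = data.decode("utf-8", errors="ignore")
replaced = data.decode("utf-8", errors="replace")

# A hypothetical application-specific handler: show the offending bytes in hex.
def mark_bad_bytes(exc):
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return ("<%s>" % bad.hex(), exc.end)
    raise exc

codecs.register_error("mark-bad-bytes", mark_bad_bytes)
marked = data.decode("utf-8", errors="mark-bad-bytes")
```

The codec only reports the problem; the policy (fail, drop, replace, annotate) stays with the application.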
> -1 on removing it or anything
> like that, unless under control of the application (ie, the program
> written in Python, not Python itself). It's far too easy for software
> to generate broken Unicode streams[1], and the choice of how to deal
> with those should be with the application, not with the im
Re: [Python-Dev] Unicode byte order mark decoding
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:
Martin> Stephen J. Turnbull wrote:
>> However, this option should be part of the initialization of an
>> IO stream which produces Unicodes, _not_ an operation on
>> arbitrary internal strings (whether raw or Unicode).
Martin> With the UTF-8-SIG codec, it would apply to all operation
Martin> modes of the codec, whether stream-based or from strings.
I had in mind the ability to treat a string as a stream.
Martin> Whether or not to use the codec would be the application's
Martin> choice.
What I think should be provided is a stateful object encapsulating the
codec. Ie, to avoid the need to write
out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")
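
One way to picture such a stateful object is the incremental encoder interface that current Python 3 releases provide in the codecs module. This is only a sketch under that assumption; the interface did not exist in this form when this message was written, and the chunk contents are made up.

```python
import codecs

chunks = ["Hal", "lo"]

# One encoder object remembers whether the signature has been written yet,
# so every chunk goes through the same call.
encoder = codecs.getincrementalencoder("utf-8-sig")()
out = b"".join(encoder.encode(chunk) for chunk in chunks)
```

The signature appears once, at the start of the output, without the caller treating the first chunk specially.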
--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.
Re: [Python-Dev] Unicode byte order mark decoding
>>"MAL" == M <[EMAIL PROTECTED]> writes:

MAL> Stephen J. Turnbull wrote:

>> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it
>> even adds them to existing UTF-8 files lacking them.

MAL> Is that a MS application ? AFAIK, notepad, wordpad and MS
MAL> Office always use UTF-16-LE + BOM when saving text as "Unicode
MAL> text".

Yes, it is an MS application. I'll have to borrow somebody's box to
check, but IIRC UTF-8 is the native "text" encoding for Japanese now.
(Japanized applications generally behave differently from everything
else, as there are so many "standards" for encoding Japanese.)

M> The UTF-16 stream codecs implement this logic.

M> The UTF-16 encode and decode functions will however always
M> strip the BOM mark from the beginning of a string.

M> If the application doesn't want this stripping to happen, it
M> should use the UTF-16-LE or -BE codec resp.

That sounds like it would work fine almost all the time. If it
doesn't it's straightforward to work around, and certainly would be
more convenient for the non-standards-geek programmer.

--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.
Re: [Python-Dev] Mail.python.org
Grant> Not a big deal, but I noticed that https://mail.python.org/ is
Grant> live and shows a generic "Welcome to your new home in
Grant> cyberspace!" message. One of the webmasters may want to
Grant> automatically redirect to http://mail.python.org.

Thanks, I forwarded this along to the folks who can deal with this.

Skip
Re: [Python-Dev] Unicode byte order mark decoding
Stephen J. Turnbull wrote:
> Martin> With the UTF-8-SIG codec, it would apply to all operation
> Martin> modes of the codec, whether stream-based or from strings.
>
> I had in mind the ability to treat a string as a stream.
Hmm. A string is not a stream, but it could be the contents of a stream.
A typical application of codecs goes like this:
data = stream.read()
[analyze data, e.g. by checking whether there is encoding= in
So people do use the "decode-it-all" mode, where no sequential access
is necessary - yet the beginning of the string is still the beginning of
what once was a stream. This case must be supported.
> Martin> Whether or not to use the codec would be the application's
> Martin> choice.
>
> What I think should be provided is a stateful object encapsulating the
> codec. Ie, to avoid the need to write
>
> out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")
No. People who want streaming should use cStringIO, i.e.
>>> import cStringIO, codecs
>>> s=cStringIO.StringIO()
>>> s1=codecs.getwriter("utf-8")(s)
>>> s1.write(u"Hallo")
>>> s.getvalue()
'Hallo'
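
For readers on a current Python 3 interpreter, where cStringIO no longer exists, roughly the same session can be sketched with io.BytesIO. This version also swaps in the "utf-8-sig" codec, which is an assumption about how the proposed codec would be used with a StreamWriter, not something shown in the session above.

```python
import codecs
import io

raw = io.BytesIO()

# Wrap the byte stream in a StreamWriter for the signature-writing codec.
writer = codecs.getwriter("utf-8-sig")(raw)
writer.write("Hal")
writer.write("lo")

written = raw.getvalue()
```

The writer emits the signature on the first write only, so later chunks need no special handling.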
Regards,
Martin
Re: [Python-Dev] Unicode byte order mark decoding
Walter Dörwald wrote:
> M.-A. Lemburg wrote:
>
>>> [...]
>>> With the UTF-8-SIG codec, it would apply to all operation
>>> modes of the codec, whether stream-based or from strings. Whether
>>> or not to use the codec would be the application's choice.
>>
>> I'd suggest to use the same mode of operation as we have in
>> the UTF-16 codec: it removes the BOM mark on the first call
>> to the StreamReader .decode() method and writes a BOM mark
>> on the first call to .encode() on a StreamWriter.
>>
>> Note that the UTF-16 codec is strict w/r to the presence
>> of the BOM mark: you get a UnicodeError if a stream does
>> not start with a BOM mark. For the UTF-8-SIG codec, this
>> should probably be relaxed to not require the BOM.
>
> I've started writing such a codec. Making the BOM optional
> on decoding definitely simplifies the implementation.

OK, here is the patch: http://www.python.org/sf/1177307

The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains "ab", the user will
never see these two characters.

A solution for this would be to add an argument named final to
the decode and read methods that tells the decoder that the
stream has ended and the remaining buffered bytes have to be
handled now.

Bye,
Walter Dörwald
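
The idea of a final argument can be illustrated with the incremental decoder interface found in the codecs module of current Python 3 releases. This is a sketch under that assumption and uses the plain "utf-8" codec; it is not the interface of the patch linked above.

```python
import codecs

# A three-byte character (the euro sign, e2 82 ac) arriving in two pieces.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xe2\x82")   # incomplete: bytes are buffered
second = decoder.decode(b"\xac")      # sequence completed

# If the stream really ends after two bytes, final=True forces the
# buffered bytes through normal error handling.
truncated = codecs.getincrementaldecoder("utf-8")()
truncated.decode(b"\xe2\x82")
try:
    truncated.decode(b"", final=True)
    flushed_cleanly = True
except UnicodeDecodeError:
    flushed_cleanly = False
```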
Re: [Python-Dev] Unicode byte order mark decoding
On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
> The stateful decoder has a little problem: At least three bytes
> have to be available from the stream until the StreamReader
> decides whether these bytes are a BOM that has to be skipped.
> This means that if the file only contains "ab", the user will
> never see these two characters.

Shouldn't the decoder be capable of doing a partial match and quitting
early? After all, "ab" is encoded in UTF8 as <61> <62> but the BOM is
<ef> <bb> <bf>. If it did this type of partial matching, this issue
would be avoided except in rare situations.

> A solution for this would be to add an argument named final to
> the decode and read methods that tells the decoder that the
> stream has ended and the remaining buffered bytes have to be
> handled now.

This functionality is provided by a flush() method on similar objects,
such as the zlib compression objects.

Evan Jones
Re: [Python-Dev] Unicode byte order mark decoding
On Tuesday 05 April 2005 15:53, Evan Jones wrote:
> This functionality is provided by a flush() method on similar objects,
> such as the zlib compression objects.

Or by close() on other objects (htmllib, HTMLParser, the SAX incremental
parser, etc.).

Too bad there's more than one way to do it. :-(

-Fred

--
Fred L. Drake, Jr.
Re: [Python-Dev] Unicode byte order mark decoding
Walter Dörwald wrote:
> The stateful decoder has a little problem: At least three bytes
> have to be available from the stream until the StreamReader
> decides whether these bytes are a BOM that has to be skipped.
> This means that if the file only contains "ab", the user will
> never see these two characters.

This can be improved, of course: If the first byte is "a",
it most definitely is *not* an UTF-8 signature.

So we only need a second byte for the characters between U+F000
and U+FFFF, and a third byte only for the characters
U+FEC0...U+FEFF. But with the first byte being \xef, we need
three bytes *anyway*, so we can always decide with the first
byte only whether we need to wait for three bytes.

> A solution for this would be to add an argument named final to
> the decode and read methods that tells the decoder that the
> stream has ended and the remaining buffered bytes have to be
> handled now.

Shouldn't an empty read from the underlying stream be taken
as an EOF?

Regards,
Martin
Re: [Python-Dev] longobject.c & ob_size
[Michael Hudson]
> Asking mostly for curiosity, how hard would it be to have longs store
> their sign bit somewhere less aggravating?

Depends on where that is.

> It seems to me that the top bit of ob_digit[0] is always 0, for example,

Yes, the top bit of ob_digit[i], for all relevant i, is 0 on all
platforms now.

> and while I'm sure this would result in no less convolution in
> longobject.c it'd be considerably more localized convolution.

I'd much rather give struct _longobject a distinct sign member (say, 0
== zero, -1 == non-zero negative, 1 == non-zero positive).

That would simplify code. It would cost no extra bytes for some longs,
and 8 extra bytes for others (since obmalloc rounds up to a multiple
of 8); I don't care about that (e.g., I never use millions of longs
simultaneously, but often use a few dozen very big longs
simultaneously; the memory difference is in the noise then).

Note that longintrepr.h isn't included by Python.h. Only longobject.h
is, and longobject.h doesn't reveal the internal structure of longs.
IOW, changing the internal layout of longs shouldn't even hurt binary
compatibility.

The ob_size member of PyObject_VAR_HEAD would also be redeclared as
size_t in an ideal world.
Re: [Python-Dev] Unicode byte order mark decoding
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>> The stateful decoder has a little problem: At least three bytes
>> have to be available from the stream until the StreamReader
>> decides whether these bytes are a BOM that has to be skipped.
>> This means that if the file only contains "ab", the user will
>> never see these two characters.
>
> This can be improved, of course: If the first byte is "a",
> it most definitely is *not* an UTF-8 signature.
>
> So we only need a second byte for the characters between U+F000
> and U+FFFF, and a third byte only for the characters
> U+FEC0...U+FEFF. But with the first byte being \xef, we need
> three bytes *anyway*, so we can always decide with the first
> byte only whether we need to wait for three bytes.

OK, I've updated the patch so that the first bytes will only be kept in
the buffer if they are a prefix of the BOM.

>> A solution for this would be to add an argument named final to
>> the decode and read methods that tells the decoder that the
>> stream has ended and the remaining buffered bytes have to be
>> handled now.
>
> Shouldn't an empty read from the underlying stream be taken
> as an EOF?

There are situations where the byte stream might be temporarily
exhausted, e.g. an XML parser that tries to support the
IncrementalParser interface, or when you want to decode encoded data
piecewise, because you want to give a progress report.

Bye,
Walter Dörwald
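
The buffering rule described in this message (hold bytes back only while they are a proper prefix of the BOM, and let a final flag flush them) can be sketched as follows for a current Python 3 interpreter. The class SigDecoder is hypothetical and written for this illustration; it is not the code from the patch.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"

class SigDecoder:
    """Hypothetical incremental decoder that strips an optional UTF-8 signature."""

    def __init__(self):
        self._pending = b""   # bytes held back while they might be a BOM
        self._first = True    # still deciding whether the input starts with a BOM
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def decode(self, data, final=False):
        if self._first:
            data = self._pending + data
            self._pending = b""
            # Keep waiting only while the data is a proper prefix of the BOM
            # and the caller has not declared the input finished.
            if len(data) < len(BOM) and BOM.startswith(data) and not final:
                self._pending = data
                return ""
            self._first = False
            if data.startswith(BOM):
                data = data[len(BOM):]
        return self._decoder.decode(data, final)

# "ab" is not a BOM prefix, so it is returned immediately.
short = SigDecoder().decode(b"ab")

# A signature split across two chunks is still recognised and stripped.
split = SigDecoder()
pieces = [split.decode(b"\xef"), split.decode(b"\xbb\xbf"), split.decode(b"hi")]
```

With final=True, a dangling prefix such as b"\xef\xbb" is handed to the underlying UTF-8 decoder and goes through its normal error handling instead of being silently kept in the buffer.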
Re: [Python-Dev] Unicode byte order mark decoding
Evan Jones wrote:
> On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
>> The stateful decoder has a little problem: At least three bytes
>> have to be available from the stream until the StreamReader
>> decides whether these bytes are a BOM that has to be skipped.
>> This means that if the file only contains "ab", the user will
>> never see these two characters.
>
> Shouldn't the decoder be capable of doing a partial match and quitting
> early? After all, "ab" is encoded in UTF8 as <61> <62> but the BOM is
> <ef> <bb> <bf>. If it did this type of partial matching, this issue
> would be avoided except in rare situations.
>
>> A solution for this would be to add an argument named final to
>> the decode and read methods that tells the decoder that the
>> stream has ended and the remaining buffered bytes have to be
>> handled now.
>
> This functionality is provided by a flush() method on similar objects,
> such as the zlib compression objects.

Theoretically the name is unimportant, but read(..., final=True) or
flush() or close() should subject the pending bytes to normal error
handling and must return the result of decoding these pending bytes
just like the other methods do. This would mean that we would have to
implement a decodeclose(), a readclose() and a readlineclose(). IMHO it
would be best to add this argument to decode, read and readline
directly. But I'm not sure what this would mean for iterating through
a StreamReader.

Bye,
Walter Dörwald
Re: [Python-Dev] Unicode byte order mark decoding
Walter Dörwald wrote:
> There are situations where the byte stream might be temporarily
> exhausted, e.g. an XML parser that tries to support the
> IncrementalParser interface, or when you want to decode encoded data
> piecewise, because you want to give a progress report.

Yes, but these are not file-like objects. In the IncrementalParser,
it is *not* the case that a read operation returns an empty string.
Instead, the application repeatedly feeds data explicitly.

For a file-like object, returning "" indicates EOF.

Regards,
Martin
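
The convention that an empty read means EOF for a file-like object can be sketched like this on a current Python 3 interpreter. The helper decode_stream and its chunk_size parameter are made up for the example; the final flag is the one offered by the incremental decoders in today's codecs module.

```python
import codecs
import io

def decode_stream(stream, encoding="utf-8", chunk_size=2):
    """Decode a binary file-like object piecewise, treating an empty read as EOF."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # Empty read: the file-like object is exhausted, so flush the decoder.
            parts.append(decoder.decode(b"", final=True))
            break
        parts.append(decoder.decode(chunk))
    return "".join(parts)

# Two-byte chunks split the multi-byte characters; the decoder buffers them.
text = decode_stream(io.BytesIO("Grüße".encode("utf-8")))
```

For a feed-style interface such as an incremental parser, the application would instead pass final=True itself when it knows the data has ended.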
[Python-Dev] Developer list update
FYI, I'm starting a project to see what has become of some of the
inactive developers. Essentially, it involves sending them a note to
see if they still have use for their checkin permissions. If not, then
we can make the change and improve security a bit.

Also, to help with institutional memory, I started a log of changes to
developer permissions. The goal is to remember who was given access, by
whom, and why (some folks are given access for a one-shot project for
example). The file is at Misc/developers.

The first entry is for Nick Coghlan who was just granted tracker
permissions so he can help manage outstanding bugs and patches.

Raymond Hettinger
Re: [Python-Dev] Developer list update
On Tuesday 05 April 2005 06:47, Raymond Hettinger wrote:
> Also, to help with institutional memory, I started a log of changes to
> developer permissions. The goal is to remember who was given access, by
> whom, and why (some folks are given access for a one-shot project for
> example). The file is at Misc/developers.

Thanks, Raymond!

Would anyone here object to renaming the file to developers.txt, though?

-Fred

--
Fred L. Drake, Jr.
Re: [Python-Dev] Developer list update
On Tue, 2005-04-05 at 19:06, Fred Drake wrote:
> Would anyone here object to renaming the file to developers.txt, though?

+1, please!

-Barry
Re: [Python-Dev] Unicode byte order mark decoding
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:
Martin> So people do use the "decode-it-all" mode, where no
Martin> sequential access is necessary - yet the beginning of the
Martin> string is still the beginning of what once was a
Martin> stream. This case must be supported.
Of course it must be supported. My point is that many strings (in my
applications, all but those strings that result from slurping in a
file or process output in one go -- example, not a statistically valid
sample!) are not the beginning of "what once was a stream". It is
error-prone (not to mention unaesthetic) to not make that distinction.
"Explicit is better than implicit."
Martin> Whether or not to use the codec would be the application's
Martin> choice.
>> What I think should be provided is a stateful object
>> encapsulating the codec. Ie, to avoid the need to write
>> out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")
Martin> No. People who want streaming should use cStringIO, i.e.
>>> s=cStringIO.StringIO()
>>> s1=codecs.getwriter("utf-8")(s)
>>> s1.write(u"Hallo")
>>> s.getvalue()
'Hallo'
Yes! Exactly (except in reverse, we want to _read_ from the slurped
stream-as-string, not write to one)! ... and there's no need for a
utf-8-sig codec for strings, since you can support the usage in
exactly this way.
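
The reading direction mentioned here can be sketched for a current Python 3 interpreter with io.BytesIO standing in for cStringIO. It assumes a "utf-8-sig" stream codec as found in today's standard library; the sample bytes are made up.

```python
import codecs
import io

# Bytes slurped from a file in one go, signature included.
slurped = b"\xef\xbb\xbfHallo"

# Treat the string as a stream: the reader knows it starts at the beginning.
reader = codecs.getreader("utf-8-sig")(io.BytesIO(slurped))
text = reader.read()
```

Wrapping the bytes in a stream makes the "this is the beginning of the data" assumption explicit at the call site.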
--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.
Re: [Python-Dev] Developer list update
[Fred Drake]
>> Would anyone here object to renaming the file to developers.txt, though?

[Barry Warsaw]
> +1, please!

I voted with my DOS box.
[Python-Dev] inconsistency when swapping obj.__dict__ with a dict-like object...
Hi!
here is a simple piece of code
---cut---
class Dict(dict):
    def __init__(self, dct=None):
        # avoid sharing one mutable default dict between instances
        if dct is None:
            dct = {}
        self._dict = dct
    def __getitem__(self, name):
        return self._dict[name]
    def __setitem__(self, name, value):
        self._dict[name] = value
    def __delitem__(self, name):
        del self._dict[name]
    def __contains__(self, name):
        return name in self._dict
    def __iter__(self):
        return iter(self._dict)

class A(object):
    def __new__(cls, *p, **n):
        o = object.__new__(cls)
        o.__dict__ = Dict()
        return o

a = A()
a.xxx = 123
print a.__dict__._dict
a.__dict__._dict['yyy'] = 321
print a.yyy
--uncut--
Here there are two problems, the first is minor, and it is that
anything assigned to the __dict__ attribute is checked to be a
descendant of the dict class (mixing this in does not seem to work)...
and the second problem is a real annoyance, it is that the mapping
protocol supported by the Dict object in the example above is not used
by the attribute access mechanics (the same thing that once happened
in exec)...
P.S. (IMHO) the type check here is not that necessary (at least in its
current state), as what we need to assert is not the relation to the
dict class but the support of the mapping protocol
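
Both observations can be reproduced in a few lines. The sketch below is written for a current Python 3 interpreter and reflects CPython behaviour as far as I know; RecordingDict is a made-up helper that records which keys pass through its overridden __setitem__.

```python
class RecordingDict(dict):
    """dict subclass that records every key stored via __setitem__."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def __setitem__(self, key, value):
        self.seen.append(key)
        super().__setitem__(key, value)

class A:
    pass

a = A()
a.__dict__ = RecordingDict()   # accepted: it is a dict subclass

a.xxx = 123                    # attribute assignment
a.__dict__["yyy"] = 321        # explicit mapping access

# Anything that is not a dict (or subclass) is rejected by the type check.
try:
    A().__dict__ = object()
    type_check = "accepted"
except TypeError:
    type_check = "rejected"
```

Only the explicit mapping access is recorded in a.__dict__.seen: the attribute assignment stores straight into the underlying dict storage and never calls the overridden method.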
thanks.
--
Alex.
Re: [Python-Dev] inconsistency when swapping obj.__dict__ with a dict-like object...
Alex A. Naanou wrote:
> Hi!
>
> here is a simple piece of code
>
> ---cut---
> class Dict(dict):
>     def __init__(self, dct=None):
>         # avoid sharing one mutable default dict between instances
>         if dct is None:
>             dct = {}
>         self._dict = dct
>     def __getitem__(self, name):
>         return self._dict[name]
>     def __setitem__(self, name, value):
>         self._dict[name] = value
>     def __delitem__(self, name):
>         del self._dict[name]
>     def __contains__(self, name):
>         return name in self._dict
>     def __iter__(self):
>         return iter(self._dict)
>
> class A(object):
>     def __new__(cls, *p, **n):
>         o = object.__new__(cls)
>         o.__dict__ = Dict()
>         return o
>
> a = A()
> a.xxx = 123
> print a.__dict__._dict
> a.__dict__._dict['yyy'] = 321
> print a.yyy
>
> --uncut--
>
>
> Here there are two problems, the first is minor, and it is that
> anything assigned to the __dict__ attribute is checked to be a
> descendant of the dict class (mixing this in does not seem to work)...
> and the second problem is a real annoyance, it is that the mapping
> protocol supported by the Dict object in the example above is not used
> by the attribute access mechanics (the same thing that once happened
> in exec)...
>
Actually, overriding __getattribute__() does work; __getattr__() and
__getitem__() don't. This was brought up last month at some point without
any resolution (I think Steve Bethard pointed it out).
> P.S. (IMHO) the type check here is not that necessary (at least in its
> current state), as what we need to assert is not the relation to the
> dict class but the support of the mapping protocol
>
Semantically necessary, no. But simplicity- and performance-wise, maybe. If
you grep around in Objects/classobject.c, for instance, you will see
PyClassObject.cl_dict is accessed using PyDict_GetItem() and I spotted at least
one use of PyDict_DelItem(). To use the mapping protocol would require
changing all of these to PyObject_GetItem() and such.
Which will be a performance penalty compared to PyDict_GetItem(). So the
question is whether the flexibility is worth it.
-Brett
Re: [Python-Dev] Unicode byte order mark decoding
Stephen J. Turnbull wrote:
> Of course it must be supported. My point is that many strings (in my
> applications, all but those strings that result from slurping in a
> file or process output in one go -- example, not a statistically valid
> sample!) are not the beginning of "what once was a stream". It is
> error-prone (not to mention unaesthetic) to not make that distinction.
>
> "Explicit is better than implicit."
I can't put these two paragraphs together. If you think that explicit
is better than implicit, why do you not want to make different calls
for the first chunk of a stream, and the subsequent chunks?
> >>> s=cStringIO.StringIO()
> >>> s1=codecs.getwriter("utf-8")(s)
> >>> s1.write(u"Hallo")
> >>> s.getvalue()
> 'Hallo'
>
> Yes! Exactly (except in reverse, we want to _read_ from the slurped
> stream-as-string, not write to one)! ... and there's no need for a
> utf-8-sig codec for strings, since you can support the usage in
> exactly this way.
However, if there is an utf-8-sig codec for streams, there is currently
no way of *preventing* this codec to also be available for strings. The
very same code is used for streams and for strings, and automatically
so.
Regards,
Martin
