Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
> The name "utf8b" suggested in the PEP is not in line with the codec
> design

Where is that design documented, and how exactly violates the name
the design (chapter and verse, please).

> Error handlers and codecs are two different things, so the namespaces
> need to be clearly separate.

They *are* separate namespaces; that's guaranteed by the implementation.

Regards,
Martin
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
> > > It occurs to me that the PEP maybe should say that it is an error
> > > to have your POSIX locale set to UTF-16 or something like that.
> >
> > No. It is *impossible* to have UTF-16 as the locale character set,
> > not an error. Your statement is like saying "it is an error to
> > breathe in the vacuum".
>
> I realize this is not useful, so maybe you don't need to mention it.
> However, it certainly is possible to set LANG with an absurd, or
> merely dangerous, encoding.

How so? The C library will filter it out.

> > In any case, the discussion says
> >
> > # Encodings that are not compatible with ASCII are not supported by
> > # this specification; bytes in the ASCII range that fail to decode
> > # will cause an exception. It is widely agreed that such encodings
> > # should not be used as locale charsets.
>
> Which is your excuse for not supporting Shift JIS fully. It doesn't
> stop people from setting LC_ALL=ja_JP.shift_jis,

Well, it *does* stop them from doing so if their systems don't support
the locale setting. In any case, if they do this, PEP 383 will not
support them.

> or using Shift JIS as the default encoding for certain media.

I fail to see how this could ever matter. If, by "media", you mean
things like removable disks, and the file name encoding used on them,
it's fairly irrelevant for the PEP, since Python won't start using
Shift JIS as its file system encoding just because that's the encoding
used on the disk.

Regards,
Martin
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
> > > Second, I suggest "surrogate-replace" as the name of the error handler
> > > rather than "utf8b".
> >
> > I think this is bike-shedding.
>
> I don't personally care (I already was aware of UTF-8B), but there are
> plenty of others who do.

I think it is a fairly bad name, because it is easy to confuse it with
the "surrogates" error handler (unless you suggest renaming that also).

> You have to fix the existing uses of
> the obsolete "python-escape", anyway.

Indeed - but only in the PEP. In the implementation, it's already utf8b
throughout. Now it is also in the PEP; thanks for pointing that out.

> > It's a security risk. If U+DCXX would map to \xXX, then somebody could
> > embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets
> > sanitized, nobody would expect that this will actually access ../
>
> The odds that anybody will actually take notice of U+002E U+002E
> U+002F in a string are sufficiently small that any number of exploits
> have already been based on it. I agree that there is some additional
> risk from this if people make the check for "../" before they prepend
> "\udc2e\udc2e\udc2f", but I think that risk is very small compared to
> the pain of having an error handler whose raison d'etre is to not raise
> exceptions go ahead and raise them anyway.

The problem is that functions like normpath will recognize ../, and
that applications rely on them for file name sanitation. If they could
be tricked into writing outside of their target folders, this would be
a huge security risk.

OTOH, I don't care about breaking applications on misconfigured systems.
People using SJIS as their locale encoding have bigger problems than
Python raising exceptions.

> See also my reply to Lino Mastrodomenico.

URL?

> But you're writing the PEP, so this battle will have to be deferred.
> Eventually Python will have to take a stand on Unicode conformance,
> but it's not urgent yet.

I think it's always applications that are conforming or not, rather
than libraries. Libraries should allow one to write conforming
applications. They may refuse to support certain non-conforming
applications (although users then replace the library with one that
does allow them to do what they want). Libraries can never enforce that
applications conform to some standard.

> Sorry! I suggest substituting the paragraph above for the paragraph
> which begins "The encode error handler interface presently requires..."
> at line 129.

Ah, ok. This was Glenn Linderman's text before - now it's yours :-)

> I think I forgot to do this before: "I hereby dedicate all text
> I suggest for inclusion in the PEP to the public domain."

:-)

Martin
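The security argument above can be checked with a minimal sketch. It assumes Python 3.1 or later, where the handler that this thread calls "utf8b" shipped under the name "surrogateescape"; the path "uploads/" and the variable names are made up for illustration. The sketch shows that the low surrogates U+DC2E and U+DC2F are refused on encoding rather than turned into "." and "/", and that normpath does not treat them as a parent-directory component.

```python
import posixpath

# Assumption: "surrogateescape" is the released name of the handler
# that this thread still calls "utf8b".
HANDLER = "surrogateescape"

# An attacker-supplied string that would spell "../" if U+DCXX mapped to \xXX.
smuggled = "\udc2e\udc2e\udc2f" + "etc/passwd"

# Only U+DC80..U+DCFF are mapped back to bytes; anything lower is refused.
try:
    smuggled.encode("utf-8", HANDLER)
    refused = False
except UnicodeEncodeError:
    refused = True

# normpath sees no ".." component here, so the string is left unchanged.
normalized = posixpath.normpath("uploads/" + smuggled)

print(refused)
print(normalized == "uploads/" + smuggled)
```

Because the encode step raises, the string never reaches the file system as "../etc/passwd", which is the behaviour Martin defends.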
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
> Yeah, yeah, this is the same old same old from PEP 3131. Anything
> that handles the various attacks based on ASCII-alike characters
> should at least rule out invalid Unicode, too!
>
> And where is this U+DC2F supposed to be coming from, anyway? The
> user's *local* environment or the user's *local* filesystem!

Why is that not a threat? Suppose you have a setuid application, and
you pass some string on the command line that decodes to /../. Then
the setuid application will be tricked into modifying files it didn't
mean to modify.

Likewise, it might come from a relational database. Use a relational
database that supports unicode code units, or lone surrogates through
utf-8, and fill in some bogus data. Then have the Python application
(running as root) read it.

> Of course I can't prove that there's no vector for an exploit here (in
> fact, I'm sure there is one with sufficiently careless handling of
> input), but I think "consenting adults" covers the Shift JIS use case.
> Make it an option, but it should be explicitly part of the PEP.

Nothing is lost at the moment. If users complain, we can still think
of ways to enhance the experience.

In any case, Python 3.1b1 may get released today, so it's way too late
for new features in the PEP. They can wait for Python 3.2.

Regards,
Martin
[Python-Dev] Help on issue 5941
Hello,

I need some help on http://bugs.python.org/issue5941

The bug is quite simple: the Distutils unixcompiler used to set the
archiver command to "ar -rc".

For quite a while now, this behavior has changed in order to be able
to customize the compiler behavior from the environment. That
introduced a regression because the mechanism in Distutils that looks
for the AR variable in the environment also looks into the Makefile of
Python (in the Makefile, then in os.environ).

And as a matter of fact, AR is set to "ar" in there, so the -cr option
is not set anymore.

So my question is: should I make a change to the Makefile by adding,
for example, a variable called AR_OPTIONS and then build the ar command
with AR + AR_OPTIONS

*or*

does that not make sense, and I just need to change the behavior so it
doesn't look for AR in the Makefile (just in os.environ)?

Thanks
Tarek

-- 
Tarek Ziadé | http://ziade.org
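The first option in the question above (AR plus a separate AR_OPTIONS variable) could look roughly like the sketch below. This is not Distutils code: archiver_command is a hypothetical helper, the two dictionaries stand in for os.environ and the parsed Makefile variables, and the "-cr" default is only taken from the flags mentioned in the message.

```python
def archiver_command(environ, makefile_vars, default_options="-cr"):
    """Hypothetical helper: build the archiver command as AR + AR_OPTIONS.

    The environment wins over the Makefile, and the options fall back to
    a default so that a bare AR=ar no longer drops the flags.
    """
    ar = environ.get("AR") or makefile_vars.get("AR") or "ar"
    options = (environ.get("AR_OPTIONS")
               or makefile_vars.get("AR_OPTIONS")
               or default_options)
    return ar.split() + options.split()


# The regression case: the Makefile only says AR=ar.
command = archiver_command({}, {"AR": "ar"})
print(command)
```

Keeping the options in their own variable means that overriding the archiver binary and overriding its flags stay independent of each other.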
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Martin v. Löwis writes:
>
> > I don't personally care (I already was aware of UTF-8B), but there are
> > plenty of others who do.
>
> I think it is a fairly bad name, because it is easy to confuse it with
> the "surrogates" error handler (unless you suggest to rename that also).

I didn't bother to say it at the time, but I think "surrogates" is a
pretty bad name. It should be more indicative of what it does, e.g.
"surrogates-pass", or "surrogates-accept".

> > > It's a security risk. If U+DCXX would map to \xXX, then somebody could
> > > embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets
> > > sanitized, nobody would expect that this will actually access ../

Agreed, this is an annoying security breach. The whole point of the PEP
is that application developers do not have to care about filename
encoding issues, which is defeated if they have to check for strange
(illegal) combinations of characters.

By the way, what are the ASCII characters that are not supported by
Shift-JIS? Not many I suppose? (if I read the Wikipedia entry correctly,
it's only the backslash and the tilde).

Regards

Antoine.
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
"Martin v. Löwis" writes: > I fail to see how this could ever matter. If, by "media", you mean > things like removable disks, and the file name encoding used on them, > it's fairly irrelevant for the PEP, since Python won't start using > Shift JIS as its file system encoding just because that's the encoding > used on the disk. I'm sorry for the lack of clarity of my posts, but somehow you're completely missing the point. The point is precisely that Python *won't* use Shift JIS as the file system encoding (if it did there would be no problem with reading Shift JIS), but the people who created the media *did*. Now, with Python's file system encoding == UTF-8 or any packed EUC, and more than a handful of Shift JIS or Big5 characters in file names, one is *almost certain* to encounter ASCII as the second byte of a multibyte sequence. PEP 383 can't handle this, but it is sure to be the most common use case for PEP 383 in East Asia. ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Martin v. Löwis wrote:
>> The name "utf8b" suggested in the PEP is not in line with the codec
>> design
>
> Where is that design documented, and how exactly violates the name
> the design (chapter and verse, please).

Martin, I designed the whole Python codec machinery, so even if this
is not explicitly written down somewhere, you can take my word for it.
I don't want users to be confused by such an error handler name, so
please change it !

Here's a list of the currently available error handlers (taken from
codecs.py):

The .encode()/.decode() methods may use different error handling
schemes by providing the errors argument. These string values are
predefined:

 'strict' - raise a ValueError error (or a subclass)
 'ignore' - ignore the character and continue with the next
 'replace' - replace with a suitable replacement character;
             Python will use the official U+FFFD REPLACEMENT
             CHARACTER for the builtin Unicode codecs on
             decoding and '?' on encoding.
 'xmlcharrefreplace' - Replace with the appropriate XML
                       character reference (only for encoding).
 'backslashreplace'  - Replace with backslashed escape sequences
                       (only for encoding).

The set of allowed values can be extended via register_error.

>> Error handlers and codecs are two different things, so the namespaces
>> need to be clearly separate.
>
> They *are* separate namespaces; that's guaranteed by the implementation.

In the implementation, yes, but not in the head of a typical user: the
'utf8b' looks more like a codec name than an error handler name.

I want to avoid any such confusion with Python codecs and don't
understand why you are making a problem out of this.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, May 06 2009)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

2009-06-29: EuroPython 2009, Birmingham, UK                53 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
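For readers who have not used the predefined handlers listed above, here is a small illustrative sketch of what each one does to the same input. It assumes any Python 3 and the ASCII codec; the sample string is arbitrary.

```python
text = "caf\u00e9"  # "café": the last character is not ASCII

# Each predefined handler deals with the unencodable character differently.
results = {
    "ignore": text.encode("ascii", "ignore"),
    "replace": text.encode("ascii", "replace"),
    "xmlcharrefreplace": text.encode("ascii", "xmlcharrefreplace"),
    "backslashreplace": text.encode("ascii", "backslashreplace"),
}

# 'strict' raises UnicodeEncodeError, which is a subclass of ValueError.
try:
    text.encode("ascii", "strict")
    strict_raised = False
except ValueError:
    strict_raised = True

# On decoding, 'replace' substitutes U+FFFD REPLACEMENT CHARACTER.
decoded = b"caf\xe9".decode("ascii", "replace")

for name, value in results.items():
    print(name, value)
print(strict_raised, ascii(decoded))
```

All of these names describe an action, which is the naming pattern the message above is appealing to.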
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
M.-A. Lemburg wrote:
> Martin v. Löwis wrote:
>>> The name "utf8b" suggested in the PEP is not in line with the codec
>>> design
>>
>> Where is that design documented, and how exactly violates the name
>> the design (chapter and verse, please).
>
> Martin, I designed the whole Python codec machinery, so even if this
> is not explicitly written down somewhere, you can take my word for it.
> I don't want users to be confused by such an error handler name, so
> please change it !
>
> Here's a list of the currently available error handlers (taken from
> codecs.py):
>
> The .encode()/.decode() methods may use different error handling
> schemes by providing the errors argument. These string values are
> predefined:
>
>  'strict' - raise a ValueError error (or a subclass)
>  'ignore' - ignore the character and continue with the next
>  'replace' - replace with a suitable replacement character;
>              Python will use the official U+FFFD REPLACEMENT
>              CHARACTER for the builtin Unicode codecs on
>              decoding and '?' on encoding.
>  'xmlcharrefreplace' - Replace with the appropriate XML
>                        character reference (only for encoding).
>  'backslashreplace'  - Replace with backslashed escape sequences
>                        (only for encoding).
>
> The set of allowed values can be extended via register_error.
>
>>> Error handlers and codecs are two different things, so the namespaces
>>> need to be clearly separate.
>>
>> They *are* separate namespaces; that's guaranteed by the implementation.
>
> In the implementation, yes, but not in the head of a typical user: the
> 'utf8b' looks more like a codec name than an error handler name.

Judging by the existing names, I think that 'surrogate' would be
reasonable. It already contains the meaning of substitute, it's not too
long, and the codes which act as replacements are already called
surrogates.

> I want to avoid any such confusion with Python codecs and don't
> understand why you are making a problem out of this.
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
MRAB writes:
>
> Judging by the existing names, I think that 'surrogate' would be
> reasonable. It already contains the meaning of substitute,

Only if you are a native English-speaker I suppose...
For me it's just a technical term denoting a certain class of unicode
code points (I'm not sure of the latter terminology ;-)).

Regards

Antoine.
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
2009/5/6 Antoine Pitrou:
> By the way, what are the ASCII characters that are not supported by
> Shift-JIS?
> Not many I suppose? (if I read the Wikipedia entry correctly, it's only the
> backslash and the tilde).

The biggest problem with Shift-JIS is that a perfectly valid unicode
character above 127 can be encoded to a byte sequence that includes
bytes in range(128).

E.g. the character 掛 (a.k.a. '\u639b'), when encoded with Shift-JIS,
becomes the two-byte sequence b'\x8a|'. Notice that the second byte is
124, which on POSIX is usually interpreted as the pipe character and
can have security implications.

It's a known problem with Shift-JIS and was fixed in UTF-8.

-- 
Lino Mastrodomenico
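The example in the message above can be reproduced directly. This is a short sketch, assuming Python 3 with its standard "shift_jis" codec; the variable names are illustrative only.

```python
char = "\u639b"  # the character from the message above

# Shift-JIS: the trail byte of this character is 0x7C, the ASCII pipe.
sjis_bytes = char.encode("shift_jis")
has_ascii_trail_byte = any(b < 0x80 for b in sjis_bytes)

# UTF-8: every byte of a non-ASCII character is >= 0x80, so no byte of a
# multibyte sequence can be mistaken for an ASCII character.
utf8_bytes = char.encode("utf-8")
utf8_all_high = all(b >= 0x80 for b in utf8_bytes)

print(sjis_bytes, has_ascii_trail_byte)
print(utf8_bytes, utf8_all_high)
```

This is the property that makes an encoding "ASCII compatible" in the sense the PEP requires.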
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On Wed, May 6, 2009 at 09:31, "Martin v. Löwis" wrote:
> They *are* separate namespaces; that's guaranteed by the implementation.

Yes. But utf8b *sounds like* an encoding, when it isn't. I sure thought
it was when it was first mentioned. I agree that it would be better to
find another name. 'utf8-binary-replace'? Is it only usable with utf8
as an encoding?

-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Lino Mastrodomenico writes:

 > It's a known problem with Shift-JIS and was fixed in UTF-8.

It was fixed in EUC before Shift-JIS was invented by Microsoft or Big5
was invented by the Taiwanese clone makers. Guido's not the only
language designer with a time machine
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
"Martin v. Löwis" writes: > > Yeah, yeah, this is the same old same old from PEP 3131. Anything > > that handles the various attacks based on ASCII-alike characters > > should at least rule out invalid Unicode, too! > > > > And where is this U+DC2F supposed to be coming from, anyway? The > > user's *local* environment or the user's *local* filesystem! > > Why is that not a threat? Suppose you have a setuid application, and > you pass some string on the command line that decodes to /../. Then > the setuid application will be tricked into modifying files it didn't > mean to modify. Of course this is a threat, assuming that the application takes no precautions. But first, it should be stopped by any of several standard precautions. For example, applying os.path.realpath (come to think of it, PEP 383 should say something about realpath, shouldn't it?) and os.path.normpath (PEP 383 should definitely say something about this function; maybe PEP 3131 should, too) before checking access restrictions. If you're not running your paths through those, you're already vulnerable to symlink attacks, and maybe other forms of spoofing. Second, it's a threat already enabled by your restricted version of PEP 383. Access control applies to subdirectories as well as to parent directories. Since you can insert arbitrary non-ASCII bytes into the path using the current definition of 'utf8b', name-based access restrictions can be bypassed in exactly the same way for any directory whose name is not 100.00% ASCII, and the setuid application will be tricked into modifying files it didn't mean to modify. 
Also, on Mac OS X, system directories, including directories containing system libraries, frameworks, and executables, may be accessible via locale-specific names (I don't have a Japanese-localized Mac at hand to check, but I'm pretty sure in my old Mac the Japanese names appeared in ls in Terminal.app, which means it may be possible to access system directories containing libraries, frameworks, and executables this way). Those can be spoofed in exactly the same way. > Nothing is lost at the moment. Nothing is lost compared to 'strict', true, but under the PEP as it is a large fraction of Shift JIS and Big5 filenames cannot be read under ASCII-compatible file system encodings using 'utf8b'. Yet it is those users who are placed at risk by PEP 383. > In any case, Python 3.1b1 may get released today, so it's way too late > for new features in the PEP. They can wait for Python 3.2. You have convinced me that the PEP should wait as well. In its current form it is incomplete and dangerous.
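The "standard precautions" mentioned in this message (resolving a path with realpath before checking access restrictions) can be sketched as follows. is_within is a hypothetical helper, not something from the PEP or the standard library, and the sketch assumes POSIX-style paths on a single file system root.

```python
import os


def is_within(base_dir, user_path):
    """Hypothetical helper: True if user_path stays inside base_dir.

    Both sides are resolved with realpath first, so ".." components and
    symlinks are taken into account before the comparison is made.
    """
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, user_path))
    return os.path.commonpath([base, target]) == base


print(is_within("/tmp", "a/b.txt"))
print(is_within("/tmp", "../etc/passwd"))
```

The check is made on the resolved path rather than on the raw string, so it does not depend on how any unusual characters in the name were decoded.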
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Stephen J. Turnbull writes:
>
> Nothing is lost compared to 'strict', true, but under the PEP as it is
> a large fraction of Shift JIS and Big5 filenames cannot be read under
> ASCII-compatible file system encodings using 'utf8b'.

You should really be more specific. I'm not sure about others, but I
don't understand what filenames you are talking about.
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On Wed, 6 May 2009 at 13:40, Antoine Pitrou wrote:
> Stephen J. Turnbull writes:
>> Nothing is lost compared to 'strict', true, but under the PEP as it is
>> a large fraction of Shift JIS and Big5 filenames cannot be read under
>> ASCII-compatible file system encodings using 'utf8b'.
>
> You should really be more specific. I'm not sure about others, but I
> don't understand what filenames you are talking about.

Seems to me that the best thing to do would be to file a bug report
with test cases that demonstrate the problems when run against the
current py3k trunk. Especially the security issues you cite (which I
don't understand).

--David
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On May 6, 2009, at 7:33 AM, Stephen J. Turnbull wrote:
> You have convinced me that the PEP should wait as well. In its
> current form it is incomplete and dangerous.

+1 on delaying PEP 383

I think PEP 383 is a good idea in principle, but I'm still struggling
to understand it myself, and it seems to offer new hazards for the
unwary programmer. On the other hand, maybe the wary programmers are
waiting for Python 3.2 anyway.

On the gripping hand, if PEP 383 is released in Python 3.1, will that
obligate python-dev to support it indefinitely, at least in
backwards-compatibility mode? I'm not thinking of API compatibility as
much as data compatibility -- someone used Python 3.1 to write down
some filenames, and now a few years later they are trying to use the
latest and greatest Python release to read those filenames...

Regards,

Zooko
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On May 6, 2009, at 5:39 AM, Stephen J. Turnbull wrote:
> Now, with Python's file system encoding == UTF-8 or any packed EUC,
> and more than a handful of Shift JIS or Big5 characters in file names,
> one is *almost certain* to encounter ASCII as the second byte of a
> multibyte sequence. PEP 383 can't handle this

Hm, I haven't tried the implementation, but I thought that what would
happen is:

'\x85a'.decode('utf-8', 'utf8b/surrogate-replace/whateveritscalled') -> u'\uDC85a'

If that indeed doesn't happen, that's certainly a defect and should be
remedied.

> , but it is sure to be
> the most common use case for PEP 383 in East Asia.

Yes.
James
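The behaviour James expects can be tried out in released Python 3. The sketch below assumes Python 3.1 or later, where the handler is spelled "surrogateescape" (the thread predates that name) and where the input has to be a bytes object.

```python
# Assumption: "surrogateescape" is the released spelling of the handler
# that the message above calls 'utf8b/surrogate-replace/whateveritscalled'.
raw = b"\x85a"

# The undecodable byte 0x85 becomes U+DC85; the ASCII "a" is decoded normally.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")

# The same holds for a Shift-JIS lead byte followed by an ASCII trail byte.
sjis_text = b"\x8a|".decode("utf-8", "surrogateescape")

print(ascii(text), restored, ascii(sjis_text))
```

In both cases only the byte with the high bit set is escaped; the ASCII byte that follows is decoded as itself.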
Re: [Python-Dev] Undocumented change / bug in Python3's PyMapping_Check
John Millikin wrote:
> In Python 2, PyMapping_Check will return 0 for list objects. In Python
> 3, it returns 1. Obviously, this makes it rather difficult to
> differentiate between mappings and other sized iterables. In addition,
> it differs from the behavior of the ``collections.Mapping`` ABC --
> isinstance([], collections.Mapping) returns False.
>
> I believe the new behavior is erroneous, but would like to confirm
> that before filing a bug.

It's not a bug. PyMapping_Check just tells you if a type has an entry
in the tp_as_mapping->mp_subscript slot. In 2.x, it used to have an
additional condition that the tp_as_sequence->sq_slice slot be empty,
but that has gone away in Py3k because the sq_slice slot has been
removed.

Even in 2.x that test wasn't a reliable way of telling if something was
a mapping or a sequence - it happened to get it right for lists and
tuples (since they define __getslice__ and __setslice__), but this is
not the case for new-style user defined sequences:

>>> from operator import isMappingType
>>> class MySeq(object):
...     def __getitem__(self, idx):
...         # Is this a mapping or an unsliceable sequence?
...         return idx*2
...
>>> isMappingType(MySeq())
True

Use the new collections module ABCs to check for sequences and
mappings. That's what they're for, and they will give you a much more
reliable answer than the C level checks (which are really just an
implementation detail).

Cheers,
Nick.

-- 
Nick Coghlan | [email protected] | Brisbane, Australia
---
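As an illustration of the ABC-based check recommended above, here is a small sketch. It assumes a current Python 3, where the ABCs are imported from collections.abc (the thread refers to them as collections.Mapping); MySeq is the same toy class as in the session above.

```python
from collections.abc import Mapping, Sequence


class MySeq:
    """Same toy class as above: it defines only __getitem__."""

    def __getitem__(self, idx):
        return idx * 2


checks = {
    "list is a Mapping": isinstance([], Mapping),
    "list is a Sequence": isinstance([], Sequence),
    "dict is a Mapping": isinstance({}, Mapping),
    # Defining __getitem__ alone does not make a class either one;
    # it has to inherit from or be registered with the ABC.
    "MySeq is a Mapping": isinstance(MySeq(), Mapping),
    "MySeq is a Sequence": isinstance(MySeq(), Sequence),
}

for label, answer in checks.items():
    print(label, answer)
```

A class that wants to be treated as a sequence can say so explicitly, for example with Sequence.register(MySeq) or by inheriting from Sequence.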
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Zooko Wilcox-O'Hearn writes:
>
> I'm not thinking of API compatibility as much as
> data compatibility -- someone used Python 3.1 to write down some
> filenames, and now a few years later they are trying to use the
> latest and greatest Python release to read those filenames...

Well, if the filenames are generated by Python (as opposed to read
from an existing directory on disk), they should be regular unicode
objects without any lone surrogates, so I don't see the compatibility
problem.

Regards

Antoine.
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On approximately 5/6/2009 6:33 AM, came the following characters from
the keyboard of Stephen J. Turnbull:
> "Martin v. Löwis" writes:
>> In any case, Python 3.1b1 may get released today, so it's way too late
>> for new features in the PEP. They can wait for Python 3.2.
>
> You have convinced me that the PEP should wait as well. In its
> current form it is incomplete and dangerous.

I see nothing in this thread that suggests that the PEP is dangerous in
its current form. While I (still) think that more readable transcodings
could have been used, and while I had difficulty fully understanding
the PEP at first, now that I think I do understand the PEP, and it has
been somewhat clarified and amended, I cannot see how it could be
dangerous. A specific case of danger should be included with such a
statement.

Regarding incomplete, I agree it won't brush my teeth for me, but I
think it does solve the problem it sets out to solve.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On approximately 5/6/2009 3:08 AM, came the following characters from
the keyboard of MRAB:
> M.-A. Lemburg wrote:
>> Martin v. Löwis wrote:
>
> Judging by the existing names, I think that 'surrogate' would be
> reasonable. It already contains the meaning of substitute, it's not too
> long, and the codes which act as replacements are already called
> surrogates.
>
>> I want to avoid any such confusion with Python codecs and don't
>> understand why you are making a problem out of this.

+1 for "surrogate" as the name for the error handler.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On approximately 5/6/2009 12:53 AM, came the following characters from
the keyboard of Martin v. Löwis:
>> Sorry! I suggest substituting the paragraph above for the paragraph
>> which begins "The encode error handler interface presently requires..."
>> at line 129.
>
> Ah, ok. This was Glenn Linderman's text before - now it's yours :-)

Which is fine by me. Stephen's is more explanatory than mine, but says
the same thing.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Glenn Linderman wrote:
> On approximately 5/6/2009 3:08 AM, came the following characters from
> the keyboard of MRAB:
>> M.-A. Lemburg wrote:
>>> Martin v. Löwis wrote:
>>
>> Judging by the existing names, I think that 'surrogate' would be
>> reasonable. It already contains the meaning of substitute, it's not too
>> long, and the codes which act as replacements are already called
>> surrogates.
>>
>>> I want to avoid any such confusion with Python codecs and don't
>>> understand why you are making a problem out of this.
>
> +1 for "surrogate" as the name for the error handler.

+1 from me also
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:
> Zooko Wilcox-O'Hearn writes:
>>
>> I'm not thinking of API compatibility as much as data compatibility
>> -- someone used Python 3.1 to write down some filenames, and now a
>> few years later they are trying to use the latest and greatest
>> Python release to read those filenames...
>
> Well, if the filenames are generated by Python (as opposed to read
> from an existing directory on disk), they should be regular unicode
> objects without any lone surrogates, so I don't see the compatibility
> problem.

I meant that the application reads filenames from an existing directory
on disk, saves those filenames, and then later, using a future version
of Python, wants to read them and use them.

I'm not saying that I know this would be a problem. I'm saying that I
personally can't tell whether it would be a problem or not, and the
extensive discussions so far have not convinced me that there is anyone
who both understands PEP 383 and considers this use case. Many people
who apparently understand encoding issues well have said something to
the effect that there is no problem, but those people haven't yet
managed to get through my thick skull how I would use PEP 383 safely
for this sort of use case -- the one where data generated by
os.listdir() travels forward in time or the one where that data travels
sideways to other systems, including Windows or other systems that
validate incoming unicode.

That's why I am a bit uncomfortable about PEP 383 being quickly
implemented and deployed in Python 3.1.

By the way, much of the detailed discussion about what Tahoe requires
and how that may or may not benefit from PEP 383 has now moved to the
tahoe-dev mailing list:
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev .

Regards,

Zooko
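One way to look at the "data travels forward in time or sideways" concern is sketched below. It assumes Python 3.1 or later with the handler named "surrogateescape"; has_escaped_bytes and to_storable_bytes are hypothetical helper names, and the file name is invented. The idea is that a name containing escaped bytes cannot be written out as strict UTF-8, but the original bytes can always be recovered and stored instead.

```python
def has_escaped_bytes(name):
    """Hypothetical helper: True if name carries PEP 383 escape code points."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


def to_storable_bytes(name, fs_encoding="utf-8"):
    """Hypothetical helper: recover the original file name bytes for storage."""
    return name.encode(fs_encoding, "surrogateescape")


# Stand-in for a name returned by os.listdir() on a UTF-8 system whose
# directory contains a Latin-1 encoded file name.
original = b"report-\xe9.txt"
listed = original.decode("utf-8", "surrogateescape")

# Strict UTF-8 output (what a validating peer would expect) is refused.
try:
    listed.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

stored = to_storable_bytes(listed)
print(has_escaped_bytes(listed), strict_ok, stored == original)
```

Under this approach an application can test for escaped names before sending them to a system that validates Unicode, and keep the raw bytes for anything it needs to read back later.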
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On approximately 5/6/2009 12:18 PM, came the following characters from
the keyboard of Zooko Wilcox-O'Hearn:

> On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:
>> Zooko Wilcox-O'Hearn zooko.com> writes:
>>> I'm not thinking of API compatibility as much as data compatibility --
>>> someone used Python 3.1 to write down some filenames, and now a few
>>> years later they are trying to use the latest and greatest Python
>>> release to read those filenames...
>>
>> Well, if the filenames are generated by Python (as opposed to read
>> from an existing directory on disk), they should be regular unicode
>> objects without any lone surrogates, so I don't see the compatibility
>> problem.
>
> I meant that the application reads filenames from an existing directory
> on disk, saves those filenames, and then later, using a future version
> of Python, wants to read them and use them.

Regarding future versions of Python. In the worst case, even if
Python's default behavior changes, the transcoding done by PEP 383 can
be done in other software too... it is a straightforward, fully
specified, 1-to-1, reversible transcoding process, affecting and
generating only invalid byte encodings on one side, and invalid Unicode
sequences on the other. So if Python's default behavior should change,
the transcoding implemented by PEP 383 could be easily reimplemented to
enable a future version of a Python application to manipulate the
transcoded, saved, filenames. By easily, I mean that I could code it in
a couple hours, max.

> I'm not saying that I know this would be a problem. I'm saying that I
> personally can't tell whether it would be a problem or not, and the
> extensive discussions so far have not convinced me that there is anyone
> who both understands PEP 383 and considers this use case.

Does the above help?

> Many people who apparently understand encoding issues well have said
> something to the effect that there is no problem, but those people
> haven't yet managed to get through my thick skull how I would use PEP
> 383 safely for this sort of use case -- the one where data generated by
> os.listdir() travels forward in time or the one where that data travels
> sideways to other systems, including Windows or other systems that
> validate incoming unicode.

Regarding data traveling sideways, some comments:

1) PEP 383's effect could be recoded in other languages as easily as it
is in Python (or the C in which Python is implemented). So that could
be a solution.

2) You mention "Windows" and "other systems that validate incoming
unicode" in the same phrase, as if you think that "Windows" qualifies as
an "other systems that validate incoming unicode", but it does not (at
least not universally).

> That's why I am a bit uncomfortable about PEP 383 being quickly
> implemented and deployed in Python 3.1.

Does the above help?

> By the way, much of the detailed discussion about what Tahoe requires
> and how that may or may not benefit from PEP 383 has now moved to the
> tahoe-dev mailing list:
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev .

I have no background with Tahoe, nor particular interest, although it
sounds like a useful project... so I won't be joining that list. I have
no idea if there is an installed base of existing Tahoe file systems; my
suggestions below assume that there is not, and that you are presently
inventing them. Therefore, I provide no migration path, although I
could invent one, but it would take longer to describe.

However, since I'm responding here, and have read what you have posted
here, it seems like the following could be true.

Assumptions from your emails:

A) Tahoe wants to provide a UTF-8 file name system
B) Tahoe wants to interface to POSIX systems that use (and do not
   validate) byte interfaces.
C) Tahoe wants to interface to non-POSIX systems that use 16-bit file
   name interfaces, with no validation.
D) Tahoe wants to interface to non-POSIX systems that use 16-bit file
   name interfaces, with validation.

Uncertainties:

I'm not clear on what your goals are for Tahoe filenames. There seem to
be 2 possibilities:

1) you want to reject attempts to use non-validating Unicode, be it from
   a 16-bit interface, or a bytes interface.
2) you don't want to reject non-validating Unicode, but you want to
   convert it to valid Unicode for (D) systems.
3) Orthogonally, you might want to store only Valid Unicode in the
   names, or you might not care, if you can meet the other goals.

Truisms:

If you want to support (D), and (2), then you must transform names at
some point, using some scheme, because not all names supplied by (B)
systems will be acceptable to (D) systems. You can choose to do this
transformation when a (B) system provides an invalid (per Unicode) name,
or you can choose to do the transformation when a (D) system accesses a
file with an invalid (per Unicode) name.

If the (B) and (D) systems talk to each other outside of Tahoe, they
will have to do similar transf
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
>>> The name "utf8b" suggested in the PEP is not in line with the codec >>> design >> Where is that design documented, and how exactly violates the name >> the design (chapter and verse, please). > > Martin, I designed the whole Python codec machinery Not true. PEP 293 was written and designed by Walter Dörwald. > so even if > this is not explicitly written down somewhere, you can take my > word for it. If the design was specified in writing somewhere, I would probably challenge it as obsolete. If it isn't described anywhere, I'll have to ignore it. > I want to avoid any such confusion with Python codecs and don't > understand why you are making a problem out of this. Because utf8b (or, perhaps "UTF-8b") is the official name for this algorithm: http://hyperreal.org/~est/utf-8b/ Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
> I'm sorry for the lack of clarity of my posts, but somehow you're > completely missing the point. The point is precisely that Python > *won't* use Shift JIS as the file system encoding (if it did there > would be no problem with reading Shift JIS), but the people who > created the media *did*. > > Now, with Python's file system encoding == UTF-8 or any packed EUC, > and more than a handful of Shift JIS or Big5 characters in file names, > one is *almost certain* to encounter ASCII as the second byte of a > multibyte sequence. PEP 383 can't handle this Not true. PEP 383 handles this very example just fine, with no problems that I can see. Can you propose a specific example that you think might cause problems? By "specific", I mean: what file names (exact bytes, please), what locale charset, what API calls. Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
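A worked instance of the case under discussion, as a minimal sketch: the bytes `\x81@` are a two-byte Shift JIS character whose second byte is ASCII. The sketch assumes the released handler name "surrogateescape" for what this thread calls utf8b, and a UTF-8 file system encoding.

```python
# Two-byte Shift JIS sequence whose trailing byte is the ASCII character "@".
raw = b"\x81@"

# Read with the encoding the media was written in: one ideographic space.
intended = raw.decode("shift_jis")

# Read under a UTF-8 locale with the escaping handler: the lead byte is
# escaped as a lone surrogate, the ASCII byte decodes as itself.
seen = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
restored = seen.encode("utf-8", "surrogateescape")
```

The name is not displayed as the author intended, but it is readable and it round-trips, which is all the PEP promises for such files.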
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
> Judging by the existing names, I think that 'surrogate' would be > reasonable MAL's list of existing names is incomplete. "surrogates" is already an existing name, also, and it means something different (similar, but different). Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Terry Reedy wrote: > Glenn Linderman wrote: >> On approximately 5/6/2009 3:08 AM, came the following characters from >> the keyboard of MRAB: >>> M.-A. Lemburg wrote: Martin v. Löwis wrote: >> >>> Judging by the existing names, I think that 'surrogate' would be >>> reasonable. It already contains the meaning of substitute, it's not too >>> long, and the codes which act as replacements are already called >>> surrogates. >>> I want to avoid any such confusion with Python codecs and don't understand why you are making a problem out of this. >> >> >> +1 for "surrogate" as the name for the error handler. >> >> > +1 from me also Despite there being also an error handler called "surrogates". Are you serious? Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
> Is it only usable with utf8 as an encoding? No, it applies to any codec which potentially cannot decode all bytes >127. Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
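A short illustration of the point that the handler is not tied to UTF-8. This is a sketch that assumes the released handler name "surrogateescape"; the sample bytes are arbitrary.

```python
raw = b"caf\xe9"  # 0xE9 is outside ASCII and is not valid UTF-8 here

# Codecs that cannot decode the byte escape it in the same way.
as_ascii = raw.decode("ascii", "surrogateescape")
as_utf8 = raw.decode("utf-8", "surrogateescape")

# A codec that can decode every byte never invokes the handler at all.
as_latin1 = raw.decode("latin-1", "surrogateescape")
```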
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Martin v. Löwis v.loewis.de> writes: > > Despite there being also an error handler called "surrogates". People, perhaps we could end all the bikeshedding and call one of those handlers "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful to what they actually /do/? Regards Antoine. ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
> But first, it should be stopped by any of several > standard precautions. For example, applying os.path.realpath (come to > think of it, PEP 383 should say something about realpath, shouldn't > it?) Why do you think so? I think the existing documentation of realpath is correct and complete. > and os.path.normpath (PEP 383 should definitely say something > about this function Precisely what? > maybe PEP 3131 should, too) How can this be of relevance? > > Nothing is lost at the moment. > > Nothing is lost compared to 'strict', true, but under the PEP as it is > a large fraction of Shift JIS and Big5 filenames cannot be read under > ASCII-compatible file system encodings using 'utf8b'. Yet it is those > users who are placed at risk by PEP 383. I think this statement is incorrect. Those filenames *can* be read just fine. Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Antoine Pitrou wrote: > Martin v. Löwis v.loewis.de> writes: >> Despite there being also an error handler called "surrogates". > > People, perhaps we could end all the bikeshedding and call one of those > handlers > "surrogates-pass" and the other "surrogates-escape", which sounds quite > faithful > to what they actually /do/? The problem with these bike-shedding discussions is that you cannot stop them with a proposal. People will counter-propose. I would be willing to accept a ruling from someone who a) is a native speaker of English, and b) has demonstrated to fully understand what these do, and c) has understood why I insist on calling it utf8b. Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Martin v. Löwis wrote:
>>> +1 for "surrogate" as the name for the error handler.
>> +1 from me also
>
> Despite there being also an error handler called "surrogates".

Given that additional information which MAL apparently omitted, I would
revise.

> Are you serious?

Are you? ;-? You are the one naming a codec-agnostic error handler (if
I understand correctly, and correct me if I do not) after a particular
codec, and denying that that could cause confusion. See other message.

Terry Jan Reedy
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Martin v. Löwis wrote:
> Because utf8b (or, perhaps "UTF-8b") is the official name for this
> algorithm:
>
> http://hyperreal.org/~est/utf-8b/

Thank you for the link. It starts: "This directory contains a C
implementation of a UTF-8b codec. A Python codec based on it is provided
as well."

'UTF-8b' consists, obviously, of 'UTF-8' plus 'b', with the 'b'
signifying a variation of or addition to UTF-8. The 'b', and only the
'b', refers to the innovative error-handler that was added to the
existing 'UTF-8' codec/algorithm. The name of the combined whole is not
the name of the part.

If you were incorporating the Python-wrapped utf-8b *codec* as a codec,
which is what I once thought *because you used that name*, then calling
it 'utf-8b' would be fine. But you apparently instead proposed and
implemented an *error-handler*, which seems to me to be something else,
and which will not be specific to utf-8 but usable with any codec. Hence
some of us think it should have a different name.

I gather that you lifted the error-handler part of the algorithm and
propose to use it with *any* ascii-respecting codec. I could claim that
the 'official name' of that part is 'b', but I think we can find a
better name.

Terry Jan Reedy
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
2009/5/6 Antoine Pitrou : > Martin v. Löwis v.loewis.de> writes: >> >> Despite there being also an error handler called "surrogates". > > People, perhaps we could end all the bikeshedding and call one of those > handlers > "surrogates-pass" and the other "surrogates-escape", which sounds quite > faithful > to what they actually /do/? We could also stop the bikeshedding by sticking with the name utf8b. Martin's comment that it is the official name for this algorithm seems compelling to me (even if it is confusing because of its similarity with utf-8). Paul. ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Martin v. Löwis wrote:
> Antoine Pitrou wrote:
>> Martin v. Löwis v.loewis.de> writes:
>>> Despite there being also an error handler called "surrogates".
>>
>> People, perhaps we could end all the bikeshedding and call one of
>> those handlers "surrogates-pass" and the other "surrogates-escape",
>> which sounds quite faithful to what they actually /do/?
>
> The problem with these bike-shedding discussions is that you cannot
> stop them with a proposal. People will counter-propose.
>
> I would be willing to accept a ruling from someone who a) is a native
> speaker of English, and b) has demonstrated to fully understand what
> these do, and c) has understood why I insist on calling it utf8b.

I qualify with a). I believe I understand c) but, as explained in my
other post, I do not think your reason applies. In fact, I think concern
for naming rights might suggest that you *not* reuse the name for
something different.

I would have to learn more about the existing 'surrogates' handler to
judge Antoine's suggestion 'surrogates-pass'. 'Surrogates-escape' is
pretty good for the new handler since, to my understanding, it 'escapes'
'bad bytes' by prefixing them with bits that push them to the surrogates
plane.

I have been supportive of the idea and, as well as I understood them,
the particulars of your proposal, from the beginning. Reusing the name
of a codec as the name of an error-handler confused me and I believe it
will confuse others, even though, but also because, the error handler
was extracted and generalized from the codec.

Terry Jan Reedy
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
>> Are you serious? > > Are you? ;-? You are the one naming a codec-agnostic error handler (if > I understand correctly, and correct me if I do not) after a particular > codec, and denying that that could cause confusion. See other message. I can only repeat what I said before: I call it utf8b because that's the established name for the algorithm it implements. That algorithm was originally designed with UTF-8 in mind (and only meant to be applied for UTF-8), however, it remains the same algorithm even though PEP 383 widens its application. Regards, Martin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Antoine Pitrou wrote:
> Martin v. Löwis v.loewis.de> writes:
>> Despite there being also an error handler called "surrogates".
>
> People, perhaps we could end all the bikeshedding and call one of those
> handlers "surrogates-pass" and the other "surrogates-escape", which
> sounds quite faithful to what they actually /do/?

After having read about the existing error handler called "surrogates"
and having thought about it, I've decided that calling one just
"surrogates" isn't very helpful to the user; it has something to do with
surrogates, but what?

So +1 for Antoine's suggestion from me.
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
> I qualify with a). I believe I understand c) but, as explained in my
> other post, I do not think your reason applies. In fact, I think
> concern for naming rights might suggest that you *not* reuse the name
> for something different. I would have to learn more about the existing
> 'surrogates' handler to judge Antoine's suggestion 'surrogates-pass'.
> 'Surrogates-escape' is pretty good for the new handler since, to my
> understanding, it 'escapes' 'bad bytes' by prefixing them with bits that
> push them to the surrogates plane.
See issue 3672. In essence, in python 2.5:
py> u"\ud800".encode("utf-8")
'\xed\xa0\x80'
py> '\xed\xa0\x80'.decode("utf-8")
u'\ud800'
In 3.1,
py> "\ud800".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed
py> "\ud800".encode("utf-8","surrogates")
b'\xed\xa0\x80'
py> b'\xed\xa0\x80'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2:
illegal encoding
py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
'\ud800'
Regards,
Martin
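For readers trying the transcript above on a released interpreter: the handler shown there as "surrogates" ships under the name "surrogatepass", and the handler this thread calls utf8b ships as "surrogateescape". The sketch below uses those released names to contrast the two; it is an illustration, not part of the original message.

```python
lone = "\ud800"  # a lone high surrogate

# Strict UTF-8 refuses to encode it.
try:
    lone.encode("utf-8")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# "surrogatepass" lets the surrogate through in its UTF-8-like byte form...
passed = lone.encode("utf-8", "surrogatepass")
# ...and accepts that byte form again on decoding.
back = passed.decode("utf-8", "surrogatepass")

# "surrogateescape" does something different: it treats the same three
# bytes as undecodable and escapes each one separately.
escaped = passed.decode("utf-8", "surrogateescape")
```

One handler passes surrogates through, the other escapes bad bytes as surrogates, which is the distinction the "pass"/"escape" naming proposal is trying to capture.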
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Martin v. Löwis v.loewis.de> writes:
> py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
> '\ud800'
The point is, "surrogates" does not mean anything intuitive for an /error
handler/. You seem to be the only one who finds this name explicit enough,
perhaps because you chose it.
Most other handlers' names have verbs in them ("ignore", "replace",
"xmlcharrefreplace", etc.).
Regards
Antoine.
Re: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath
Eric Smith wrote: Mark: I've reviewed this and it looks okay to me. Thanks Eric - I've now applied that patch. As you mentioned in a followup to the bug: | Thanks for looking at this, Mark. If we could only assign issues to | Python 3.2 and 3.3 to change the pending deprecation warning to a real | one, and to remove the function entirely, we'd be all set! I'm always | worried we'll forget these things. (for reference; the patch introduces a PendingDeprecationWarning for ntpath.uncpath) The bug tracker doesn't have these future versions available yet - is there some other way these things should be tracked? I fear simply opening a new bug without a reasonable 'trigger' will linger way beyond the next few versions... Thanks, Mark ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On Wed, May 6, 2009 at 15:42, "Martin v. Löwis" wrote:
> Despite there being also an error handler called "surrogates".
Not that I have to be, but I'm not sold on the previous UTF-8 codec
behavior becoming an error handler of the name "surrogates" for two
reasons (I do respect the obvious PBP argument for the implementation,
and have no better name - "lenient"?).
First, unless there's a way to stack error handlers, there's no way to
access the old behavior combined with the "replace" handler. Second,
errors="surrogates" reads like surrogates should be an error, not an
additionally allowed pattern. Neither of these are deal breakers or
hard to learn, but they are non-obvious. I think the utf8b behavior
makes a lot more sense with the name "surrogates", through the
mnemonic that errors become surrogates.
The stacking argument also applies to the new utf8b behavior on encode
(only, as it handles all errors on decode). This may be a YAGNI, but
for a non-UTF-8 encode, it may be useful to allow "xmlcharrefreplace"
handling for unavailable non-surrogate-escaped characters. But without
stacking that's unmaintainable, as we clearly don't want ${codec}b for
all current codecs.
I'd be perfectly happy with utf8b or UTF-8b, as either a codec or an
error handler (do we want both? YAGNI?). So what if it smells a little
inaccurate as a handler when used with codecs other than UTF-8, no big
deal. I could also see something like errors="roundtrip" which
explains the intention of the handler rather than the algorithm, but
is awkward on encode when it encounters unavailable Unicode
characters.
--
Michael Urman
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Martin v. Löwis wrote:
>>>> The name "utf8b" suggested in the PEP is not in line with the codec
>>>> design
>>> Where is that design documented, and how exactly violates the name
>>> the design (chapter and verse, please).
>> Martin, I designed the whole Python codec machinery
>
> Not true. PEP 293 was written and designed by Walter Dörwald.

Walter added the generic error handler callback mechanism and we both
worked on their design.

I designed and wrote the codec implementation back in 2000, which
included the whole idea of having codec error handlers in the first
place. The original implementation only allowed per-codec error
handlers. Walter extended this to build general-purpose handlers that
could be used by many codecs. His original motivation was to be able to
do XML character reference escaping.

If you don't believe me, go look this up in the repository, the mailing
list archives and the trackers.

>> so even if
>> this is not explicitly written down somewhere, you can take my
>> word for it.
>
> If the design was specified in writing somewhere, I would probably
> challenge it as obsolete. If it isn't described anywhere, I'll have
> to ignore it.

Ah, lovely attitude.

>> I want to avoid any such confusion with Python codecs and don't
>> understand why you are making a problem out of this.
>
> Because utf8b (or, perhaps "UTF-8b") is the official name for this
> algorithm:
>
> http://hyperreal.org/~est/utf-8b/

That's a codec implementing the escaping idea proposed by Markus Kuhn,
not an official reference.

AFAIK, the term "UTF-8B" originated from a "UTF-8 + binary" codec
written for iconv:

http://mail.nl.linux.org/linux-utf8/2006-04/msg2.html

If it were the official name of an escape algorithm, as you are
suggesting, the inventor Markus Kuhn would probably have chosen it, but
he hasn't... the only reference to it is an email where it is described
as option D for ways of dealing with malformed UTF-8 data in a decoder:

http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

Note that this escape method is not applicable for data that you decode
from UTF-8 and then e.g. encode as Latin-1. It only works as a general
purpose method if you are decoding and encoding using the same codec,
since it is specifically designed to assure round-trip safety.

Martin, please stop being silly and just change the name.

Or drop the idea of using an error handler altogether and just let
people use the utf-8b codec you referenced above to solve their problems
wherever and if needed.

Thanks,
--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, May 07 2009)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

2009-06-29: EuroPython 2009, Birmingham, UK    52 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! :::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/
[Python-Dev] test - please ignore
Some of my messages appear not to have gotten through. -- Regards, Benjamin ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
[Python-Dev] [RELEASED] Python 3.1 beta 1
On behalf of the Python development team, I'm thrilled to announce the first and only beta release of Python 3.1. Python 3.1 focuses on the stabilization and optimization of features and changes Python 3.0 introduced. For example, the new I/O system has been rewritten in C for speed. File system APIs that use unicode strings now handle paths with undecodable bytes in them. [1] Other features include an ordered dictionary implementation and support for ttk Tile in Tkinter. For a more extensive list of changes in 3.1, see http://doc.python.org/dev/py3k/whatsnew/3.1.html or Misc/NEWS in the Python distribution. Please note that this is a beta release, and as such is not suitable for production environments. We continue to strive for a high degree of quality, but there are still some known problems and the feature sets have not been finalized. This beta is being released to solicit feedback and hopefully discover bugs, as well as allowing you to determine how changes in 3.1 might impact you. If you find things broken or incorrect, please submit a bug report at http://bugs.python.org For more information and downloadable distributions, see the Python 3.1 website: http://www.python.org/download/releases/3.1/ See PEP 375 for release schedule details: http://www.python.org/dev/peps/pep-0375/ Enjoy, -- Benjamin Benjamin Peterson benjamin at python.org Release Manager (on behalf of the entire python-dev team and 3.1's contributors) ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
"Martin v. Löwis" writes: > > Now, with Python's file system encoding == UTF-8 or any packed EUC, > > and more than a handful of Shift JIS or Big5 characters in file names, > > one is *almost certain* to encounter ASCII as the second byte of a > > multibyte sequence. PEP 383 can't handle this Ah, I see. Of course, the algorithm not only has to handle the ASCII octet which is erroneous because it can't be a trailing byte, but *also the leading byte that signalled to expect a trailing byte >127*. So the algorithm backs up to the character boundary (which is well-defined for all the "sane" encodings), encode the high byte(s) in the character with lone surrogates, and encode the ASCII as itself (promoted to a Unicode code point). Sorry, you're right, I was just confused. I withdraw the objection as completely mistaken, and apologize for not thinking more carefully in the first place. ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Martin v. Löwis wrote:
>>> Are you serious?
>>
>> Are you? ;-? You are the one naming a codec-agnostic error handler
>> (if I understand correctly, and correct me if I do not) after a
>> particular codec, and denying that that could cause confusion. See
>> other message.
>
> I can only repeat what I said before: I call it

What, specifically, is 'it'?

> utf8b because that's the established name for the algorithm

Which algorithm?

> it implements.

Again, what is 'it'? As *I* read the sentence above, it is not true.

I went to the site you referred to as the source of your reasoning and
specifically
http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/utf_8b.c

The algorithm called utf-8b *IS* utf-8 with the addition or replacement
(of an error return) of essentially one line in each direction:

# encode
if 0xDC00 <= codepoint <= 0xDCFF: byte = codepoint - 0xDC00

Note: for security concerns, you are increasing the lower limit to
0xDC80. The comment at the top of utf_8b.c suggests that that is what
it should be and should have been in the file, with the other half of
that surrogate area an error along with the other surrogate area.

# decode
if (0x80 <= byte <= 0xFF) and utf-8-invalid(byte): codepoint = byte + 0xDC00

> That algorithm was originally designed with UTF-8 in mind (and only
> meant to be applied for UTF-8), however, it remains the same algorithm
> even though PEP 383 widens its application.

The error handler designed with utf-8 in mind has no name in the encode
direction and is called "utf_8b_decoder_invalid_bytes" in the decode
direction. By your reasoning, *that* should be its name in Python. The
encoding error handler would then be named analogously
"utf_8b_encoder_invalid_codepoints". Even these, to me, would be better
than confusingly giving them the same name as the codec.

Terry Jan Reedy
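The two one-line rules quoted in this message can be written out as a Python error handler. What follows is a hypothetical re-implementation for illustration only: the names `utf8b_style_escape` and "utf8b_sketch" are invented here, the 0xDC80 lower limit follows the note above, and the comparison with "surrogateescape" assumes that is the released name of the PEP 383 handler.

```python
import codecs


def utf8b_style_escape(exc):
    """Hypothetical sketch of the escaping rule; not CPython's own code."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Only bytes 0x80..0xFF may be escaped; ASCII must never be.
        if any(b < 0x80 for b in bad):
            raise exc
        # decode: codepoint = byte + 0xDC00
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        chunk = exc.object[exc.start:exc.end]
        # Only code points 0xDC80..0xDCFF are escaped bytes.
        if any(not 0xDC80 <= ord(c) <= 0xDCFF for c in chunk):
            raise exc
        # encode: byte = codepoint - 0xDC00
        return bytes(ord(c) - 0xDC00 for c in chunk), exc.end
    raise exc


# Register under a made-up name so it can be passed as errors=...
codecs.register_error("utf8b_sketch", utf8b_style_escape)

raw = b"ab\xff\xfe"
text = raw.decode("utf-8", "utf8b_sketch")
```

Because the rule lives in a handler rather than in a codec, the same function can be passed to any ASCII-compatible codec, which is the generalization being argued over in this thread.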
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On approximately 5/6/2009 6:06 PM, came the following characters from the keyboard of M.-A. Lemburg: Martin, please stop being silly and just change the name. Yes, please. If indeed Marc-Andre invented the codec business as he claims, he would be an appropriate person to give a fiat name to the error handler. Or drop the idea of using an error handler altogether and just let people use the utf-8b codec you referenced above to solve their problems whereever and if needed. The design as an error handler is clever in leveraging the same error handler for multiple codecs, which cannot be done by using utf-8b alone, if I understand correctly. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking ___ Python-Dev mailing list [email protected] http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
Michael Urman wrote:
> On Wed, May 6, 2009 at 15:42, "Martin v. Löwis" wrote:
>> Despite there being also an error handler called "surrogates".
>
> Not that I have to be, but I'm not sold on the previous UTF-8 codec
> behavior becoming an error handler of the name "surrogates" for two
> reasons (I do respect the obvious PBP argument for the implementation,
> and have no better name - "lenient"?).
PBP?
> First, unless there's a way to stack error handlers, there's no way to
> access the old behavior combined with the "replace" handler.
Well, there is a way to stack error handlers, although it's not pretty:
import codecs

_surrogates = codecs.lookup_error("surrogates")
_replace = codecs.lookup_error("replace")

def surrogates_then_replace(exc):
    # Try "surrogates" first; fall back to "replace" if it declines.
    try:
        return _surrogates(exc)
    except UnicodeError:
        return _replace(exc)

codecs.register_error("surrogates_then_replace",
                      surrogates_then_replace)
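The same stacking idea can be written once for any list of handlers. The following is a minimal sketch under some assumptions: chain_error_handlers and the registered name "escape_then_replace" are hypothetical, and it uses "surrogateescape", the name under which the PEP 383 handler eventually shipped, so that it runs on current Python 3.

```python
import codecs


def chain_error_handlers(name, *handler_names):
    """Register `name` as a handler that tries each named handler in turn."""
    if not handler_names:
        raise ValueError("at least one handler name is required")
    handlers = [codecs.lookup_error(n) for n in handler_names]

    def chained(exc):
        for handler in handlers[:-1]:
            try:
                return handler(exc)
            except UnicodeError:
                continue  # this handler declined; try the next one
        # The last handler's result, or its exception, is final.
        return handlers[-1](exc)

    codecs.register_error(name, chained)
    return chained


chain_error_handlers("escape_then_replace", "surrogateescape", "replace")

# An escaped byte is restored; any other unencodable character becomes "?".
escaped = "a\udc81b".encode("ascii", "escape_then_replace")
replaced = "a\u00e9b".encode("ascii", "escape_then_replace")
```

The fallback applies per error reported by the codec, not per character, so a run of unencodable characters that mixes escaped bytes with other characters may be handed to the fallback handler as a whole.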
> The stacking argument also applies to the new utf8b behavior on encode
> (only, as it handles all errors on decode). This may be a YAGNI
Indeed - in particular, as, in the primary application of this error
handler (i.e. file IO operations), there is no way of specifying
an additional error handler anyway.
Regards,
Martin
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
> By the way, what are the ASCII characters that are not supported by
> Shift-JIS?
> Not many I suppose? (if I read the Wikipedia entry correctly, it's only the
> backslash and the tilde).
The problem with this encoding is that bytes below 128 appear as second
bytes of a two-byte encoding:
py> "\x81@".decode("shift-jis")
u'\u3000'
py> "\x81A".decode("shift-jis")
u'\u3001'
So on decoding, it may be the second byte (i.e. the ASCII byte) that
causes a problem:
py> "\x81/".decode("shift-jis")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 0-1: illegal multibyte sequence
For the shift-jis codec, that's actually not a problem, though:
py> b"\x81/".decode("shift-jis","utf8b")
'\udc81/'
so the utf8b error handler will escape the first of the two bytes,
and then pass the second byte to the codec again, which then decodes
as ASCII.
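The behaviour described above can be reproduced on current Python 3. This is a sketch under one assumption: the handler is spelled "surrogateescape", the name under which PEP 383 eventually shipped, instead of "utf8b". The variable names are illustrative only.

```python
raw = b"\x81/"  # 0x81 is a Shift JIS lead byte, "/" is not a valid trail byte

# Strict decoding rejects the sequence.
strict_error = None
try:
    raw.decode("shift_jis")
except UnicodeDecodeError as exc:
    strict_error = exc

# With the escaping handler, only the lead byte is escaped; decoding then
# resumes at "/", which the codec decodes as plain ASCII.
text = raw.decode("shift_jis", "surrogateescape")

# A valid two-byte sequence whose second byte is ASCII "@" is unaffected.
ideographic_space = b"\x81@".decode("shift_jis", "surrogateescape")
```

Here text holds the escaped lead byte as U+DC81 followed by "/", and ideographic_space is the single character U+3000.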
Regards,
Martin
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
>> So are you proposing that I should rename the PEP 383 handler
>> to "utf_8b_encoder_invalid_codepoints"?
>
> No, he's saying that your algorithm for choosing the PEP 383 handler
> should have come up with that name, rather than utf8b. But since PEP
> 383 applies to other codecs besides UTF-8, it should have a different
> name. And one that is less cumbersome than
> "utf_8b_encoder_invalid_codepoints"

I'm still at a loss what name to give it, though. I understand that
I have to rename both error handlers, but I'm uncertain what I should
rename them to. So proposals that rename only one of them aren't that
helpful.

It would be helpful if people would indicate support for Antoine's
proposal.

Regards,
Martin
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
On approximately 5/6/2009 10:53 PM, came the following characters from
the keyboard of Martin v. Löwis:
>> The error handler designed with utf-8 in mind has no name in the encode
>> direction and is called "utf_8b_decoder_invalid_bytes" in the decode
>> direction. By your reasoning, *that* should be its name in Python. The
>> encoding error handler would then be named analogously
>> "utf_8b_encoder_invalid_codepoints". Even these, to me, would be better
>> than confusingly giving them the same name as the codec.
>
> So are you proposing that I should rename the PEP 383 handler
> to "utf_8b_encoder_invalid_codepoints"?

No, he's saying that your algorithm for choosing the PEP 383 handler
should have come up with that name, rather than utf8b. But since PEP 383
applies to other codecs besides UTF-8, it should have a different name.
And one that is less cumbersome than "utf_8b_encoder_invalid_codepoints"

--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
> Wouldn't renaming the existing "surrogates" handler be an incompatible
> change, and thus inappropriate?

No - it's new in Python 3.1.

So what do you think about Antoine's proposal?

Regards,
Martin
Re: [Python-Dev] PEP 383 update: utf8b is now the error handler
> The error handler designed with utf-8 in mind has no name in the encode
> direction and is called "utf_8b_decoder_invalid_bytes" in the decode
> direction. By your reasoning, *that* should be its name in Python. The
> encoding error handler would then be named analogously
> "utf_8b_encoder_invalid_codepoints". Even these, to me, would be better
> than confusingly giving them the same name as the codec.

So are you proposing that I should rename the PEP 383 handler
to "utf_8b_encoder_invalid_codepoints"?

Regards,
Martin
