Re: Best ways of managing text encodings in source/regexes?
Please see the correction from Cliff pasted here after this excerpt from Tim:

> ... python will, without an explicit decode, assume the byte string is ASCII which is a subset of Unicode (ISO-8859-1 isn't).

The one comment I'd make is that ASCII and ISO-8859-1 are both subsets of Unicode (which relates to the abstract code points), but ASCII is also a subset of UTF-8 at the byte-stream level, while ISO-8859-1 is not a subset of UTF-8, nor, as far as I can tell, of any other Unicode *encoding*. Thus a file encoded in ASCII *is* in fact a UTF-8 file; there is no way to distinguish the two. But an ISO-8859-1 file is not the same (at the byte-stream level) as a file with identical content in UTF-8 or any other Unicode encoding.
--
http://mail.python.org/mailman/listinfo/python-list
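The byte-level distinction above is easy to check at the interpreter. A minimal sketch (the word "café" is just an arbitrary example): ASCII bytes decode as UTF-8 unchanged, while an ISO-8859-1 byte above 0x7F is rejected by a UTF-8 decoder.

```python
# ASCII bytes are valid UTF-8 as-is: the encodings agree byte-for-byte.
ascii_bytes = u'cafe'.encode('ascii')
assert ascii_bytes.decode('utf-8') == u'cafe'

# ISO-8859-1 bytes are NOT valid UTF-8: 0xE9 (é) is a bare high byte,
# which UTF-8 would expect to be part of a multi-byte sequence.
latin1_bytes = u'caf\u00e9'.encode('iso-8859-1')  # b'caf\xe9'
try:
    latin1_bytes.decode('utf-8')
except UnicodeDecodeError:
    print('ISO-8859-1 bytes are not a valid UTF-8 stream')
```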
Re: Best ways of managing text encodings in source/regexes?
OK, for those interested in this sort of thing, this is what I now think is necessary to work with Unicode in Python. Thanks to those who gave feedback, and to Cliff in particular (but any remaining misconceptions are my own!). Here are the results of my attempts to come to grips with this. Comments/corrections welcome...

(Note this is not about which characters one expects to match with \w etc. when compiling regexes with Python's re.UNICODE flag. It's about the encoding of one's source strings when building regexes, in order to match against strings read from files of various encodings.)

I/O: READ TO/WRITE FROM UNICODE STRING OBJECTS. Always use codecs to read from a specific encoding into a Python Unicode string, and use codecs to encode to a specific encoding when writing the processed data. Reading through codecs delivers a Unicode string decoded from a specific encoding, and writing through codecs will put the Unicode string into a specific encoding (be that an encoding of Unicode such as UTF-8, or anything else).

SOURCE: Save the source as UTF-8 in your editor, tell Python with # -*- coding: utf-8 -*-, and construct all strings with u'' (or ur'' instead of r''). Then, when you're concatenating strings constructed in your source with strings read with codecs, you needn't worry about conversion issues. (When concatenating byte strings from your source with Unicode strings, Python will, without an explicit decode, assume the byte string is ASCII, which is a subset of Unicode (ISO-8859-1 isn't).)

Even if you save the source as UTF-8 and tell Python with # -*- coding: utf-8 -*-, saying myString = 'blah' still gives you a byte string. To construct a Unicode string you must say myString = u'blah' or myString = unicode('blah'), even if your source is UTF-8. Typing 'u' when constructing all strings isn't too arduous, and it's less effort than passing selected non-ASCII source strings to unicode() and needing to remember where to do it.
(You could easily slip a non-ASCII char into a byte string in your code, because most editors and default system encodings will allow this.) Doing everything in Unicode simplifies life. Since the source is now UTF-8, and given Unicode support in the editor, it doesn't matter whether you use Unicode escape sequences or literal Unicode characters when constructing strings, since:

>>> u'á' == u'\u00E1'
True

REGEXES: I'm a bit less certain about regexes, but this is how I think it's going to work: now that my regexes are constructed from Unicode strings, and those regexes will be compiled to match against Unicode strings read with codecs, any potential problem with encoding conversion disappears. If I put an en-dash into a regex built using u'', and I happen to have read the file in the ASCII encoding (which doesn't support en-dashes), the regex simply won't match, because the pattern doesn't exist in the /Unicode/ string served up by codecs. There's no actual problem with my string encoding handling; it just means I'm looking for the wrong chars in a Unicode string read from a file not saved in a Unicode encoding.

Tim
--
http://mail.python.org/mailman/listinfo/python-list
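The decode-match-encode round trip described above can be sketched concretely. This is an illustration only (the filename and contents are made up): write a file through codecs in UTF-8, read it back as a Unicode string, and match it with a regex built from a u'' literal.

```python
import codecs
import os
import re
import tempfile

# Write some text containing an en-dash (U+2013) out in UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'dashes.txt')
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(u'pages 12\u201334')

# Reading through codecs yields a Unicode string, not raw bytes.
with codecs.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# A pattern built from a u'' literal matches the decoded text directly;
# no byte-level encoding details leak into the regex.
dash_re = re.compile(u'\u2013')
assert dash_re.search(text) is not None
```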
Re: Best ways of managing text encodings in source/regexes?
On Tue, Nov 27, 2007 at 01:30:15AM +0200, tinker barbet wrote regarding Re: Best ways of managing text encodings in source/regexes?:

> Hi, thanks for your responses; as I said in the reply I posted, I thought later it was a bit long, so I'm grateful you held out! I should have said (but see comment about length) that I'd tried joining a Unicode and a byte string in the interpreter and got that working, but wondered whether it was safe (I'd seen the odd warning about mixing byte strings and Unicode). Anyway, what I was concerned about was what Python does with source files rather than what happens from the interpreter, since I don't know if it's possible to change the encoding of the terminal without messing with site.py (and what else).
>
> Aren't both ASCII and ISO-8859-1 subsets of UTF-8? Can't you then use chars from either of those charsets in a file saved as UTF-8 by one's editor, with a # -*- coding: utf-8 -*- pseudo-declaration for Python, without problems? You seem to disagree.

I do disagree. Unicode is a superset of ISO-8859-1, but UTF-8 is a specific encoding, which changes many of the binary values. UTF-8 was designed specifically not to change the values of ASCII characters. 0x61 (lower-case a) in ASCII is encoded with the bits 0110 0001. In UTF-8 it is also encoded 0110 0001.

However, ñ, LATIN SMALL LETTER N WITH TILDE, is Unicode/ISO-8859-1 character 0xF1. In ISO-8859-1, this is represented by the bits 1111 0001. UTF-8 gets a little tricky here. In order to be extensible beyond 8 bits, it has to insert control bits at the beginning, so this character actually requires 2 bytes to represent instead of just one. In order to show that UTF-8 will be using two bytes to represent the character, the first byte begins with 110 (1110 is used when three bytes are needed). Each successive byte begins with 10 to show that it is not the beginning of a character. Then the code-point value is packed into the remaining free bits, as far to the right as possible.
So in this case, the control bits are 110x xxxx 10xx xxxx. The character value, 0xF1, or 1111 0001, gets inserted as follows: 110x xx{11} 10{11 0001}, and the remaining free x-es get replaced by zeroes: 1100 0011 1011 0001. Note that the Python interpreter agrees:

>>> x = u'\u00f1'
>>> x.encode('utf-8')
'\xc3\xb1'

(Conversion from binary to hex is left as an exercise for the reader.)

So while ASCII is a subset of UTF-8, ISO-8859-1 is definitely not. As others have said many times when this issue periodically comes up: UTF-8 is not Unicode. Hopefully this will help explain exactly why. Note that with other encodings, like UTF-16, even ASCII is not a subset. See the Wikipedia article on UTF-8 for a more complete explanation and external references to official documentation (http://en.wikipedia.org/wiki/UTF-8).

> The reason all this arose was that I was using ISO-8859-1/Latin-1 with all the right declarations, but then I needed to match a few chars outside of that range. So I didn't need to use u before, but now I do in some regexes, and I was wondering if this meant that /all/ my regexes had to be constructed from u strings, or whether I could just do the selected ones, either using literals (and saving the file as UTF-8) or Unicode escape sequences (and leaving the file as ASCII -- I already use hex escape sequences without problems, but that doesn't work past the end of ISO-8859-1).

Do you know about Unicode escape sequences?

>>> u'\xf1' == u'\u00f1'
True

> Thanks again for your feedback. Best wishes, Tim

No problem. It took me a while to wrap my head around it, too.

Cheers, Cliff
--
http://mail.python.org/mailman/listinfo/python-list
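Cliff's bit arithmetic can be verified mechanically. A Python 3 sketch (indexing a bytes object yields ints there, which makes the bit patterns easy to inspect):

```python
# U+00F1 (ñ) encodes to two UTF-8 bytes, per the packing described above.
encoded = u'\u00f1'.encode('utf-8')
assert encoded == b'\xc3\xb1'

# First byte: 110 prefix + the high five bits of the 11-bit payload.
assert format(encoded[0], '08b') == '11000011'
# Second byte: 10 prefix + the low six bits (110001) of 0xF1.
assert format(encoded[1], '08b') == '10110001'
```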
Re: Best ways of managing text encodings in source/regexes?
On Nov 26, 2007 4:27 PM, Martin v. Löwis [EMAIL PROTECTED] wrote:

> > myASCIIRegex = re.compile('[A-Z]')
> > myUniRegex = re.compile(u'\u2013') # en-dash
> >
> > then read the source file into a unicode string with codecs.read(), then expect re to match against the unicode string using either of those regexes if the string contains the relevant chars? Or do I need to make all my regex patterns unicode strings, with u?
>
> It will work fine if the regular expression restricts itself to ASCII, and doesn't rely on any of the locale-specific character classes (such as \w). If it's beyond ASCII, or does use such escapes, you had better make it a Unicode expression.

Yes, you have to be careful when writing unicode-sensitive regular expressions: http://effbot.org/zone/unicode-objects.htm

You can apply the same pattern to either 8-bit (encoded) or Unicode strings. To create a regular expression pattern that uses Unicode character classes for \w (and \s, and \b), use the (?u) flag prefix, or the re.UNICODE flag:

    pattern = re.compile('(?u)pattern')
    pattern = re.compile('pattern', re.UNICODE)

> I'm not actually sure what precisely the semantics is when you match an expression compiled from a byte string against a Unicode string, or vice versa. I believe it operates on the internal representation, so \xf6 in a byte string expression matches with \u00f6 in a Unicode string; it won't try to convert one into the other. Regards, Martin
--
http://mail.python.org/mailman/listinfo/python-list
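The \w difference the effbot page describes can be shown in a few lines. A hedged sketch: in modern Python 3 the Unicode behaviour is the default for str patterns, so re.ASCII is used here to stand in for the old non-Unicode default (the word "niño" is an arbitrary example).

```python
import re

word = u'ni\u00f1o'  # 'niño'

# ASCII-only \w (the old default for byte-string patterns): the match
# stops before the non-ASCII character, so the anchored pattern fails.
assert re.match(r'\w+$', word, re.ASCII) is None

# Unicode-aware \w, i.e. the (?u)/re.UNICODE behaviour: ñ counts as a
# word character, so the whole string matches.
assert re.match(r'\w+$', word, re.UNICODE) is not None
```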
Best ways of managing text encodings in source/regexes?
Hi, I've read around quite a bit about Unicode and Python's support for it, and I'm still unclear about how it all fits together in certain scenarios. Can anyone help clarify?

* When I say # -*- coding: utf-8 -*- and confirm my IDE is saving the source file as UTF-8, do I still need to prefix all the strings constructed in the source with u, as in myStr = u'blah', even when those strings contain only ASCII or ISO-8859-1 chars? (It would be a bother for me to do this for the complete source I'm working on, where I rarely need chars outside the ISO-8859-1 range.)

* Will Python figure it out if I use different encodings in different modules -- say a main source file which is # -*- coding: utf-8 -*- and an imported module which doesn't say this (for which Python will presumably use a default encoding)? This seems inevitable, given that standard library modules such as re don't declare an encoding, presumably because I don't see any non-ASCII chars in that source.

* If I want to use a Unicode char in a regex -- say an en-dash, U+2013 -- in an ASCII- or ISO-8859-1-encoded source file, can I say

    myASCIIRegex = re.compile('[A-Z]')
    myUniRegex = re.compile(u'\u2013') # en-dash

then read the source file into a unicode string with codecs.read(), then expect re to match against the unicode string using either of those regexes if the string contains the relevant chars? Or do I need to make all my regex patterns unicode strings, with u?

I've been trying to understand this for a while, so any clarification would be a great help.

Tim
--
http://mail.python.org/mailman/listinfo/python-list
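The third scenario above can be tried directly, using the question's own variable names. A sketch (the sample text is invented): both the plain ASCII pattern and the u'' pattern match against one decoded Unicode string.

```python
import re

# What codecs would hand back after decoding a UTF-8 file containing
# "Pages A–Z" (the en-dash is U+2013, bytes E2 80 93 in UTF-8).
text = b'Pages A\xe2\x80\x93Z'.decode('utf-8')

myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013')  # en-dash

# Both patterns match the same decoded string: once the input is a
# Unicode string, the file's on-disk encoding no longer matters.
assert myASCIIRegex.search(text) is not None
assert myUniRegex.search(text) is not None
```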
Re: Best ways of managing text encodings in source/regexes?
> * When I say # -*- coding: utf-8 -*- and confirm my IDE is saving the source file as UTF-8, do I still need to prefix all the strings constructed in the source with u, as in myStr = u'blah', even when those strings contain only ASCII or ISO-8859-1 chars? (It would be a bother for me to do this for the complete source I'm working on, where I rarely need chars outside the ISO-8859-1 range.)

Depends on what you want to achieve. If you don't prefix your strings with u, they will stay byte string objects, and won't become Unicode strings. That should be fine for strings that are pure ASCII; for ISO-8859-1 strings, it is safer to use only Unicode objects to represent them. In Py3k, that will change - string literals will automatically be Unicode objects.

> * Will Python figure it out if I use different encodings in different modules -- say a main source file which is # -*- coding: utf-8 -*- and an imported module which doesn't say this (for which Python will presumably use a default encoding)?

Yes, it will. The encoding declaration is per-module.

> * If I want to use a Unicode char in a regex -- say an en-dash, U+2013 -- in an ASCII- or ISO-8859-1-encoded source file, can I say
>
>     myASCIIRegex = re.compile('[A-Z]')
>     myUniRegex = re.compile(u'\u2013') # en-dash
>
> then read the source file into a unicode string with codecs.read(), then expect re to match against the unicode string using either of those regexes if the string contains the relevant chars? Or do I need to make all my regex patterns unicode strings, with u?

It will work fine if the regular expression restricts itself to ASCII, and doesn't rely on any of the locale-specific character classes (such as \w). If it's beyond ASCII, or does use such escapes, you had better make it a Unicode expression. I'm not actually sure what precisely the semantics is when you match an expression compiled from a byte string against a Unicode string, or vice versa. I believe it operates on the internal representation, so \xf6 in a byte string expression matches with \u00f6 in a Unicode string; it won't try to convert one into the other.

Regards, Martin
--
http://mail.python.org/mailman/listinfo/python-list
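An aside from a later Python's perspective (my addition, not part of Martin's post): the byte-pattern-versus-Unicode-string question he hedges on was eventually settled in Python 3 by simply forbidding the mix, so the internal-representation subtlety no longer arises.

```python
import re

# In Python 3, applying a bytes pattern to a str raises TypeError
# rather than comparing internal representations.
try:
    re.compile(b'\xf6').search(u'\u00f6')
except TypeError:
    print('bytes patterns and str cannot be mixed in Python 3')
```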
Re: Best ways of managing text encodings in source/regexes?
On Nov 27, 12:27 am, Martin v. Löwis [EMAIL PROTECTED] wrote:

> > * When I say # -*- coding: utf-8 -*- and confirm my IDE is saving the source file as UTF-8, do I still need to prefix all the strings constructed in the source with u, as in myStr = u'blah', even when those strings contain only ASCII or ISO-8859-1 chars? (It would be a bother for me to do this for the complete source I'm working on, where I rarely need chars outside the ISO-8859-1 range.)
>
> Depends on what you want to achieve. If you don't prefix your strings with u, they will stay byte string objects, and won't become Unicode strings. That should be fine for strings that are pure ASCII; for ISO-8859-1 strings, it is safer to use only Unicode objects to represent them. In Py3k, that will change - string literals will automatically be Unicode objects.
>
> > * Will Python figure it out if I use different encodings in different modules -- say a main source file which is # -*- coding: utf-8 -*- and an imported module which doesn't say this (for which Python will presumably use a default encoding)?
>
> Yes, it will. The encoding declaration is per-module.
>
> > * If I want to use a Unicode char in a regex -- say an en-dash, U+2013 -- in an ASCII- or ISO-8859-1-encoded source file, can I say
> >
> >     myASCIIRegex = re.compile('[A-Z]')
> >     myUniRegex = re.compile(u'\u2013') # en-dash
> >
> > then read the source file into a unicode string with codecs.read(), then expect re to match against the unicode string using either of those regexes if the string contains the relevant chars? Or do I need to make all my regex patterns unicode strings, with u?
>
> It will work fine if the regular expression restricts itself to ASCII, and doesn't rely on any of the locale-specific character classes (such as \w). If it's beyond ASCII, or does use such escapes, you had better make it a Unicode expression. I'm not actually sure what precisely the semantics is when you match an expression compiled from a byte string against a Unicode string, or vice versa. I believe it operates on the internal representation, so \xf6 in a byte string expression matches with \u00f6 in a Unicode string; it won't try to convert one into the other.
>
> Regards, Martin

Thanks Martin, that's a very helpful response to what I was concerned might be an overly long query. Yes, I'd read that in Py3k the distinction between byte strings and Unicode strings would disappear -- I look forward to that...

Tim
--
http://mail.python.org/mailman/listinfo/python-list