Re: Best ways of managing text encodings in source/regexes?
Please see the correction from Cliff pasted here after this excerpt from Tim:

> ... python will, without an explicit decode, assume the byte string is ASCII which is a subset of Unicode (ISO-8859-1 isn't).

The one comment I'd make is that ASCII and ISO-8859-1 are both subsets of Unicode (which relates to the abstract code points), but ASCII is also a subset of UTF-8 at the byte-stream level, while ISO-8859-1 is not a subset of UTF-8, nor, as far as I can tell, of any other Unicode *encoding*. Thus a file encoded in ASCII *is* in fact a UTF-8 file; there is no way to distinguish the two. But an ISO-8859-1 file is not the same (at the byte-stream level) as a file with identical content in UTF-8 or any other Unicode encoding.
--
http://mail.python.org/mailman/listinfo/python-list
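The byte-level distinction above is easy to check at the interpreter. A minimal sketch (the word "café" is just an arbitrary example): ASCII bytes decode as UTF-8 unchanged, while an ISO-8859-1 byte above 0x7F is rejected by a UTF-8 decoder.

```python
# ASCII bytes are valid UTF-8 as-is: the encodings agree byte-for-byte.
ascii_bytes = u'cafe'.encode('ascii')
assert ascii_bytes.decode('utf-8') == u'cafe'

# ISO-8859-1 bytes are NOT valid UTF-8: 0xE9 (é) is a bare high byte,
# which UTF-8 would expect to be part of a multi-byte sequence.
latin1_bytes = u'caf\u00e9'.encode('iso-8859-1')  # b'caf\xe9'
try:
    latin1_bytes.decode('utf-8')
except UnicodeDecodeError:
    print('ISO-8859-1 bytes are not a valid UTF-8 stream')
```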
Re: Best ways of managing text encodings in source/regexes?
OK, for those interested in this sort of thing, this is what I now think is necessary to work with Unicode in Python. Thanks to those who gave feedback, and to Cliff in particular (but any remaining misconceptions are my own!). Here are the results of my attempts to come to grips with this. Comments/corrections welcome...

(Note this is not about which characters one expects to match with \w etc. when compiling regexes with Python's re.UNICODE flag. It's about the encoding of one's source strings when building regexes, in order to match against strings read from files of various encodings.)

I/O: READ TO/WRITE FROM UNICODE STRING OBJECTS. Always use codecs to read from a specific encoding into a Python Unicode string, and use codecs to encode to a specific encoding when writing the processed data. Reading through codecs delivers a Unicode string decoded from a specific encoding, and writing through codecs will put the Unicode string into a specific encoding (be that an encoding of Unicode such as UTF-8, or anything else).

SOURCE: Save the source as UTF-8 in your editor, tell Python with # -*- coding: utf-8 -*-, and construct all strings with u'' (or ur'' instead of r''). Then, when you're concatenating strings constructed in your source with strings read with codecs, you needn't worry about conversion issues. (When concatenating byte strings from your source with Unicode strings, Python will, without an explicit decode, assume the byte string is ASCII, which is a subset of Unicode (ISO-8859-1 isn't).)

Even if you save the source as UTF-8 and tell Python with # -*- coding: utf-8 -*-, saying myString = 'blah' still gives you a byte string. To construct a Unicode string you must say myString = u'blah' or myString = unicode('blah'), even if your source is UTF-8. Typing 'u' when constructing all strings isn't too arduous, and it's less effort than passing selected non-ASCII source strings to unicode() and needing to remember where to do it.
(You could easily slip a non-ASCII char into a byte string in your code, because most editors and default system encodings will allow this.) Doing everything in Unicode simplifies life. Since the source is now UTF-8, and given Unicode support in the editor, it doesn't matter whether you use Unicode escape sequences or literal Unicode characters when constructing strings, since:

>>> u'á' == u'\u00E1'
True

REGEXES: I'm a bit less certain about regexes, but this is how I think it's going to work: now that my regexes are constructed from Unicode strings, and those regexes will be compiled to match against Unicode strings read with codecs, any potential problem with encoding conversion disappears. If I put an en-dash into a regex built using u'', and I happen to have read the file in the ASCII encoding (which doesn't support en-dashes), the regex simply won't match, because the pattern doesn't exist in the /Unicode/ string served up by codecs. There's no actual problem with my string encoding handling; it just means I'm looking for the wrong chars in a Unicode string read from a file not saved in a Unicode encoding.

Tim
--
http://mail.python.org/mailman/listinfo/python-list
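The decode-match-encode round trip described above can be sketched concretely. This is an illustration only (the filename and contents are made up): write a file through codecs in UTF-8, read it back as a Unicode string, and match it with a regex built from a u'' literal.

```python
import codecs
import os
import re
import tempfile

# Write some text containing an en-dash (U+2013) out in UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'dashes.txt')
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(u'pages 12\u201334')

# Reading through codecs yields a Unicode string, not raw bytes.
with codecs.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# A pattern built from a u'' literal matches the decoded text directly;
# no byte-level encoding details leak into the regex.
dash_re = re.compile(u'\u2013')
assert dash_re.search(text) is not None
```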
Re: Best ways of managing text encodings in source/regexes?
On Tue, Nov 27, 2007 at 01:30:15AM +0200, tinker barbet wrote regarding Re: Best ways of managing text encodings in source/regexes?:

> Hi, thanks for your responses; as I said in the reply I posted, I thought later it was a bit long, so I'm grateful you held out! I should have said (but see comment about length) that I'd tried joining a Unicode and a byte string in the interpreter and got that working, but wondered whether it was safe (I'd seen the odd warning about mixing byte strings and Unicode). Anyway, what I was concerned about was what Python does with source files rather than what happens from the interpreter, since I don't know if it's possible to change the encoding of the terminal without messing with site.py (and what else).
>
> Aren't both ASCII and ISO-8859-1 subsets of UTF-8? Can't you then use chars from either of those charsets in a file saved as UTF-8 by one's editor, with a # -*- coding: utf-8 -*- pseudo-declaration for Python, without problems? You seem to disagree.

I do disagree. Unicode is a superset of ISO-8859-1, but UTF-8 is a specific encoding, which changes many of the binary values. UTF-8 was designed specifically not to change the values of ASCII characters. 0x61 (lower-case a) in ASCII is encoded with the bits 0110 0001. In UTF-8 it is also encoded 0110 0001.

However, ñ, LATIN SMALL LETTER N WITH TILDE, is Unicode/ISO-8859-1 character 0xF1. In ISO-8859-1, this is represented by the bits 1111 0001. UTF-8 gets a little tricky here. In order to be extensible beyond 8 bits, it has to insert control bits at the beginning, so this character actually requires 2 bytes to represent instead of just one. In order to show that UTF-8 will be using two bytes to represent the character, the first byte begins with 110 (1110 is used when three bytes are needed). Each successive byte begins with 10 to show that it is not the beginning of a character. Then the code-point value is packed into the remaining free bits, as far to the right as possible.
So in this case, the control bits are 110x xxxx 10xx xxxx. The character value, 0xF1, or 1111 0001, gets inserted as follows: 110x xx{11} 10{11 0001}, and the remaining free x-es get replaced by zeroes: 1100 0011 1011 0001. Note that the Python interpreter agrees:

>>> x = u'\u00f1'
>>> x.encode('utf-8')
'\xc3\xb1'

(Conversion from binary to hex is left as an exercise for the reader.)

So while ASCII is a subset of UTF-8, ISO-8859-1 is definitely not. As others have said many times when this issue periodically comes up: UTF-8 is not Unicode. Hopefully this will help explain exactly why. Note that with other encodings, like UTF-16, even ASCII is not a subset. See the Wikipedia article on UTF-8 for a more complete explanation and external references to official documentation (http://en.wikipedia.org/wiki/UTF-8).

> The reason all this arose was that I was using ISO-8859-1/Latin-1 with all the right declarations, but then I needed to match a few chars outside of that range. So I didn't need to use u before, but now I do in some regexes, and I was wondering if this meant that /all/ my regexes had to be constructed from u strings, or whether I could just do the selected ones, either using literals (and saving the file as UTF-8) or Unicode escape sequences (and leaving the file as ASCII -- I already use hex escape sequences without problems, but that doesn't work past the end of ISO-8859-1).

Do you know about Unicode escape sequences?

>>> u'\xf1' == u'\u00f1'
True

> Thanks again for your feedback. Best wishes, Tim

No problem. It took me a while to wrap my head around it, too.

Cheers, Cliff
--
http://mail.python.org/mailman/listinfo/python-list
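Cliff's bit arithmetic can be verified mechanically. A Python 3 sketch (indexing a bytes object yields ints there, which makes the bit patterns easy to inspect):

```python
# U+00F1 (ñ) encodes to two UTF-8 bytes, per the packing described above.
encoded = u'\u00f1'.encode('utf-8')
assert encoded == b'\xc3\xb1'

# First byte: 110 prefix + the high five bits of the 11-bit payload.
assert format(encoded[0], '08b') == '11000011'
# Second byte: 10 prefix + the low six bits (110001) of 0xF1.
assert format(encoded[1], '08b') == '10110001'
```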
Re: Best ways of managing text encodings in source/regexes?
On Nov 26, 2007 4:27 PM, Martin v. Löwis [EMAIL PROTECTED] wrote:

> > myASCIIRegex = re.compile('[A-Z]')
> > myUniRegex = re.compile(u'\u2013') # en-dash
> >
> > then read the source file into a unicode string with codecs.read(), then expect re to match against the unicode string using either of those regexes if the string contains the relevant chars? Or do I need to make all my regex patterns unicode strings, with u?
>
> It will work fine if the regular expression restricts itself to ASCII, and doesn't rely on any of the locale-specific character classes (such as \w). If it's beyond ASCII, or does use such escapes, you had better make it a Unicode expression.

Yes, you have to be careful when writing unicode-sensitive regular expressions: http://effbot.org/zone/unicode-objects.htm

You can apply the same pattern to either 8-bit (encoded) or Unicode strings. To create a regular expression pattern that uses Unicode character classes for \w (and \s, and \b), use the (?u) flag prefix, or the re.UNICODE flag:

    pattern = re.compile('(?u)pattern')
    pattern = re.compile('pattern', re.UNICODE)

> I'm not actually sure what precisely the semantics is when you match an expression compiled from a byte string against a Unicode string, or vice versa. I believe it operates on the internal representation, so \xf6 in a byte string expression matches with \u00f6 in a Unicode string; it won't try to convert one into the other. Regards, Martin
--
http://mail.python.org/mailman/listinfo/python-list
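The \w difference the effbot page describes can be shown in a few lines. A hedged sketch: in modern Python 3 the Unicode behaviour is the default for str patterns, so re.ASCII is used here to stand in for the old non-Unicode default (the word "niño" is an arbitrary example).

```python
import re

word = u'ni\u00f1o'  # 'niño'

# ASCII-only \w (the old default for byte-string patterns): the match
# stops before the non-ASCII character, so the anchored pattern fails.
assert re.match(r'\w+$', word, re.ASCII) is None

# Unicode-aware \w, i.e. the (?u)/re.UNICODE behaviour: ñ counts as a
# word character, so the whole string matches.
assert re.match(r'\w+$', word, re.UNICODE) is not None
```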
Best ways of managing text encodings in source/regexes?
Hi, I've read around quite a bit about Unicode and Python's support for it, and I'm still unclear about how it all fits together in certain scenarios. Can anyone help clarify?

* When I say # -*- coding: utf-8 -*- and confirm my IDE is saving the source file as UTF-8, do I still need to prefix all the strings constructed in the source with u, as in myStr = u'blah', even when those strings contain only ASCII or ISO-8859-1 chars? (It would be a bother for me to do this for the complete source I'm working on, where I rarely need chars outside the ISO-8859-1 range.)

* Will Python figure it out if I use different encodings in different modules -- say a main source file which is # -*- coding: utf-8 -*- and an imported module which doesn't say this (for which Python will presumably use a default encoding)? This seems inevitable, given that standard library modules such as re don't declare an encoding, presumably because I don't see any non-ASCII chars in that source.

* If I want to use a Unicode char in a regex -- say an en-dash, U+2013 -- in an ASCII- or ISO-8859-1-encoded source file, can I say

    myASCIIRegex = re.compile('[A-Z]')
    myUniRegex = re.compile(u'\u2013') # en-dash

then read the source file into a unicode string with codecs.read(), then expect re to match against the unicode string using either of those regexes if the string contains the relevant chars? Or do I need to make all my regex patterns unicode strings, with u?

I've been trying to understand this for a while, so any clarification would be a great help.

Tim
--
http://mail.python.org/mailman/listinfo/python-list
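The third scenario above can be tried directly, using the question's own variable names. A sketch (the sample text is invented): both the plain ASCII pattern and the u'' pattern match against one decoded Unicode string.

```python
import re

# What codecs would hand back after decoding a UTF-8 file containing
# "Pages A–Z" (the en-dash is U+2013, bytes E2 80 93 in UTF-8).
text = b'Pages A\xe2\x80\x93Z'.decode('utf-8')

myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013')  # en-dash

# Both patterns match the same decoded string: once the input is a
# Unicode string, the file's on-disk encoding no longer matters.
assert myASCIIRegex.search(text) is not None
assert myUniRegex.search(text) is not None
```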
Re: Best ways of managing text encodings in source/regexes?
> * When I say # -*- coding: utf-8 -*- and confirm my IDE is saving the source file as UTF-8, do I still need to prefix all the strings constructed in the source with u, as in myStr = u'blah', even when those strings contain only ASCII or ISO-8859-1 chars? (It would be a bother for me to do this for the complete source I'm working on, where I rarely need chars outside the ISO-8859-1 range.)

Depends on what you want to achieve. If you don't prefix your strings with u, they will stay byte string objects, and won't become Unicode strings. That should be fine for strings that are pure ASCII; for ISO-8859-1 strings, it is safer to use only Unicode objects to represent them. In Py3k, that will change - string literals will automatically be Unicode objects.

> * Will Python figure it out if I use different encodings in different modules -- say a main source file which is # -*- coding: utf-8 -*- and an imported module which doesn't say this (for which Python will presumably use a default encoding)?

Yes, it will. The encoding declaration is per-module.

> * If I want to use a Unicode char in a regex -- say an en-dash, U+2013 -- in an ASCII- or ISO-8859-1-encoded source file, can I say
>
>     myASCIIRegex = re.compile('[A-Z]')
>     myUniRegex = re.compile(u'\u2013') # en-dash
>
> then read the source file into a unicode string with codecs.read(), then expect re to match against the unicode string using either of those regexes if the string contains the relevant chars? Or do I need to make all my regex patterns unicode strings, with u?

It will work fine if the regular expression restricts itself to ASCII, and doesn't rely on any of the locale-specific character classes (such as \w). If it's beyond ASCII, or does use such escapes, you had better make it a Unicode expression. I'm not actually sure what precisely the semantics is when you match an expression compiled from a byte string against a Unicode string, or vice versa. I believe it operates on the internal representation, so \xf6 in a byte string expression matches with \u00f6 in a Unicode string; it won't try to convert one into the other.

Regards, Martin
--
http://mail.python.org/mailman/listinfo/python-list
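An aside from a later Python's perspective (my addition, not part of Martin's post): the byte-pattern-versus-Unicode-string question he hedges on was eventually settled in Python 3 by simply forbidding the mix, so the internal-representation subtlety no longer arises.

```python
import re

# In Python 3, applying a bytes pattern to a str raises TypeError
# rather than comparing internal representations.
try:
    re.compile(b'\xf6').search(u'\u00f6')
except TypeError:
    print('bytes patterns and str cannot be mixed in Python 3')
```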
Re: Best ways of managing text encodings in source/regexes?
On Nov 27, 12:27 am, Martin v. Löwis [EMAIL PROTECTED] wrote:

> > * When I say # -*- coding: utf-8 -*- and confirm my IDE is saving the source file as UTF-8, do I still need to prefix all the strings constructed in the source with u, as in myStr = u'blah', even when those strings contain only ASCII or ISO-8859-1 chars? (It would be a bother for me to do this for the complete source I'm working on, where I rarely need chars outside the ISO-8859-1 range.)
>
> Depends on what you want to achieve. If you don't prefix your strings with u, they will stay byte string objects, and won't become Unicode strings. That should be fine for strings that are pure ASCII; for ISO-8859-1 strings, it is safer to use only Unicode objects to represent them. In Py3k, that will change - string literals will automatically be Unicode objects.
>
> > * Will Python figure it out if I use different encodings in different modules -- say a main source file which is # -*- coding: utf-8 -*- and an imported module which doesn't say this (for which Python will presumably use a default encoding)?
>
> Yes, it will. The encoding declaration is per-module.
>
> > * If I want to use a Unicode char in a regex -- say an en-dash, U+2013 -- in an ASCII- or ISO-8859-1-encoded source file, can I say
> >
> >     myASCIIRegex = re.compile('[A-Z]')
> >     myUniRegex = re.compile(u'\u2013') # en-dash
> >
> > then read the source file into a unicode string with codecs.read(), then expect re to match against the unicode string using either of those regexes if the string contains the relevant chars? Or do I need to make all my regex patterns unicode strings, with u?
>
> It will work fine if the regular expression restricts itself to ASCII, and doesn't rely on any of the locale-specific character classes (such as \w). If it's beyond ASCII, or does use such escapes, you had better make it a Unicode expression. I'm not actually sure what precisely the semantics is when you match an expression compiled from a byte string against a Unicode string, or vice versa. I believe it operates on the internal representation, so \xf6 in a byte string expression matches with \u00f6 in a Unicode string; it won't try to convert one into the other.
>
> Regards, Martin

Thanks Martin, that's a very helpful response to what I was concerned might be an overly long query. Yes, I'd read that in Py3k the distinction between byte strings and Unicode strings would disappear -- I look forward to that...

Tim
--
http://mail.python.org/mailman/listinfo/python-list