Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-14 Thread Hrvoje Nikšić
On Fri, 2007-05-11 at 13:06 -0700, Guido van Rossum wrote:
  attribution_pattern = re.compile(ur'(---?(?!-)|\u2014) *(?=[^ \n])')
 
 But wouldn't it be just as handy to teach the re module about \u and
 \U, just as it already knows about \x (and \123 octals)?

And \n, \r, etc.  Implementing \u in re is not only useful, but also
consistent.
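
A minimal sketch of the re-level approach, assuming Python 3.3 or later, where (as far as I know) the re module parses \uXXXX and \UXXXXXXXX itself. The pattern is the Docutils one quoted above with the u prefix dropped; the sample lines are invented for illustration.

```python
import re

# Raw string: Python leaves the six characters \u2014 untouched, and the
# re module (not the string literal) turns them into U+2014.
attribution_pattern = re.compile(r'(---?(?!-)|\u2014) *(?=[^ \n])')

# Hypothetical sample lines, not taken from Docutils.
dash_line = '\u2014 Some Author'   # non-raw literal, starts with U+2014
hyphen_line = '-- Some Author'
plain_line = 'Some Author'

dash_matches = attribution_pattern.match(dash_line) is not None
hyphen_matches = attribution_pattern.match(hyphen_line) is not None
plain_matches = attribution_pattern.match(plain_line) is not None
```

With this, the string literal syntax no longer needs to know about \u at all; the regular expression engine does the work.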



___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-13 Thread M.-A. Lemburg
On 2007-05-12 02:42, Andrew McNabb wrote:
 On Sat, May 12, 2007 at 01:30:52AM +0200, M.-A. Lemburg wrote:
 I wonder how we managed to survive all these years with
 the existing consistent and concise definition of the
 raw-unicode-escape codec ;-)

 There are two options:

  * no one really uses Unicode raw strings nowadays

  * none of the existing users has ever stumbled across the
problem case that triggered all this

 Both ways, we're discussing a non-issue.
 
 
 Sure, it's a non-issue for Python 2.x.  However, when Python 3 comes
 along, and all strings are Unicode, there will likely be a lot more
 users stumbling into the problem case.

In the first case, changing the codec won't affect much code when
ported to Py3k.

In the second case, a change to the codec is not necessary.

Please also consider the following:

* without the Unicode escapes, the only way to put non-ASCII
  code points into a raw Unicode string is via a source code encoding
  of say UTF-8 or UTF-16, pretty much defeating the original
  requirement of writing ASCII code only

* non-ASCII code points in text are not uncommon, they occur
  in most European scripts, all Asian scripts,
  many scientific texts and also in texts meant for the web
  (just have a look at the HTML entities, or think of Word
  exports using quotes)

* adding Unicode escapes to the re module will break code
  already using ...\u... in the regular expressions for
  other purposes; writing conversion tools that detect this
  usage is going to be hard

* OTOH, writing conversion tools that simply work on string
  literals in general is easy

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, May 13 2007)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-13 Thread Martin v. Löwis
 * without the Unicode escapes, the only way to put non-ASCII
   code points into a raw Unicode string is via a source code encoding
   of say UTF-8 or UTF-16, pretty much defeating the original
   requirement of writing ASCII code only

That's no problem, though - just don't put the Unicode character
into a raw string. Use plain strings if you have a need to include
Unicode characters, and are not willing to leave ASCII.

For Python 3, the default source encoding is UTF-8, so it is
much easier to use non-ASCII characters in the source code.
The original requirement may not be as strong anymore as it
used to be.

 * non-ASCII code points in text are not uncommon, they occur
   in most European scripts, all Asian scripts,
   many scientific texts and also in texts meant for the web
   (just have a look at the HTML entities, or think of Word
   exports using quotes)

And you are seriously telling me that people who commonly
use non-ASCII code points in their source code are willing
to refer to them by Unicode ordinal number (which, of course,
they all know by heart, from 1 to 65536)?

 * adding Unicode escapes to the re module will break code
   already using ...\u... in the regular expressions for
   other purposes; writing conversion tools that detect this
   usage is going to be hard

It's unlikely to occur in code today - \u just means the same
as u (so \u1234 matches u1234); if you want a backslash
followed by u in your regular expression, you should write
\\u.

It would be possible to future-warn about \u in 2.6, catching
these cases. Authors then would either have to remove the
backslash, or duplicate it, depending on what they want to
express.
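
The distinction between a single and a doubled backslash can be sketched as follows, assuming a Python 3 where re already interprets \u (3.3 or later, as far as I know). Note that this is the post-change behaviour; in the 2007 re module a single \u simply matched the letter u. The sample strings are invented.

```python
import re

text_with_char = '\u1234'        # one character, U+1234
text_with_backslash = '\\u1234'  # six characters: backslash, u, 1, 2, 3, 4

# Single backslash: re treats \u1234 as the code point U+1234.
single_matches_char = bool(re.fullmatch(r'\u1234', text_with_char))
single_matches_text = bool(re.fullmatch(r'\u1234', text_with_backslash))

# Doubled backslash: a literal backslash followed by the text u1234.
double_matches_text = bool(re.fullmatch(r'\\u1234', text_with_backslash))
double_matches_char = bool(re.fullmatch(r'\\u1234', text_with_char))
```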

Regards,
Martin



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-13 Thread M.-A. Lemburg
On 2007-05-13 18:04, Martin v. Löwis wrote:
 * without the Unicode escapes, the only way to put non-ASCII
   code points into a raw Unicode string is via a source code encoding
   of say UTF-8 or UTF-16, pretty much defeating the original
   requirement of writing ASCII code only
 
 That's no problem, though - just don't put the Unicode character
 into a raw string. Use plain strings if you have a need to include
 Unicode characters, and are not willing to leave ASCII.
 
 For Python 3, the default source encoding is UTF-8, so it is
 much easier to use non-ASCII characters in the source code.
 The original requirement may not be as strong anymore as it
 used to be.

You can do that today: Just put the # coding: utf-8 marker
at the top of the file.

However, in some cases, your editor may not be capable of
displaying or letting you enter the Unicode text you have
in mind.

In other cases, there may be a corporate coding standard in
place that prohibits using non-ASCII text in source code,
or fixes the encoding to e.g. Latin-1.

In all those cases, the Unicode code points which cannot be used
in the source code directly have to be entered by other means,
and the easiest way to do this is by using Unicode escapes.

 * non-ASCII code points in text are not uncommon, they occur
   in most European scripts, all Asian scripts,
   many scientific texts and also in texts meant for the web
   (just have a look at the HTML entities, or think of Word
   exports using quotes)
 
 And you are seriously telling me that people who commonly
 use non-ASCII code points in their source code are willing
 to refer to them by Unicode ordinal number (which, of course,
 they all know by heart, from 1 to 65536)?

No, I'm not. I'm saying that non-ASCII code points are in
common use and (together with the above bullet) that there
are situations where you can't put the relevant code point
directly into your source code.

Using Unicode escapes for these will always be a kludge,
but it's still better than not being able to enter the
code points at all.

 * adding Unicode escapes to the re module will break code
   already using ...\u... in the regular expressions for
   other purposes; writing conversion tools that detect this
   usage is going to be hard
 
 It's unlikely to occur in code today - \u just means the same
 as u (so \u1234 matches u1234); if you want a backslash
 followed by u in your regular expression, you should write
 \\u.
 
 It would be possible to future-warn about \u in 2.6, catching
 these cases. Authors then would either have to remove the
 backslash, or duplicate it, depending on what they want to
 express.

Good idea.

The re module would then have to implement the same escaping
scheme as the raw-unicode-escape codec (only an odd number of
backslashes causes the escaping code to trigger).
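
The odd-number-of-backslashes rule can be sketched in pure Python. This is only an illustration under stated assumptions: expand_u_escapes is a hypothetical helper name, it handles just the four-digit \uXXXX form, and it is not the code used by the codec or by re.

```python
import re

# A run of backslashes followed by u and four hex digits.
_U_ESCAPE = re.compile(r'(\\+)u([0-9a-fA-F]{4})')

def expand_u_escapes(text):
    """Expand \\uXXXX only when it follows an odd number of backslashes."""
    def repl(match):
        slashes, digits = match.group(1), match.group(2)
        if len(slashes) % 2 == 0:
            # Even run: the backslashes pair up, so u is plain text.
            return match.group(0)
        # Odd run: the last backslash belongs to the escape, the rest stay.
        return slashes[:-1] + chr(int(digits, 16))
    return _U_ESCAPE.sub(repl, text)

one = expand_u_escapes('\\u0062\\n')       # \u0062\n  -> b, backslash, n
two = expand_u_escapes('\\\\u0062')        # \\u0062   -> unchanged
three = expand_u_escapes('\\\\\\u0062')    # \\\u0062  -> two backslashes, b
```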

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-13 Thread Greg Ewing
M.-A. Lemburg wrote:
 * non-ASCII code points in text are not uncommon, they occur
   in most European scripts, all Asian scripts,

In an Asian script, almost every character is likely to
be non-ascii, which is going to be pretty hard to read
as a string of unicode escapes.

Maybe what we want is a new kind of string literal in
which *everything* is a unicode escape. A sufficiently
smart editor could then display it using the appropriate
characters, yet it could still be dealt with as ascii-
only in a pinch.
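
Python has no such literal, but the existing unicode_escape codec gives a rough idea of the ASCII-only form described here. A small sketch; the sample text is an arbitrary three-character example chosen for illustration.

```python
# Three CJK characters, written with escapes so this snippet stays ASCII.
text = '\u65e5\u672c\u8a9e'

# Every non-ASCII character becomes a \uXXXX sequence in the byte string.
ascii_form = text.encode('unicode_escape')

# Decoding with the same codec restores the original text.
round_trip = ascii_form.decode('unicode_escape')
```

An editor could display ascii_form as the decoded characters while the file on disk stays pure ASCII.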

--
Greg


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread Ron Adam
Martin v. Löwis wrote:
 This is what prompted my question, actually: in Py3k, in the
 str/unicode unification branch, r"\u1234" changes meaning: before the
 unification, this was an 8-bit string, where the \u was not special,
 but now it is a unicode string, where \u *is* special.
 
 That is true for non-raw strings also: the meaning of \u1234 also
 changes.
 
 However, traditionally, there was *no* escaping mechanism in raw strings
 in Python, and I feel that this is a good principle, because it is
 easy to learn (if you leave out the detail that \ can't be the last
 character in a raw string - which should get fixed also, IMO). So I
 think in Py3k, \u1234 should continue to be a string with 6
 characters. Otherwise, people will complain that
 os.stat(r"c:\windows\system32\user32.dll") fails. Telling them to write
 os.stat(r"c:\windows\system32\u005Cuser32.dll") will just cause puzzled
 faces.
 
 Windows path names are one of the two primary applications of raw
 strings (the other being regexes).


I think regular expressions become easier to read if they don't also
contain python escape characters because then you don't have to mentally
parse which ones are part of the regular expression and which ones are
evaluated by python.  The re module can still evaluate r"\u", r"\'",
and r'\"' sequences even if python doesn't.

I experimented with tokenizer.c to see if the trailing '\' could be special
cased in raw strings.  The minimum change I could come up with was to have
it not respect slash-quote sequences (for finding the end of a string) if
the quote is the same type as the quote used to define the string.  The
following strings in the library needed to be adjusted after that change.

I don't think this is the best solution, but the list of strings that
needed changing might be useful for the discussion.


-r'(\'[^\']*\'|[^]*|[][\-a-zA-Z0-9./,:;+*%?!$\(\)_#=~\'@]*))?')
+r'''(\'[^\']*\'|[^]*|[][\-a-zA-Z0-9./,:;+*%?!$\(\)_#=~\'@]*))?''')


-_declstringlit_match = re.compile(r'(\'[^\']*\'|[^]*)\s*').match
+_declstringlit_match = re.compile(r'''(\'[^\']*\'|[^]*)\s*''').match


-r'(?=[\w\!\\'\\.\,\?])-{2,}(?=\w))')   # em-dash
+r'''(?=[\w\!\\'\\.\,\?])-{2,}(?=\w))''')   # em-dash


- r'[\\']?'   # optional end-of-quote
+ r'''[\\']?'''   # optional end-of-quote


-_wordchars_re = re.compile(r'[^\\\'\%s ]*' % string.whitespace)
+_wordchars_re = re.compile(r'''[^\\\'\%s ]*''' % string.whitespace)


-HEADER_QUOTED_VALUE_RE = re.compile(r^\s*=\s*\([^\\\]*(?:\\.[^\\\]*)*)\)
+HEADER_QUOTED_VALUE_RE = re.compile(r'''^\s*=\s*\([^\\\]*(?:\\.[^\\\]*)*)\''')

-HEADER_JOIN_ESCAPE_RE = re.compile(r([\\\]))
+HEADER_JOIN_ESCAPE_RE = re.compile(r'([\\\])')

-quote_re = re.compile(r([\\\]))
+quote_re = re.compile(r'([\\\])')


-return re.sub(r'((\\[\\abfnrtv\']|\\[0-9]..|\\x..|\\u)+)',
+return re.sub(r'''((\\[\\abfnrtv\']|\\[0-9]..|\\x..|\\u)+)''',


-_OPTION_DIRECTIVE_RE = re.compile(r'#\s*doctest:\s*([^\n\']*)$',
+_OPTION_DIRECTIVE_RE = re.compile(r'''#\s*doctest:\s*([^\n\']*)$''',
re.MULTILINE)


-s = unicode(r'\x00=\'a\\b\x80\xff\u\u0001\u1234', 'unicode-escape')
+s = unicode(r'''\x00=\'a\\b\x80\xff\u\u0001\u1234''', d


-_escape = re.compile(r[\\x80-\xff]+) # 1.5.2
+_escape = re.compile(r'[\\x80-\xff]+') # 1.5.2


-r'(\'[^\']*\'|[^]*|[-a-zA-Z0-9./,:;+*%?!$\(\)[EMAIL PROTECTED]))?')
+r'''(\'[^\']*\'|[^]*|[-a-zA-Z0-9./,:;+*%?!$\(\)[EMAIL PROTECTED]))?''')



I also noticed that python handles the '\' escape character differently 
than re does in regular strings.  In regular expressions, a single '\' is 
always an escape character.  If the following character is not a special 
character, then the two character combination becomes the second 
non-special character.

 \'  --> '
 \\  --> \
 \q  --> q  ('q' not special so '\q' is 'q')

This isn't how python does it.

 >>> '\''
 "'"
 >>> "\\"
 '\\'
 >>> "\q"    ('q' not special, so backslash is not an escape.)
 '\q'


So it might be good to have it always be an escape in regular strings, and 
never be an escape in raw strings.
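
The difference between the two layers can be checked directly. A small sketch, assuming a current Python 3; the table above describes the re module of 2007, and to my knowledge Python 3.6 and later reject unknown letter escapes such as \q in a pattern instead of treating them as the bare letter.

```python
import re

# Python string literals: a recognised escape collapses to one character.
quote_len = len('\'')       # \' is just the quote character
backslash_len = len('\\')   # \\ is a single backslash

# Raw literals keep every backslash.
raw_len = len(r'\\')

# re: a backslash before a non-alphanumeric character matches that character.
re_quote = bool(re.fullmatch(r"\'", "'"))
re_backslash = bool(re.fullmatch(r'\\', '\\'))

# Assumption: on Python 3.6+ an unknown letter escape is a pattern error.
try:
    re.compile(r'\q')
    unknown_escape_accepted = True
except re.error:
    unknown_escape_accepted = False
```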

Ron



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread M.-A. Lemburg
On 2007-05-11 07:52, Martin v. Löwis wrote:
 This is what prompted my question, actually: in Py3k, in the
 str/unicode unification branch, r"\u1234" changes meaning: before the
 unification, this was an 8-bit string, where the \u was not special,
 but now it is a unicode string, where \u *is* special.
 
 That is true for non-raw strings also: the meaning of \u1234 also
 changes.
 
 However, traditionally, there was *no* escaping mechanism in raw strings
 in Python, and I feel that this is a good principle, because it is
 easy to learn (if you leave out the detail that \ can't be the last
 character in a raw string - which should get fixed also, IMO). So I
 think in Py3k, \u1234 should continue to be a string with 6
 characters. Otherwise, people will complain that
 os.stat(r"c:\windows\system32\user32.dll") fails. Telling them to write
 os.stat(r"c:\windows\system32\u005Cuser32.dll") will just cause puzzled
 faces.

Using double backslashes won't cause that reaction:

os.stat("c:\\windows\\system32\\user32.dll")

Also note that Windows is smart enough nowadays to parse
the good old Unix forward slash:

os.stat("c:/windows/system32/user32.dll")
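
The three spellings can be compared without touching the file system. A sketch assuming Python 3, where \u is no longer special inside a raw string; ntpath is the standard-library module implementing Windows path rules and can be imported on any platform.

```python
import ntpath  # Windows path semantics, usable on Unix as well

raw = r'c:\windows\system32\user32.dll'
doubled = 'c:\\windows\\system32\\user32.dll'
forward = 'c:/windows/system32/user32.dll'

# The raw and the doubled-backslash literal denote the same string.
same_string = (raw == doubled)

# Normalising the forward-slash form yields the backslash form.
normalised = ntpath.normpath(forward)
```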

 Windows path names are one of the two primary applications of raw
 strings (the other being regexes).

IMHO the primary use case is regexps and for those you'd
definitely want to be able to put Unicode characters into your
expressions.

BTW, if you use ur"..." for your expressions today (which you should
if you parse text), then nothing will change when removing the
'u' prefix in Py3k.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread Georg Brandl
M.-A. Lemburg schrieb:

 Windows path names are one of the two primary applications of raw
 strings (the other being regexes).
 
 IMHO the primary use case are regexps and for those you'd
 definitely want to be able to put Unicode characters into your
 expressions.

Except if sre_parse would recognize \u and \U escapes, just like it
does now with \x escapes.

Georg



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread M.-A. Lemburg
On 2007-05-11 13:05, Thomas Heller wrote:
 M.-A. Lemburg schrieb:
 On 2007-05-11 07:52, Martin v. Löwis wrote:
 This is what prompted my question, actually: in Py3k, in the
 str/unicode unification branch, r"\u1234" changes meaning: before the
 unification, this was an 8-bit string, where the \u was not special,
 but now it is a unicode string, where \u *is* special.
 That is true for non-raw strings also: the meaning of \u1234 also
 changes.

 However, traditionally, there was *no* escaping mechanism in raw strings
 in Python, and I feel that this is a good principle, because it is
 easy to learn (if you leave out the detail that \ can't be the last
 character in a raw string - which should get fixed also, IMO). So I
 think in Py3k, \u1234 should continue to be a string with 6
 characters. Otherwise, people will complain that
 os.stat(r"c:\windows\system32\user32.dll") fails. Telling them to write
 os.stat(r"c:\windows\system32\u005Cuser32.dll") will just cause puzzled
 faces.
 Using double backslashes won't cause that reaction:

 os.stat("c:\\windows\\system32\\user32.dll")
 
 Sure.  But I want to use raw strings for Windows path names; it's much easier
 to type.

But think of the price to pay if we disable use of Unicode
escapes in raw strings. And all of this just because of the
one special case: having a file name that starts with a U
and needs to be referenced literally in a Python application
together with a path leading up to it.

BTW, there's an easy work-around for this special case:

os.stat(os.path.join(r"c:\windows\system32", "user32.dll"))
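
A sketch of why the work-around sidesteps the problem: the file name that begins with u is never preceded by a backslash in the source, so no \u sequence appears in any literal. ntpath is used here in place of os.path only so the example behaves the same on every platform.

```python
import ntpath  # os.path on Windows; importable everywhere

directory = r'c:\windows\system32'   # raw literal without any \u sequence
filename = 'user32.dll'              # starts with u, but has no backslash

# join() supplies the separating backslash at run time.
full_path = ntpath.join(directory, filename)
```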

 Also note that Windows is smart enough nowadays to parse
 the good old Unix forward slash:

 os.stat("c:/windows/system32/user32.dll")
 
 In my opinion this is a Windows bug and not a feature.  Especially because
 there are Windows API functions (the shell functions, IIRC) that do NOT
 accept forward slashes.
 
 Would you say that *nix is dumb because it doesn't parse \\usr\\include?

Sorry, I wasn't trying to imply that Windows is/was a dumb system.

I think it's nice that you can use forward slashes on Windows -
makes writing code that works in both worlds (Unix and Windows)
a lot easier.

 Windows path names are one of the two primary applications of raw
 strings (the other being regexes).

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread Thomas Heller
M.-A. Lemburg schrieb:
 On 2007-05-11 07:52, Martin v. Löwis wrote:
 This is what prompted my question, actually: in Py3k, in the
 str/unicode unification branch, r"\u1234" changes meaning: before the
 unification, this was an 8-bit string, where the \u was not special,
 but now it is a unicode string, where \u *is* special.
 
 That is true for non-raw strings also: the meaning of \u1234 also
 changes.
 
 However, traditionally, there was *no* escaping mechanism in raw strings
 in Python, and I feel that this is a good principle, because it is
 easy to learn (if you leave out the detail that \ can't be the last
 character in a raw string - which should get fixed also, IMO). So I
 think in Py3k, \u1234 should continue to be a string with 6
 characters. Otherwise, people will complain that
 os.stat(r"c:\windows\system32\user32.dll") fails. Telling them to write
 os.stat(r"c:\windows\system32\u005Cuser32.dll") will just cause puzzled
 faces.
 
 Using double backslashes won't cause that reaction:
 
 os.stat("c:\\windows\\system32\\user32.dll")

Sure.  But I want to use raw strings for Windows path names; it's much easier
to type.

 Also note that Windows is smart enough nowadays to parse
 the good old Unix forward slash:
 
 os.stat("c:/windows/system32/user32.dll")

In my opinion this is a Windows bug and not a feature.  Especially because
there are Windows API functions (the shell functions, IIRC) that do NOT
accept forward slashes.

Would you say that *nix is dumb because it doesn't parse \\usr\\include?

 Windows path names are one of the two primary applications of raw
 strings (the other being regexes).
 

Thomas



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread Guido van Rossum
On 5/10/07, Martin v. Löwis wrote:
 Windows path names are one of the two primary applications of raw
 strings (the other being regexes).

I disagree with this use case; the r"..." notation was not invented
for this purpose. I won't compromise the escaping of quotes to
accommodate it. Nevertheless I think that \u and \U should lose their
special-ness in 3.0.

I'd like to hear from anyone who has access to *real code* that uses
\u or \U in a raw unicode string.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread David Goodger
Guido van Rossum guido at python.org writes:
 I'd like to hear from anyone who has access to *real code* that uses
 \u or \U in a raw unicode string.

Docutils uses it in the docutils.parsers.rst.states module, Body class:

patterns = {
  'bullet': ur'[-+*\u2022\u2023\u2043]( +|$)',
...

attribution_pattern = re.compile(ur'(---?(?!-)|\u2014) *(?=[^ \n])')

-- David Goodger http://python.net/~goodger



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread Guido van Rossum
On 5/11/07, David Goodger wrote:
 Guido van Rossum guido at python.org writes:
  I'd like to hear from anyone who has access to *real code* that uses
  \u or \U in a raw unicode string.

 Docutils uses it in the docutils.parsers.rst.states module, Body class:

 patterns = {
   'bullet': ur'[-+*\u2022\u2023\u2043]( +|$)',
 ...

 attribution_pattern = re.compile(ur'(---?(?!-)|\u2014) *(?=[^ \n])')

But wouldn't it be just as handy to teach the re module about \u and
\U, just as it already knows about \x (and \123 octals)?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread David Goodger
 Guido van Rossum guido at python.org writes:
  I'd like to hear from anyone who has access to *real code* that uses
  \u or \U in a raw unicode string.

David Goodger goodger at python.org writes:
 Docutils uses it in the docutils.parsers.rst.states module, Body class:
 
 patterns = {
   'bullet': ur'[-+*\u2022\u2023\u2043]( +|$)',
 ...
 
 attribution_pattern = re.compile(ur'(---?(?!-)|\u2014) *(?=[^ \n])')

Although admittedly, these don't *have* to be raw strings, since they don't
contain backslashes as regexp syntax.  They were made raw by reflex, because
they contain regular expressions.
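
That the raw prefix is optional for these patterns can be sketched as below, assuming Python 3.3 or later, where re itself understands \u (as far as I know). In the raw form re decodes the escapes; in the plain form the string literal does. The test lines are invented.

```python
import re

# Same pattern, once as a raw literal and once as a plain literal.
raw_bullet = re.compile(r'[-+*\u2022\u2023\u2043]( +|$)')
plain_bullet = re.compile('[-+*\u2022\u2023\u2043]( +|$)')

# Hypothetical input lines: a bullet character, a hyphen, a bare star,
# and a line that is not a bullet item at all.
samples = ['\u2022 item', '- item', '*', 'item']

raw_results = [bool(raw_bullet.match(line)) for line in samples]
plain_results = [bool(plain_bullet.match(line)) for line in samples]
```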

-- DG



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread David Goodger
 On 5/11/07, David Goodger wrote:
  Docutils uses it in the docutils.parsers.rst.states module, Body class:
 
  patterns = {
'bullet': ur'[-+*\u2022\u2023\u2043]( +|$)',
  ...
 
  attribution_pattern = re.compile(ur'(---?(?!-)|\u2014) *(?=[^ \n])')

On 5/11/07, Guido van Rossum wrote:
 But wouldn't it be just as handy to teach the re module about \u and
 \U, just as it already knows about \x (and \123 octals)?

Could be. I'm just providing examples, as requested.
I leave the heavy thinking to others ;-)

-- 
David Goodger http://python.net/~goodger


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread Michael Foord
Martin v. Löwis wrote:
 This is what prompted my question, actually: in Py3k, in the
 str/unicode unification branch, r"\u1234" changes meaning: before the
 unification, this was an 8-bit string, where the \u was not special,
 but now it is a unicode string, where \u *is* special.
 

 That is true for non-raw strings also: the meaning of \u1234 also
 changes.

 However, traditionally, there was *no* escaping mechanism in raw strings
 in Python, and I feel that this is a good principle, because it is
 easy to learn (if you leave out the detail that \ can't be the last
 character in a raw string - which should get fixed also, IMO).
+1

Michael Foord


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread Martin v. Löwis
 Using double backslashes won't cause that reaction:
 
 os.stat("c:\\windows\\system32\\user32.dll")

Please refer to the subject. We are talking about raw strings.

 Windows path names are one of the two primary applications of raw
 strings (the other being regexes).
 
 IMHO the primary use case are regexps

It's not a matter of opinion. It's a statistical fact that these
are the two cases where people use raw strings most.

 and for those you'd
 definitely want to be able to put Unicode characters into your
 expressions.

For regular expressions, you don't need them as part of the
string literal syntax: The re parser itself could support \u,
just like it supports \x today.

 BTW, if you use ur"..." for your expressions today (which you should
 if you parse text), then nothing will change when removing the
 'u' prefix in Py3k.

How do you know? Py3k hasn't been released, yet.

Regards,
Martin



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread Martin v. Löwis
 BTW, there's an easy work-around for this special case:
 
 os.stat(os.path.join(r"c:\windows\system32", "user32.dll"))

No matter what the decision is, there are always work-arounds.
The question is what language suits the users most. Being
able to specify characters by ordinal IMO has much less value
than a consistent, concise definition of raw strings has.

 I think it's nice that you can use forward slashes on Windows -
 makes writing code that works in both worlds (Unix and Windows)
 a lot easier.

But, as Thomas says: you can't. You may be able to do so
when using the API directly, however, it fails if you
pass the file name in a command line of some tool that
takes /foo to mean a command line option foo.

Regards.
Martin



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread Guido van Rossum
I think I'm going to break my own rules and ask Martin to write up a
PEP. Given the pragmatics that Windows pathnames *are* a common use
case, I'm willing to allow the trailing \ in the string. A regular
expression containing a quote could be written using triple quotes,
e.g. r"""(['"])[^'"]*\1""". (A single " in a regular expression can
always be rewritten as ["] AFAIK.)
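
A quick sketch of the triple-quoted form, assuming the pattern is meant to match a string delimited by a matching pair of single or double quotes; the sample inputs are invented.

```python
import re

# Triple quotes let both quote characters appear unescaped in the pattern.
quoted = re.compile(r"""(['"])[^'"]*\1""")

double_ok = bool(quoted.fullmatch('"hello"'))
single_ok = bool(quoted.fullmatch("'hello'"))
mismatched = bool(quoted.fullmatch('"hello\''))   # opens with ", closes with '

# A lone double quote can also be written as a one-character class.
class_form = bool(re.fullmatch(r'["]', '"'))
```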

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-11 Thread Andrew McNabb
On Sat, May 12, 2007 at 01:30:52AM +0200, M.-A. Lemburg wrote:
 
 I wonder how we managed to survive all these years with
 the existing consistent and concise definition of the
 raw-unicode-escape codec ;-)
 
 There are two options:
 
  * no one really uses Unicode raw strings nowadays
 
  * none of the existing users has ever stumbled across the
problem case that triggered all this
 
 Both ways, we're discussing a non-issue.


Sure, it's a non-issue for Python 2.x.  However, when Python 3 comes
along, and all strings are Unicode, there will likely be a lot more
users stumbling into the problem case.

-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868




[Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread Guido van Rossum
I just discovered that, in all versions of Python as far back as I
have access to (2.0), \u escapes are interpreted inside raw
unicode strings. Thus:

>>> a = ur"\u1234"
>>> len(a)
1

Contrast this with:

>>> a = ur"\x12"
>>> len(a)
4


The \U escape has the same behavior, in versions that support it.
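
For comparison, a sketch of the same checks on Python 3, where the ur prefix no longer exists and raw strings leave \u and \U untouched. This reflects the later outcome of the discussion, not the 2.x behaviour shown in the session above.

```python
# Raw literals: every backslash sequence stays as written.
raw_u = len(r'\u1234')        # backslash, u, and four digits
raw_x = len(r'\x12')          # backslash, x, and two digits
raw_big_u = len(r'\U00012345')  # backslash, U, and eight digits

# Non-raw literal: the escape is interpreted by the string literal syntax.
plain_u = len('\u1234')
```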

Does anyone remember why it is done this way? The reference manual
describes this behavior, but doesn't give an explanation:


When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U'
prefix, then the \u and \U escape sequences are processed
while all other backslashes are left in the string. For example, the
string literal ur"\u0062\n" consists of three Unicode characters:
`LATIN SMALL LETTER B', `REVERSE SOLIDUS', and `LATIN SMALL LETTER N'.
Backslashes can be escaped with a preceding backslash; however, both
remain in the string. As a result, \u escape sequences are only
recognized when there are an odd number of backslashes.
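
The rule quoted above can still be tried out in a current Python 3
interpreter, where the ur prefix no longer exists, by decoding ASCII
bytes with the raw_unicode_escape codec. This is only a sketch: it
assumes the codec still follows the rule from the reference manual, and
the helper name raw_unicode is made up for illustration.

```python
def raw_unicode(source_bytes):
    """Decode bytes roughly the way the body of a Python 2 ur"..." literal was read."""
    return source_bytes.decode("raw_unicode_escape")


one_char = raw_unicode(rb"\u1234")           # the \u escape is processed
four_chars = raw_unicode(rb"\x12")           # \x is left alone: \, x, 1, 2
manual_example = raw_unicode(rb"\u0062\n")   # b, backslash, n
even_backslashes = raw_unicode(rb"\\u1234")  # two backslashes: \u is not processed

print(len(one_char), len(four_chars), len(manual_example), len(even_backslashes))
```

The last case is the "odd number of backslashes" clause: with two
backslashes in front, both stay in the string and the u1234 that follows
is kept as ordinary text.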


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread Paul Moore
On 10/05/07, Guido van Rossum [EMAIL PROTECTED] wrote:
 I just discovered that, in all versions of Python as far back as I
 have access to (2.0), \u escapes are interpreted inside raw
 unicode strings. Thus:
[...]
 Does anyone remember why it is done this way? The reference manual
 describes this behavior, but doesn't give an explanation:

My memory is so dim as to be more speculation than anything else, but
I suspect it's simply because there's no other way of including
characters outside the ASCII range in a raw string.

Paul.


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread M.-A. Lemburg
On 2007-05-10 20:53, Paul Moore wrote:
 On 10/05/07, Guido van Rossum [EMAIL PROTECTED] wrote:
 I just discovered that, in all versions of Python as far back as I
 have access to (2.0), \u escapes are interpreted inside raw
 unicode strings. Thus:
 [...]
 Does anyone remember why it is done this way? The reference manual
 describes this behavior, but doesn't give an explanation:
 
 My memory is so dim as to be more speculation than anything else, but
 I suspect it's simply because there's no other way of including
 characters outside the ASCII range in a raw string.

This is by design (see PEP 100) and was done for the reason given
by Paul. The motivation for the chosen approach was to make Python's
raw Unicode strings compatible with Java's raw Unicode strings:

http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread Guido van Rossum
On 5/10/07, M.-A. Lemburg [EMAIL PROTECTED] wrote:
 On 2007-05-10 20:53, Paul Moore wrote:
  On 10/05/07, Guido van Rossum [EMAIL PROTECTED] wrote:
  I just discovered that, in all versions of Python as far back as I
  have access to (2.0), \u escapes are interpreted inside raw
  unicode strings. Thus:
  [...]
  Does anyone remember why it is done this way? The reference manual
  describes this behavior, but doesn't give an explanation:
 
  My memory is so dim as to be more speculation than anything else, but
  I suspect it's simply because there's no other way of including
  characters outside the ASCII range in a raw string.

 This is by design (see PEP 100) and was done for the reason given
 by Paul. The motivation for the chosen approach was to make Python's
 raw Unicode strings compatible with Java's raw Unicode strings:

 http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html

I'm not sure what Java compatibility buys us. It is also far from
perfect -- IIUC, in Java if you write \u0022 (that's the " character)
it counts as an opening or closing quote, and if you write \u005c (a
backslash) it can be used to escape the following character. OTOH, in
Python, you can write ur"C:\Program Files\u005c" and voila, a raw
string terminating in a backslash. (In Java this would escape the "
instead.)

However, I understand the other reason (inclusion of non-ASCII
characters in raw strings) and I reluctantly agree with it.
Reluctantly, because it means I can't create a raw string containing a
\ followed by u or U -- I needed one of those today.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread M.-A. Lemburg
On 2007-05-11 00:11, Guido van Rossum wrote:
 On 5/10/07, M.-A. Lemburg [EMAIL PROTECTED] wrote:
 On 2007-05-10 20:53, Paul Moore wrote:
 On 10/05/07, Guido van Rossum [EMAIL PROTECTED] wrote:
 I just discovered that, in all versions of Python as far back as I
 have access to (2.0), \u escapes are interpreted inside raw
 unicode strings. Thus:
 [...]
 Does anyone remember why it is done this way? The reference manual
 describes this behavior, but doesn't give an explanation:
 My memory is so dim as to be more speculation than anything else, but
 I suspect it's simply because there's no other way of including
 characters outside the ASCII range in a raw string.
 This is by design (see PEP 100) and was done for the reason given
 by Paul. The motivation for the chosen approach was to make Python's
 raw Unicode strings compatible with Java's raw Unicode strings:

 http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html
 
 I'm not sure what Java compatibility buys us. It is also far from
 perfect -- IIUC, in Java if you write \u0022 (that's the " character)
 it counts as an opening or closing quote, and if you write \u005c (a
 backslash) it can be used to escape the following character. OTOH, in
 Python, you can write ur"C:\Program Files\u005c" and voila, a raw
 string terminating in a backslash. (In Java this would escape the "
 instead.)

http://mail.python.org/pipermail/python-dev/1999-November/001346.html
http://mail.python.org/pipermail/python-dev/1999-November/001392.html
and all the other postings in that month related to this.

 However, I understand the other reason (inclusion of non-ASCII
 characters in raw strings) and I reluctantly agree with it.
 Reluctantly, because it means I can't create a raw string containing a
 \ followed by u or U -- I needed one of those today.

>>> print ur"\u005cu"
\u
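
For readers without a Python 2 interpreter, this trick and the earlier
C:\Program Files example can be reproduced in Python 3 with the
raw_unicode_escape codec, assuming the codec still matches the old
ur"..." behaviour. The bytes literals below only stand in for the bodies
of the ur"..." literals. Both results rely on \u005c decoding to a
backslash that is not scanned again for escapes.

```python
# Python 3 sketch: decode ASCII bytes the way a ur"..." literal body was read.
backslash_u = rb"\u005cu".decode("raw_unicode_escape")
path = rb"C:\Program Files\u005c".decode("raw_unicode_escape")

print(backslash_u)  # a backslash followed by the letter u
print(path)         # a path that ends in a single backslash
```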

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread Martin v. Löwis
 However, I understand the other reason (inclusion of non-ASCII
 characters in raw strings) and I reluctantly agree with it.

I actually disagree with that. It is fairly easy to include non-ASCII
characters in a raw Unicode string - just type them in. Or, if that
fails, use string concatenation with a non-raw string:

r"foo\uhallo" "\u20ac" r"welt"
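
A sketch of this concatenation approach in Python 3 syntax, where plain
literals are already Unicode and no u prefix is needed. Adjacent string
literals are joined at compile time, so only the middle, non-raw piece
has its \u escape processed. The variable name text is just for
illustration.

```python
# Raw piece, non-raw piece with an escape, raw piece: joined into one string.
text = r"foo\uhallo" "\u20ac" r"welt"

print(len(text))             # 10 + 1 + 4 characters
print(text.count("\u20ac"))  # exactly one euro sign
```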

Regards,
Martin



Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread Guido van Rossum
On 5/10/07, Martin v. Löwis [EMAIL PROTECTED] wrote:
  However, I understand the other reason (inclusion of non-ASCII
  characters in raw strings) and I reluctantly agree with it.

 I actually disagree with that. It is fairly easy to include non-ASCII
 characters in a raw Unicode string - just type them in.

That violates the convention used in many places that source code
should only contain printable ASCII, and all non-ASCII or unprintable
characters should be written using \x or \u escapes.

 Or, if that
 fails, use string concatenation with a non-raw string:

 r"foo\uhallo" "\u20ac" r"welt"

That makes for pretty unreadable source code though.

Looking for a third opinion,

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread Martin v. Löwis
 I actually disagree with that. It is fairly easy to include non-ASCII
 characters in a raw Unicode string - just type them in.
 
 That violates the convention used in many places that source code
 should only contain printable ASCII, and all non-ASCII or unprintable
 characters should be written using \x or \u escapes.

Following that convention: How do you get a non-ASCII byte into
a raw byte string in Python 2.x?

You can't - so why should you be able to get a non-ASCII character
into a raw Unicode string?
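
The byte-string half of this comparison can be checked in Python 3 with
bytes literals, which in this respect behave like the Python 2 str
literals under discussion. A minimal sketch; 0xE9 is an arbitrary
non-ASCII byte value chosen for the example.

```python
raw_bytes = rb"\xe9"    # raw: four ASCII bytes (backslash, x, e, 9)
cooked_bytes = b"\xe9"  # non-raw: the single byte 0xE9

print(len(raw_bytes), len(cooked_bytes))
```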

Regards,
Martin


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread Guido van Rossum
On 5/10/07, Martin v. Löwis [EMAIL PROTECTED] wrote:
  I actually disagree with that. It is fairly easy to include non-ASCII
  characters in a raw Unicode string - just type them in.
 
  That violates the convention used in many places that source code
  should only contain printable ASCII, and all non-ASCII or unprintable
  characters should be written using \x or \u escapes.

 Following that convention: How do you get a non-ASCII byte into
 a raw byte string in Python 2.x?

 You can't - so why should you be able to get a non-ASCII character
 into a raw Unicode string?

Fair enough.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread Greg Ewing
Martin v. Löwis wrote:
 why should you be able to get a non-ASCII character
 into a raw Unicode string?

The analogous question would be "why can't you get a
non-Unicode character into a raw Unicode string". That
wouldn't make sense, since Unicode strings can't even
hold non-Unicode characters (or at least they're not
meant to).

But it doesn't seem unreasonable to want to put
Unicode characters into a raw Unicode string. After
all, if it only contains ASCII characters there's
no need for it to be a Unicode string in the first
place.

--
Greg


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread Guido van Rossum
On 5/10/07, Greg Ewing [EMAIL PROTECTED] wrote:
 Martin v. Löwis wrote:
  why should you be able to get a non-ASCII character
  into a raw Unicode string?

 The analogous question would be "why can't you get a
 non-Unicode character into a raw Unicode string". That
 wouldn't make sense, since Unicode strings can't even
 hold non-Unicode characters (or at least they're not
 meant to).

 But it doesn't seem unreasonable to want to put
 Unicode characters into a raw Unicode string. After
 all, if it only contains ASCII characters there's
 no need for it to be a Unicode string in the first
 place.

This is what prompted my question, actually: in Py3k, in the
str/unicode unification branch, r"\u1234" changes meaning: before the
unification, this was an 8-bit string, where the \u was not special,
but now it is a unicode string, where \u *is* special.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread Martin v. Löwis
Greg Ewing schrieb:
 Martin v. Löwis wrote:
 why should you be able to get a non-ASCII character
 into a raw Unicode string?
 
 The analogous question would be "why can't you get a
 non-Unicode character into a raw Unicode string".

No, that would not be analogous. The string type in Python
is not an ASCII string type, but a byte string type. It
does not necessarily only hold ASCII characters, but
can (and, in hundreds of applications, does) hold arbitrary
bytes. There is (in the non-raw form) support for putting
arbitrary bytes into a byte string literal.

So no, this is not analogous.

Regards,
Martin


Re: [Python-Dev] \u and \U escapes in raw unicode string literals

2007-05-10 Thread Martin v. Löwis
 This is what prompted my question, actually: in Py3k, in the
 str/unicode unification branch, r"\u1234" changes meaning: before the
 unification, this was an 8-bit string, where the \u was not special,
 but now it is a unicode string, where \u *is* special.

That is true for non-raw strings also: the meaning of "\u1234" also
changes.

However, traditionally, there was *no* escaping mechanism in raw strings
in Python, and I feel that this is a good principle, because it is
easy to learn (if you leave out the detail that \ can't be the last
character in a raw string - which should get fixed also, IMO). So I
think in Py3k, "\u1234" should continue to be a string with 6
characters. Otherwise, people will complain that
os.stat(r"c:\windows\system32\user32.dll") fails. Telling them to write
os.stat(r"c:\windows\system32\u005Cuser32.dll") will just cause puzzled
faces.

Windows path names are one of the two primary applications of raw
strings (the other being regexes).
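
For comparison, current Python 3 releases behave the way argued for
here: a raw string literal leaves \u alone. A minimal check, using the
path from this message purely as text (nothing is looked up on disk, and
the variable names are only for illustration):

```python
six_chars = r"\u1234"                         # backslash, u, 1, 2, 3, 4
dll_path = r"c:\windows\system32\user32.dll"  # \u in \user32 is not an escape

print(len(six_chars))
print(dll_path.split("\\")[-1])               # last path component
```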

Regards,
Martin
